2018-08-08

2018-08-08 Scrapy Spider Demo: Extracting Image URLs from a Web Page

Introduction to Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Installing Scrapy

https://doc.scrapy.org/en/latest/intro/overview.html

Creating and Modifying the Spider

Download the official Scrapy demo

$ git clone https://github.com/scrapy/quotesbot
$ cd quotesbot

Replace the contents of quotesbot/spiders/toscrape-css.py with the following:

# -*- coding: utf-8 -*-
# Usage: scrapy crawl toscrape-css -o xxx.csv
# '//img//@data-src' matches images like <img data-src='http://xxxx/images/2018.jpg'> on the pages in start_urls
# '//input//@data-src' matches images like <input data-src='http://xxxx/images/2018.jpg' type='image'> on the pages in start_urls
# Because of anti-scraping measures, it is better to extract the URLs first and then batch-download them with Xunlei (Thunder)
import os
import urllib.request

import scrapy


class ToScrapeCSSSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'https://xxx/htm_data/x/2018/xxxxxx.html',
    ]

    def parse(self, response):
        for quote in response.xpath('//input//@data-src').extract():
            print(quote)

            # Needed when exporting to a file with the -o option
            # yield {
            #     'url': quote
            # }

            # Download the image directly to a local folder (may fail on sites with strict anti-scraping measures).
            # Because of that, it is better to extract the URLs first and then batch-download them with Xunlei.
            # if quote:
            #     filename = os.path.basename(quote)
            #     print(filename)
            #     filepath = os.path.join("/Users/xxx/Desktop", filename)
            #     urllib.request.urlretrieve(quote, filepath)

            # Commented out; enable when needed
            # yield scrapy.Request(quote)
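
The commented-out download step in the spider takes the file name from the image URL and saves the image under a local folder. Here is a minimal standalone sketch of that step for Python 3. The helper names local_path_for and download_image are made up for illustration and are not part of Scrapy or quotesbot. Unlike the spider code, this variant also drops any query string from the URL before taking the file name, which is an assumption about what the image URLs may look like.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def local_path_for(url, dest_dir):
    """Return the path under dest_dir where the image at url would be saved."""
    # Use only the path component so a query string does not end up in the file name.
    filename = os.path.basename(urlsplit(url).path)
    return os.path.join(dest_dir, filename)


def download_image(url, dest_dir):
    """Download url into dest_dir and return the saved path (performs network I/O)."""
    filepath = local_path_for(url, dest_dir)
    urllib.request.urlretrieve(url, filepath)
    return filepath
```

Inside parse, the download would then be a single call such as download_image(quote, "/Users/xxx/Desktop"). As noted in the spider comments, sites with strict anti-scraping measures may still reject these requests.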

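Usage

Usage: scrapy crawl toscrape-css -o xxx.csv

The -o option writes only the items the spider yields, so uncomment the yield {'url': quote} block in parse before exporting. Otherwise the URLs are only printed to the console.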
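
If the goal is to hand the extracted addresses to a download manager, the exported CSV can be reduced to a plain list of URLs. The sketch below assumes the yield {'url': quote} block in the spider has been uncommented, so that xxx.csv has a url column. The function name csv_to_url_list and the output file name urls.txt are made up for illustration.

```python
import csv


def csv_to_url_list(csv_path, txt_path, column="url"):
    """Copy the unique, non-empty values of `column` from csv_path to txt_path, one per line."""
    seen = set()
    urls = []
    with open(csv_path, newline="", encoding="utf-8") as src:
        for row in csv.DictReader(src):
            url = (row.get(column) or "").strip()
            # Keep the first occurrence of each URL and preserve the crawl order.
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    with open(txt_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(urls) + ("\n" if urls else ""))
    return urls
```

For example, csv_to_url_list("xxx.csv", "urls.txt") writes one URL per line to urls.txt and returns the same list.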

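Notes

1. Because of anti-scraping measures, it is better to extract the URLs first and then batch-download them all with Xunlei (Thunder).
2. Scrapy reports "Forbidden by robots.txt".
Scrapy skips pages disallowed by robots.txt while its built-in ROBOTSTXT_OBEY option is enabled. To turn it off, find this variable in settings.py and set it to False:
ROBOTSTXT_OBEY = False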