2020-02-13 Scrapy Crawler Demo: Scraping the Publisher's Textbooks for the 2020 Spring Semester

(abridged)

Background and Motivation

http://xxx.com/jc/
The publisher has put the textbooks used in the 2020 spring semester online at the link above so that students can study at home on their own. Downloading them makes them available locally at any time.

About Scrapy

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Installing Scrapy

https://doc.scrapy.org/en/latest/intro/overview.html
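
The official documentation linked above covers platform-specific details; in a typical Python environment the install itself is just one pip command:

pip install scrapy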

Creating the Spider

Create the project:
scrapy startproject jiaocai
Generate the spider:
cd jiaocai
scrapy genspider jiaocai2020 http://xxx.com/jc
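
After these two commands, the project layout is the standard one produced by scrapy startproject and looks roughly like this (the spider edited below lives under jiaocai/spiders/):

jiaocai/
    scrapy.cfg
    jiaocai/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            jiaocai2020.py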

Edit jiaocai/spiders/jiaocai2020.py so that it contains the following:

# -*- coding: utf-8 -*-
# Usage: scrapy crawl jiaocai2020 -o xxx.csv
# '//img//@data-src' matches images like <img data-src='http://xxxx/images/2018.jpg'> in start_urls
# '//input//@data-src' matches images like <input data-src='http://xxxx/images/2018.jpg' type='image'> in start_urls
# to get around anti-crawling measures, it is recommended to collect the URLs first and then batch-download them all with Thunder (Xunlei)

import scrapy  
from scrapy import Request  

class Jiaocai2020Spider(scrapy.Spider):  
    name = 'jiaocai2020'  
    # allowed_domains = ['http://xxx.com']  
    start_urls = ['http://xxx.com/jc/']  

    def parse(self, response):  
        print (response.url)  
        # for quote1 in response.xpath('//a/@href').extract():  
        # <li class="fl"><a href="./ywjygjkcjc/xdjc/" target="_blank">教科书</a></li>  
        for quote1 in response.xpath('//li[@class="fl"]//a/@href').extract():  
            # href="./ywjygjkcjc/xxdfjsys/"  
            ur11 = "http://xxx.com/jc/" + quote1[2:]  
            print (ur11)  
            # 'http://xxx.com/jc/zzwhjc/zzjyghjsys/',  
            yield scrapy.Request(ur11, callback=self.calbak, meta=None)  
            # yield scrapy.Request(ur11, callback=self.calbak, meta={'ur11': ur11})  


    def calbak(self, response):  
        print (response.url)  
        # kkp = response.meta['ur11']  
        # print (kkp)  
        # if "error" in response.body:  
        #     self.logger.error("error1r")  
        #     return  
        # <a href="./202002/P020200210281142511137.pdf" target="_blank" title="下载" class="btn_type_dl">下载</a>  
        for quote2 in response.xpath('//a[@class="btn_type_dl"]/@href').extract():  
            ur12 = response.url + quote2[2:]  
            # http://xxx.com/jc/zzwhjc/zzjyghjsys/202002/P020200211707082058148.pdf  
            print (ur12)  
            # used when exporting items to a file with the -o option
            yield {  
                'url' : ur12  
            }  

            # (downloading the files straight into a local directory; with strict anti-crawling this may fail)
            # to get around anti-crawling measures, it is better to collect the URLs first and batch-download them with Thunder (Xunlei)
            # if ur12:
            #     filename = os.path.basename(ur12)              # requires: import os
            #     print(filename)
            #     filepath = os.path.join("/Users/xxx/Desktop", filename)
            #     urllib.request.urlretrieve(ur12, filepath)     # requires: import urllib.request

            # commented out; enable it when needed (ur12 is the absolute PDF URL built above)
            # yield scrapy.Request(ur12)
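
A side note on the URL handling: the response.url + quote2[2:] concatenation only works because every listing URL here ends with a trailing slash. A minimal sketch of the same callback using Scrapy's response.urljoin(), which resolves relative hrefs such as "./202002/xxx.pdf" against the page URL, would look like this:

    def calbak(self, response):
        # urljoin resolves the relative href against response.url, so no manual slicing of "./" is needed
        for quote2 in response.xpath('//a[@class="btn_type_dl"]/@href').extract():
            yield {'url': response.urljoin(quote2)}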

Usage

Run the spider and export the scraped URLs to a CSV file:

scrapy crawl jiaocai2020 -o xxx.csv
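
With -o, Scrapy's feed export writes each yielded item as one row, so given the yield {'url': ...} in the callback, xxx.csv should look roughly like this (the sample PDF URL is the one quoted in the spider's comments):

url
http://xxx.com/jc/zzwhjc/zzjyghjsys/202002/P020200211707082058148.pdf
...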

Notes

1. To get around anti-crawling measures, it is recommended to collect all the URLs first and then batch-download them with Thunder (Xunlei).
2. If the crawl fails with "Forbidden by robots.txt":
disable Scrapy's built-in robots.txt compliance by finding ROBOTSTXT_OBEY in settings.py and setting it to False:
ROBOTSTXT_OBEY = False
3. Other changes made to settings.py for this project:

USER_AGENT = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"  
ROBOTSTXT_OBEY = False  
CONCURRENT_REQUESTS = 1  
DOWNLOAD_DELAY = 1  
DEFAULT_REQUEST_HEADERS = {  
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',  
  'Accept-Language': 'en',
}  
AUTOTHROTTLE_START_DELAY = 5  
AUTOTHROTTLE_MAX_DELAY = 60  
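
Note that the AUTOTHROTTLE_* values above are only honored when the AutoThrottle extension is enabled, which it is not by default, so the settings file likely also needs:

AUTOTHROTTLE_ENABLED = True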

Create main.py for Automated Runs (unfinished)

Create a main.py so that the crawl can be started simply by running the script (e.g. with the Run button in an IDE). Put the file in the project root, i.e. the directory containing scrapy.cfg.
main.py:

from scrapy.cmdline import execute
import os, sys

# add the project root (the directory containing this file) to sys.path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# note: the argument is the spider name, not the project name
execute(['scrapy', 'crawl', 'jiaocai2020'])

# from scrapy import cmdline  
# cmdline.execute("scrapy crawl jiaocai2020".split())   
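
With main.py in place, the crawl can also be started from a terminal in the project root:

python main.py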

Reference Links

https://www.jianshu.com/p/8d353c7cf606
https://zhuanlan.zhihu.com/p/32458936