(Abridged)
Source and motivation
http://xxx.com/jc/
The publisher released the textbooks used in the 2020 spring semester so students could study at home on their own. I wanted to download them locally so they can be used at any time.
About Scrapy
Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
Installing Scrapy
https://doc.scrapy.org/en/latest/intro/overview.html
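Scrapy is typically installed with pip (this is the standard Scrapy install command, nothing specific to this project; a virtualenv or conda environment works equally well):
pip install scrapy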
Creating the spider
Create the project
scrapy startproject jiaocai
Generate the spider
cd jiaocai
scrapy genspider jiaocai2020 http://xxx.com/jc
Edit jiaocai/spiders/jiaocai2020.py so it looks like this:
# -*- coding: utf-8 -*-
# Usage: scrapy crawl jiaocai2020 -o xxx.csv
# '//img//@data-src'   would match images like <img data-src='http://xxxx/images/2018.jpg'> in start_urls
# '//input//@data-src' would match images like <input data-src='http://xxxx/images/2018.jpg' type='image'> in start_urls
# To avoid anti-crawling measures, it is recommended to export the URLs first and then batch-download them all with Thunder (Xunlei).
import scrapy
from scrapy import Request


class Jiaocai2020Spider(scrapy.Spider):
    name = 'jiaocai2020'
    # allowed_domains = ['http://xxx.com']
    start_urls = ['http://xxx.com/jc/']

    def parse(self, response):
        print(response.url)
        # for quote1 in response.xpath('//a/@href').extract():
        # Category links look like: <li class="fl"><a href="./ywjygjkcjc/xdjc/" target="_blank">教科书</a></li>
        for quote1 in response.xpath('//li[@class="fl"]//a/@href').extract():
            # href="./ywjygjkcjc/xxdfjsys/" -- strip the leading "./" and prepend the base URL
            ur11 = "http://xxx.com/jc/" + quote1[2:]
            print(ur11)
            # e.g. 'http://xxx.com/jc/zzwhjc/zzjyghjsys/'
            yield scrapy.Request(ur11, callback=self.calbak, meta=None)
            # yield scrapy.Request(ur11, callback=self.calbak, meta={'ur11': ur11})

    def calbak(self, response):
        print(response.url)
        # kkp = response.meta['ur11']
        # print(kkp)
        # if "error" in response.body:
        #     self.logger.error("error1r")
        #     return
        # Download links look like:
        # <a href="./202002/P020200210281142511137.pdf" target="_blank" title="下载" class="btn_type_dl">下载</a>
        for quote2 in response.xpath('//a[@class="btn_type_dl"]/@href').extract():
            ur12 = response.url + quote2[2:]
            # e.g. http://xxx.com/jc/zzwhjc/zzjyghjsys/202002/P020200211707082058148.pdf
            print(ur12)
            # picked up by the feed export when the -o option is used
            yield {
                'url': ur12
            }
            # (Downloading the files directly into the current directory may fail on sites with strict anti-crawling,
            # so export the URLs and batch-download them with Thunder instead.)
            # if quote2:
            #     filename = os.path.basename(quote2)
            #     print(filename)
            #     filepath = os.path.join("/Users/xxx/Desktop", filename)
            #     urllib.request.urlretrieve(quote2, filepath)
            # Commented out; enable when needed:
            # yield scrapy.Request(quote2)
Usage
scrapy crawl jiaocai2020 -o xxx.csv
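As an alternative to Thunder, the exported URL list can also be fetched with a small standalone script. The following is only a minimal sketch under a few assumptions: the feed was exported as xxx.csv, the column is named url (as yielded by the spider above), and the site does not block plain urllib requests. The file name download_pdfs.py is hypothetical.
# download_pdfs.py -- hypothetical helper, not part of the Scrapy project
# Reads the exported xxx.csv and downloads each file into the current directory.
import csv
import os
import time
import urllib.request

with open('xxx.csv', newline='') as f:
    for row in csv.DictReader(f):
        url = row['url']                    # column produced by the spider's yield {'url': ...}
        filename = os.path.basename(url)    # e.g. P020200211707082058148.pdf
        print(filename)
        urllib.request.urlretrieve(url, filename)
        time.sleep(1)                       # small delay; strict anti-crawling may still block this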
Notes
1. Because of anti-crawling concerns, it is recommended to export the URL list first and then batch-download all addresses with Thunder (Xunlei), or with a small script like the sketch above.
2. If the spider fails with "Forbidden by robots.txt":
Disable Scrapy's built-in robots.txt compliance by locating ROBOTSTXT_OBEY in settings.py and setting it to False:
ROBOTSTXT_OBEY = False
3. Other useful changes in settings.py:
USER_AGENT = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 1
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
}
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
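Note that the two AUTOTHROTTLE_* delays above only take effect if the AutoThrottle extension is turned on; it is disabled by default, so also add:
AUTOTHROTTLE_ENABLED = True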
Create main.py for automated runs (unfinished)
Create a main.py so the crawl can be started simply by running that file; place it in the project root (the directory that contains scrapy.cfg).
main.py:
import os, sys
from scrapy.cmdline import execute

# Make sure the project root is on the import path.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
# Note: the last argument is the spider name, not the project name.
execute(['scrapy', 'crawl', 'jiaocai2020'])
# Alternatively:
# from scrapy import cmdline
# cmdline.execute("scrapy crawl jiaocai2020".split())
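With this file in the project root, the crawl can then be started with:
python main.py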
References
https://www.jianshu.com/p/8d353c7cf606
https://zhuanlan.zhihu.com/p/32458936