Spider_man_UML
2018-07-20 09:53:07
spidermanuml
Outline / Content
Inheritance
MonitorMiddleware
RabbitMQSpider (spider class)
data
RabbitMQPipeline
CounterMiddleware
CookieMiddleware
Scrapy base spider class
+ name: string = None
+ custom_settings: dict = None
+ logger
+ log
+ from_crawler
+ set_crawler
+ _set_crawler
+ start_requests
+ make_requests_from_url
+ update_settings
+ handles_request
Targeted spiders include: app, forum, Zhihu, and self-media spiders, among others
BaseSpider
ProxyMiddleware
BaseItem
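+ uuid: platform-unique ID
+ type: type
+ title: title
+ url: access URL
+ content: content
+ pubTime: publication time
+ fetchTime: fetch time
+ origin: source information
+ emotion: sentiment (needed for data backfill)
+ kernelSentences: kernel class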
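The BaseItem fields above can be sketched as a plain Python dataclass. This is only an illustration: the field names come from the diagram, while the types, the defaults, and the use of a dataclass instead of a Scrapy item class are assumptions made so the sketch runs without Scrapy installed.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class BaseItem:
    # Field names follow the diagram; every type and default is an assumption.
    uuid: str                              # platform-unique ID
    type: str                              # item type, e.g. "news" or "post"
    title: str
    url: str                               # access URL
    content: str
    pubTime: Optional[datetime] = None     # publication time, if known
    fetchTime: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                                      # time the page was fetched
    origin: str = ""                       # source information
    emotion: Optional[str] = None          # sentiment, filled in by data backfill
    kernelSentences: List[str] = field(default_factory=list)


# Usage: build an item and turn it into a dict for a pipeline to consume.
item = BaseItem(
    uuid="1",
    type="news",
    title="Example title",
    url="https://example.com/a",
    content="Example body",
)
record = asdict(item)
```

Leaving emotion unset at crawl time matches the note that it is only needed for data backfill.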
Processes
Produces
BaseCrawlSpider
CommonNewsScannerSpider
+ name: String
+ custom_settings: Dict
+ rules_arr: Array
+ rules: Array
Middleware
Request
+ encoding: String = 'utf-8'
+ method: String = 'GET'
+ url: String
+ body: binary
+ callback: function
+ errback: function
+ cookies: Dict
+ headers: Dict
+ dont_filter: boolean = False
+ _meta: Dict
+ flags
+ meta: Dict(meta)
+ get_url: String(url)
+ _set_url(url)
+ _get_body: binary
+ _set_body(body)
+ encoding: String
+ __str__: String
+ copy
+ replace: class
Downloader
Concrete Item implementation classes
UserAgentMiddleware
item
RabbitMQMixin class
+ rabbitmq_key: dict = None
+ rabbitmq_batch_size: integer = None
+ server: String = None
+ start_requests
+ setup_rabbitmq(crawler)
+ next_requests
+ make_request_from_data(data): Request
+ make_requests_from_url: Request
+ schedule_next_request
+ spider_idle: DontCloseSpider
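One way to read the RabbitMQMixin methods above is as a batch pull: next_requests takes up to rabbitmq_batch_size messages from the queue and hands each to make_request_from_data. The sketch below illustrates that reading under several assumptions: InMemoryQueue is a hypothetical stand-in for a RabbitMQ connection, requests are plain dicts rather than Scrapy Request objects, and setup_rabbitmq takes the queue directly instead of a crawler.

```python
from collections import deque
from typing import Deque, Dict, Iterator, Optional


class InMemoryQueue:
    """Hypothetical stand-in for a RabbitMQ channel; not a real client."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def publish(self, body: str) -> None:
        self._messages.append(body)

    def get(self) -> Optional[str]:
        # Return the oldest message, or None when the queue is empty.
        return self._messages.popleft() if self._messages else None


class RabbitMQMixin:
    rabbitmq_batch_size = 2   # assumed value; the diagram defaults it to None
    server = None             # assigned in setup_rabbitmq

    def setup_rabbitmq(self, server: InMemoryQueue) -> None:
        # The diagram passes a crawler here; this sketch takes the queue itself.
        self.server = server

    def make_request_from_data(self, data: str) -> Dict[str, str]:
        # A dict stands in for a Request object.
        return {"url": data, "method": "GET"}

    def next_requests(self) -> Iterator[Dict[str, str]]:
        # Pull at most rabbitmq_batch_size messages, one request per message.
        for _ in range(self.rabbitmq_batch_size):
            data = self.server.get()
            if data is None:
                break
            yield self.make_request_from_data(data)


# Usage: three queued URLs are consumed in batches of two, then one.
queue = InMemoryQueue()
for url in ["http://a", "http://b", "http://c"]:
    queue.publish(url)
mixin = RabbitMQMixin()
mixin.setup_rabbitmq(queue)
first_batch = list(mixin.next_requests())
second_batch = list(mixin.next_requests())
```

In the diagram, spider_idle is annotated with DontCloseSpider, which suggests the spider stays alive and polls again when the queue is empty; that scheduling is left out of this sketch.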
CrawlSpider
+ rules: tuple
Targeted spider class
+ name: String = unique and non-empty
+ default_page_num: integer = 10
+ default_page_size: integer = 25
+ profile_api_url: String
+ custom_settings: dict
+ make_request_from_data(task): Dict(requests)
+ parse(response)
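As a rough illustration of how a targeted spider might turn a task into paginated requests, the sketch below uses the attribute names from the diagram. Everything else is assumed for the example: the task shape (a user_id key and an optional page_num), the reading of default_page_num as a page count, the query parameter names, and the placeholder URL. Requests are plain dicts returned in a list, not Scrapy objects.

```python
from typing import Dict, List
from urllib.parse import urlencode


class TargetedSpider:
    name = "example_targeted"        # hypothetical; must be unique and non-empty
    default_page_num = 10            # assumed to mean "pages to fetch per task"
    default_page_size = 25           # assumed to mean "records per page"
    profile_api_url = "https://example.com/api/profile"   # placeholder URL

    def make_request_from_data(self, task: Dict) -> List[Dict]:
        # Assumed task shape: {"user_id": ..., "page_num": optional int}.
        pages = task.get("page_num", self.default_page_num)
        requests = []
        for page in range(1, pages + 1):
            query = urlencode(
                {"id": task["user_id"], "page": page, "size": self.default_page_size}
            )
            requests.append(
                {"url": f"{self.profile_api_url}?{query}", "meta": {"task": task}}
            )
        return requests


# Usage: a task without page_num falls back to default_page_num pages.
reqs = TargetedSpider().make_request_from_data({"user_id": "42"})
```

Carrying the task in meta is one way to let parse(response) recover which task a response belongs to.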
BasePipeline
BaseMixin
+ retry_times: integer = 10
+ is_login_spider: boolean = False
RabbitMQCrawlSpider
Concrete items include: answer, dianping, news, post, comment
Request
DataBasePipeline