Data Sources for Site-Group Software: The Underlying Collection Scripts of the Q&A Platform Collector, Explained
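This Q&A platform aggregation collector has been in development since 2021. After several generations of updates it is now close to its ideal state: import a list of long-tail keywords and it quickly collects articles for them.
The rest of this post explains in detail how it was designed, starting from zero.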
3045.articles0 (Detached)
3033.send (Detached)
3018.chk_fanyi (Detached)
3004.fanyi (Detached)
2990.chk_fanyi_ok (Detached)
2975.trans (Detached)
2962.api2 (Detached)
2951.api (Detached)
2938.detail29 (Detached)
2928.detail28 (Detached)
2913.detail27 (Detached)
2897.detail26 (Detached)
2882.detail25 (Detached)
2871.detail24 (Detached)
2856.detail23 (Detached)
2845.detail22 (Detached)
2832.detail21 (Detached)
2819.detail20 (Detached)
2807.detail19 (Detached)
2789.detail18 (Detached)
2778.detail17 (Detached)
2765.detail16 (Detached)
2754.detail15 (Detached)
2741.detail14 (Detached)
2730.detail13 (Detached)
2718.detail12 (Detached)
2700.detail11 (Detached)
2687.detail10 (Detached)
2674.detail9 (Detached)
2661.detail8 (Detached)
2648.detail7 (Detached)
2637.detail6 (Detached)
2622.detail5 (Detached)
2607.detail4 (Detached)
2591.detail3 (Detached)
2577.detail2 (Detached)
2566.detail1 (Detached)
2553.detail (Detached)
2540.caiji29 (Detached)
2527.caiji28 (Detached)
2514.caiji27 (Detached)
2500.caiji26 (Detached)
2484.caiji25 (Detached)
2473.caiji24 (Detached)
2460.caiji23 (Detached)
2449.caiji22 (Detached)
2433.caiji21 (Detached)
2420.caiji20 (Detached)
2407.caiji19 (Detached)
2392.caiji18 (Detached)
2379.caiji17 (Detached)
2366.caiji16 (Detached)
2355.caiji15 (Detached)
2345.caiji14 (Detached)
2331.caiji13 (Detached)
2320.caiji12 (Detached)
2305.caiji11 (Detached)
2280.caiji10 (Detached)
2247.caiji9 (Detached)
2221.caiji8 (Detached)
2197.caiji7 (Detached)
2169.caiji6 (Detached)
2120.caiji5 (Detached)
2060.caiji4 (Detached)
1995.caiji3 (Detached)
1900.caiji2 (Detached)
1804.caiji1 (Detached)
1726.caiji (Detached)
1650.dr (Detached)
1548.kill (Detached)
1467.idsok (Detached)
1395.ids (Detached)
1275.chk_detail (Detached)
1152.chk_caiji (Detached)
1064.chk_article (Detached)
1017.article (Detached)
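All currently running tasks are made up of the processes listed above.
When a new collection task is created, the program automatically creates several tables.
Among them, t1_caiji_config stores the keywords to be collected.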
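To make the per-task tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. Only the table names t1_caiji_config and t1_articles come from this post. The database engine, the column names (keyword, status, title, body), and the helper functions create_task_tables, add_keywords and pending_keywords are all assumptions made for illustration, and the real program creates more tables than the two shown here.

```python
import sqlite3


def create_task_tables(conn, task_id):
    """Create the two per-task tables named in the post (hypothetical columns)."""
    prefix = f"t{int(task_id)}"  # int() keeps the interpolated table name safe
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {prefix}_caiji_config ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "keyword TEXT NOT NULL UNIQUE, "
        "status TEXT NOT NULL DEFAULT 'new')"  # assumed status flag
    )
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {prefix}_articles ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "keyword TEXT NOT NULL, "
        "title TEXT, "
        "body TEXT)"
    )
    conn.commit()
    return prefix


def add_keywords(conn, prefix, keywords):
    """Import long-tail keywords; duplicates are silently skipped."""
    conn.executemany(
        f"INSERT OR IGNORE INTO {prefix}_caiji_config (keyword) VALUES (?)",
        [(k,) for k in keywords],
    )
    conn.commit()


def pending_keywords(conn, prefix):
    """Return keywords that have not been collected yet, oldest first."""
    rows = conn.execute(
        f"SELECT keyword FROM {prefix}_caiji_config "
        "WHERE status = 'new' ORDER BY id"
    )
    return [row[0] for row in rows]
```

With this layout, a checker process only has to poll pending_keywords to find new work, which is roughly the role the post gives to chk_caiji.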
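The chk_caiji process watches for new data and sends it to the caiji1-29 daemon processes, which collect the search result list pages.
Once a list page has been collected, a first-pass filter picks out the URLs of article pages,
which are then sent on to the detail1-29 processes to collect the article body.
Collection also involves switching IPs and constantly checking progress, and in the end the pieces are combined into a single article.
The finished article is imported into t1_articles and finally sent to the article library of the site-group system.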
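The flow above can be sketched roughly as follows. This is a single-process illustration only: the real system runs separate daemon processes, and how it hands work between them is not described here, so the round-robin assignment is an assumption. The search, fetch_body and is_article callables are hypothetical stand-ins supplied by the caller, and IP switching and progress checking are left out.

```python
from itertools import cycle

N_WORKERS = 29  # the process list shows caiji1..caiji29 and detail1..detail29


def worker_names(base, n=N_WORKERS):
    """Return names such as caiji1..caiji29 for a given base name."""
    return [f"{base}{i}" for i in range(1, n + 1)]


def dispatch(items, workers):
    """Assign items to workers round-robin (assumed strategy)."""
    plan = {}
    for item, worker in zip(items, cycle(workers)):
        plan.setdefault(worker, []).append(item)
    return plan


def run_pipeline(keywords, search, fetch_body, is_article):
    """Keyword -> list page -> filtered article URLs -> bodies -> one article."""
    articles = {}
    keyword_plan = dispatch(keywords, worker_names("caiji"))
    for _caiji_worker, assigned_keywords in keyword_plan.items():
        for keyword in assigned_keywords:
            # First-pass filter: keep only URLs that look like article pages.
            urls = [u for u in search(keyword) if is_article(u)]
            parts = []
            url_plan = dispatch(urls, worker_names("detail"))
            for _detail_worker, assigned_urls in url_plan.items():
                parts.extend(fetch_body(u) for u in assigned_urls)
            # Combine the non-empty bodies into a single article.
            articles[keyword] = "\n\n".join(p for p in parts if p)
    return articles
```

Passing the three callables in keeps the sketch independent of any particular site or HTTP library, so the dispatch and merge logic can be tried with simple fake functions.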
And that is how this nice piece of kit comes together! Building it was complex and hard work, but using it is a real pleasure.