public abstract class BreadthCrawler extends AutoParseCrawler
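Inherited fields: autoParse, LOG, parseImg, regexRule, requester, visitor, dbManager, executeInterval, executor, fetcher, forcedSeeds, maxExecuteCount, nextFilter, resumable, RUNNING, seeds, status, STOPED, threads, topN

| Constructor and Description |
|---|
| BreadthCrawler(String crawlPath, boolean autoParse): Constructs a crawler backed by Berkeley DB. The Berkeley DB folder is crawlPath, which stores crawl state such as the history of visited URLs. Do not reuse the same crawlPath for different tasks. Two crawlers that use the same crawlPath and run in parallel will produce errors. |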
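
The rule that each task needs its own crawlPath can be enforced before any crawler is constructed. The following is a minimal sketch under stated assumptions: `CrawlPaths` and its `claim` method are hypothetical helpers, not part of this library, and the check only covers crawlers created inside the same JVM. It uses only the Java standard library.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of the crawler library): one crawlPath per task. */
final class CrawlPaths {
    private final Path baseDir;
    private final Set<String> claimed = new HashSet<>();

    CrawlPaths(String baseDir) {
        // Normalize so that "crawl/./news" and "crawl/news" count as the same folder.
        this.baseDir = Paths.get(baseDir).toAbsolutePath().normalize();
    }

    /** Returns the crawlPath for a task, or throws if it was already handed out. */
    synchronized String claim(String taskName) {
        String path = baseDir.resolve(taskName).normalize().toString();
        if (!claimed.add(path)) {
            throw new IllegalStateException("crawlPath already in use: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        CrawlPaths paths = new CrawlPaths("crawl");
        String newsPath = paths.claim("news");
        String blogPath = paths.claim("blogs");
        System.out.println(newsPath);
        System.out.println(blogPath);
        // Each path would then be passed as the first constructor argument of a
        // BreadthCrawler subclass, e.g. new MyCrawler(newsPath, true), where
        // MyCrawler is an assumed user-defined subclass.
    }
}
```

Claiming the same task name twice throws an `IllegalStateException`, so the conflict shows up at setup time rather than as a database error during a parallel crawl. Crawlers started from separate processes are not covered by this check.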
Inherited methods: addRegex, afterParse, execute, getRegexRule, getRequester, getResponse, getVisitor, isAutoParse, isParseImg, parseLink, setAutoParse, setParseImg, setRegexRule, setRequester, setVisitor

Inherited methods: addSeed (multiple overloads), getDBManager, getExecuteInterval, getExecutor, getMaxExecuteCount, getNextFilter, getThreads, getTopN, inject, injectForcedSeeds, isResumable, setDBManager, setExecuteInterval, setExecutor, setMaxExecuteCount, setNextFilter, setResumable, setThreads, setTopN, start, stop, toString

public BreadthCrawler(String crawlPath, boolean autoParse)
Parameters:
crawlPath - the folder used by Berkeley DB
autoParse - whether to automatically detect new URLs according to the configured regex rules

Copyright © 2017. All Rights Reserved.