public abstract class AutoParseCrawler extends Crawler implements Executor, Visitor, Requester
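| Modifier and Type | Field and Description |
|---|---|
| protected boolean | autoParse: Whether to automatically extract links that match the regex rules and add them to the follow-up tasks |
| static org.slf4j.Logger | LOG |
| protected boolean | parseImg |
| protected RegexRule | regexRule: URL regex constraints |
| protected Requester | requester |
| protected Visitor | visitor |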

Fields inherited from class Crawler: dbManager, executeInterval, executor, fetcher, forcedSeeds, maxExecuteCount, nextFilter, resumable, RUNNING, seeds, status, STOPED, threads, topN

| Constructor and Description |
|---|
| AutoParseCrawler(boolean autoParse) |

| Modifier and Type | Method and Description |
|---|---|
| void | addRegex(String urlRegex): Adds a URL regex constraint |
| protected void | afterParse(Page page, CrawlDatums next) |
| void | execute(CrawlDatum datum, CrawlDatums next) |
| RegexRule | getRegexRule(): Returns the regex rule |
| Requester | getRequester() |
| HttpResponse | getResponse(CrawlDatum crawlDatum) |
| Visitor | getVisitor(): Returns the Visitor |
| boolean | isAutoParse() |
| boolean | isParseImg() |
| protected void | parseLink(Page page, CrawlDatums next) |
| void | setAutoParse(boolean autoParse): Sets whether to automatically extract links that match the regex rules and add them to the follow-up tasks |
| void | setParseImg(boolean parseImg) |
| void | setRegexRule(RegexRule regexRule): Sets the regex rule |
| void | setRequester(Requester requester) |
| void | setVisitor(Visitor visitor): Sets the Visitor |
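
The URL regex constraint added through addRegex can be pictured with a small filter. The following is only a sketch under stated assumptions: `UrlRegexFilter` is a hypothetical stand-in written with the JDK alone, not the library's RegexRule, and the convention that a leading `-` marks an exclusion rule while a leading `+` (or no prefix) marks an inclusion rule is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL regex rule set; NOT the library's RegexRule class.
class UrlRegexFilter {
    private final List<Pattern> positive = new ArrayList<>();
    private final List<Pattern> negative = new ArrayList<>();

    // Assumed convention: "-regex" excludes, "+regex" or a bare regex includes.
    void addRegex(String urlRegex) {
        if (urlRegex.isEmpty()) {
            return;
        }
        char prefix = urlRegex.charAt(0);
        if (prefix == '-') {
            negative.add(Pattern.compile(urlRegex.substring(1)));
        } else if (prefix == '+') {
            positive.add(Pattern.compile(urlRegex.substring(1)));
        } else {
            positive.add(Pattern.compile(urlRegex));
        }
    }

    // A URL is accepted when no exclusion rule matches it and at least one inclusion rule does.
    boolean accepts(String url) {
        for (Pattern p : negative) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        for (Pattern p : positive) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the accepted URLs, preserving their order.
    List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (accepts(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        UrlRegexFilter rules = new UrlRegexFilter();
        rules.addRegex("https://example\\.com/news/.*");
        rules.addRegex("-.*\\.(jpg|png)");
        List<String> candidates = Arrays.asList(
                "https://example.com/news/1.html",
                "https://example.com/news/a.png",
                "https://example.com/about");
        System.out.println(rules.filter(candidates));
    }
}
```

In this sketch exclusion rules are checked first, so a URL that matches both kinds of rule is rejected.
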

Methods inherited from class Crawler: addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, addSeed, getDBManager, getExecuteInterval, getExecutor, getMaxExecuteCount, getNextFilter, getThreads, getTopN, inject, injectForcedSeeds, isResumable, setDBManager, setExecuteInterval, setExecutor, setMaxExecuteCount, setNextFilter, setResumable, setThreads, setTopN, start, stop, toString

public static final org.slf4j.Logger LOG
protected boolean autoParse
protected boolean parseImg
protected Visitor visitor
protected Requester requester
protected RegexRule regexRule
public HttpResponse getResponse(CrawlDatum crawlDatum) throws Exception
Specified by: getResponse in interface Requester
Throws: Exception
public void execute(CrawlDatum datum, CrawlDatums next) throws Exception
protected void afterParse(Page page, CrawlDatums next)
protected void parseLink(Page page, CrawlDatums next)
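
The page does not describe how execute, parseLink, and afterParse relate to one another. The sketch below shows one plausible ordering, inferred only from the method names and the autoParse description: fetch the response, pass the page to the visitor, extract links when autoParse is on, then call afterParse. Every type here (`ExecuteFlowSketch`, `PageFetcher`, `PageVisitor`) is a hypothetical stand-in, pages are reduced to plain strings, and checked exceptions are left out for brevity. It is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the execute() flow; none of these types belong to the library.
class ExecuteFlowSketch {
    // Stand-in for Requester: turns a URL into a response body.
    interface PageFetcher {
        String fetch(String url);
    }

    // Stand-in for Visitor: user callback that inspects the page and may add follow-up URLs.
    interface PageVisitor {
        void visit(String url, String body, List<String> next);
    }

    final List<String> trace = new ArrayList<>(); // records the order of the steps
    private final boolean autoParse;
    private final PageFetcher fetcher;
    private final PageVisitor visitor;

    ExecuteFlowSketch(boolean autoParse, PageFetcher fetcher, PageVisitor visitor) {
        this.autoParse = autoParse;
        this.fetcher = fetcher;
        this.visitor = visitor;
    }

    // Assumed ordering: fetch, visit, optional link extraction, afterParse hook.
    void execute(String url, List<String> next) {
        String body = fetcher.fetch(url);
        trace.add("fetch");
        visitor.visit(url, body, next);
        trace.add("visit");
        if (autoParse) {
            parseLink(url, body, next);
        }
        afterParse(url, body, next);
    }

    // Placeholder: a real crawler would extract matching links from the body here.
    protected void parseLink(String url, String body, List<String> next) {
        trace.add("parseLink");
    }

    // Placeholder hook that runs after link extraction, whether or not autoParse is on.
    protected void afterParse(String url, String body, List<String> next) {
        trace.add("afterParse");
    }

    public static void main(String[] args) {
        ExecuteFlowSketch crawler = new ExecuteFlowSketch(
                true,
                url -> "<html></html>",
                (url, body, out) -> { });
        crawler.execute("https://example.com/", new ArrayList<>());
        System.out.println(crawler.trace);
    }
}
```

Under these assumptions, a subclass could override afterParse to post-process the follow-up list, for example to drop or tag the links that were just added.
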
public void addRegex(String urlRegex)
Parameters: urlRegex - the URL regex constraint
public boolean isAutoParse()
public void setAutoParse(boolean autoParse)
Parameters: autoParse - whether to automatically extract links that match the regex rules and add them to the follow-up tasks
public RegexRule getRegexRule()
public void setRegexRule(RegexRule regexRule)
Parameters: regexRule - the regex rule
public Visitor getVisitor()
public void setVisitor(Visitor visitor)
Parameters: visitor - the Visitor
public Requester getRequester()
public void setRequester(Requester requester)
public boolean isParseImg()
public void setParseImg(boolean parseImg)
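
The parseImg flag has no description on this page. Judging by its name, it presumably controls whether image sources are collected alongside ordinary hyperlinks during link extraction; that reading is an assumption. The sketch below illustrates the idea with a hypothetical `LinkExtractionSketch` class. It uses JDK regular expressions in place of an HTML parser to stay dependency-free, so it only handles double-quoted attributes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a parseImg-style switch; not the library's parseLink.
class LinkExtractionSketch {
    // Simplified attribute patterns; a real crawler would use an HTML parser instead.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collects anchor targets matching urlRule; image sources too when parseImg is true.
    static List<String> extract(String html, Pattern urlRule, boolean parseImg) {
        List<String> found = new ArrayList<>();
        collect(HREF, html, urlRule, found);
        if (parseImg) {
            collect(IMG_SRC, html, urlRule, found);
        }
        return found;
    }

    // Appends every captured attribute value that fully matches urlRule.
    private static void collect(Pattern attribute, String html, Pattern urlRule, List<String> out) {
        Matcher m = attribute.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (urlRule.matcher(url).matches()) {
                out.add(url);
            }
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"https://example.com/a.html\">A</a>"
                + "<img src=\"https://example.com/p.png\">"
                + "<a href=\"https://other.org/x\">X</a>";
        Pattern rule = Pattern.compile("https://example\\.com/.*");
        System.out.println(extract(html, rule, false));
        System.out.println(extract(html, rule, true));
    }
}
```

In this sketch the same URL rule filters both anchor targets and image sources, so turning the flag on only widens where candidate URLs come from.
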
Copyright © 2017. All Rights Reserved.