The previous article covered manual classification with WebCollector; this one continues from there and covers automatic classification.

  1. Manual mode means extracting the URLs bound to specific tags yourself. Because you know which tag a URL was taken from, you know what type it is (what those URLs have in common), so you attach the same type label to it, and URLs carrying the same label later go through the same processing logic. This suits URLs extracted from repeated list structures, such as an article list.
  2. Automatic mode means you no longer care whether a URL came from an a tag or an img tag; you only look at the URL itself. For example, a URL ending in jpg is an image, and a CSDN article page URL contains the keyword details, so one regular expression is enough to match that whole class of URLs; URLs matching different regexes then go through different logic. Instead of pulling links off particular a tags by hand, you let the crawler collect links by regex and classify them by the structural patterns in the URLs (see the sketch just after this list). The drawback is that this may also pick up a pile of useless URLs.
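
Here is a minimal, WebCollector-independent sketch of that idea: classify a URL purely by its structure. The regexes and example URLs below are made up for illustration.

import java.util.regex.Pattern;

public class UrlClassifier {
    // Example patterns, for illustration only: image files by suffix,
    // CSDN-style article pages by the "details" path segment.
    private static final Pattern IMAGE = Pattern.compile(".*\\.(jpg|png|gif)$");
    private static final Pattern ARTICLE = Pattern.compile(".*/details/\\d+.*");

    /** Decide a URL's type from its structure alone, ignoring which tag it came from. */
    public static String classify(String url) {
        if (IMAGE.matcher(url).matches()) {
            return "image";
        }
        if (ARTICLE.matcher(url).matches()) {
            return "article";
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify("https://example.com/img/photo.jpg"));             // image
        System.out.println(classify("https://blog.csdn.net/xxx/article/details/123")); // article
    }
}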

A regular expression starting with + is a positive rule, one starting with - is a negative rule, and a regex with neither prefix is also treated as a positive rule.
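
For example, a crawler constructor could mix positive and negative rules as in the sketch below; the seed and the URL patterns are placeholders used only to illustrate the prefixes.

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;

public class RegexRuleDemo extends BreadthCrawler {
    public RegexRuleDemo(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        this.addSeed("https://blog.jiangqiang.xin");
        // no prefix: positive rule, follow article links
        this.addRegex("https://blog.jiangqiang.xin/archives/.*");
        // "+" prefix: also a positive rule
        this.addRegex("+https://blog.jiangqiang.xin/categories/.*");
        // "-" prefix: negative rules, skip in-page anchors and css/js files even if a positive rule matches
        this.addRegex("-.*#.*");
        this.addRegex("-.*\\.(css|js)$");
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        // nothing to do here; this class only demonstrates the regex prefixes
    }
}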
Code example:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;
import xin.jiangqiang.utils.ImgUtils;

/**
 * @author jiangqiang
 * @date 2020/12/10 14:09
 */
public class MyAutoCrawler extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will auto extract
     *                  links which match regex rules from page
     */
    public MyAutoCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start pages*/
        this.addSeed("https://blog.jiangqiang.xin");
        // crawl image resources
        this.addRegex("https://resource.jiangqiang.xin/halo-blog/.*");
        // crawl article pages
        this.addRegex("https://blog.jiangqiang.xin/archives/.*");
        // number of fetcher threads
        setThreads(50);
        getConf().setTopN(100);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        String url = page.url();
        // article content page: print title and body (the CSS selectors match this blog theme's HTML)
        if (page.matchUrl("https://blog.jiangqiang.xin/archives/.*")) {
            System.out.println("title:" + page.select(".entry-header>.entry-title").first().text());
            System.out.println("content:" + page.select("article .entry-content").first().text());
        } else if (page.matchUrl("https://resource.jiangqiang.xin/halo-blog/.*")) {
            ImgUtils.save(page, "crawl/images");
        }
    }

    public static void main(String[] args) throws Exception {
        // the second constructor argument (autoParse) must be true so links matching the regex rules are extracted automatically
        MyAutoCrawler crawler = new MyAutoCrawler("crawl", true);
        // start crawling with a depth of 4
        crawler.start(4);
    }
}
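
ImgUtils above is the author's own helper from the xin.jiangqiang.utils package rather than part of WebCollector. If you don't have it, a minimal stand-in could look like the sketch below; it assumes page.content() returns the raw response bytes and simply writes them to a file named after the last URL segment, with no de-duplication or content-type checks.

import cn.edu.hfut.dmic.webcollector.model.Page;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ImgUtils {
    /** Write the fetched bytes to dir, naming the file after the last segment of the URL. */
    public static void save(Page page, String dir) {
        try {
            String url = page.url();
            String name = url.substring(url.lastIndexOf('/') + 1);
            if (name.isEmpty()) {
                name = String.valueOf(url.hashCode());
            }
            Path target = Paths.get(dir, name);
            Files.createDirectories(target.getParent());
            Files.write(target, page.content());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}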

Annotation-based version:

import cn.edu.hfut.dmic.webcollector.model.CrawlDatums;
import cn.edu.hfut.dmic.webcollector.model.Page;
import cn.edu.hfut.dmic.webcollector.plugin.rocks.BreadthCrawler;
import xin.jiangqiang.utils.ImgUtils;
// also import WebCollector's @MatchUrl annotation used below (its package depends on the WebCollector version)

/**
 * @author jiangqiang
 * @date 2020/12/10 11:04
 */
public class MyAutoCrawlerAnno extends BreadthCrawler {
    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will auto extract
     *                  links which match regex rules from page
     */
    public MyAutoCrawlerAnno(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /*start pages*/
        this.addSeed("https://blog.jiangqiang.xin");
        // crawl image resources
        this.addRegex("https://resource.jiangqiang.xin/halo-blog/.*");
        // crawl article pages
        this.addRegex("https://blog.jiangqiang.xin/archives/.*");
        // number of fetcher threads
        setThreads(50);
        getConf().setTopN(100);
    }

    @Override
    public void visit(Page page, CrawlDatums next) {
        // left empty: the pages we care about are handled by the @MatchUrl-annotated methods below
    }

    @MatchUrl(urlRegex = "https://blog.jiangqiang.xin/archives/.*")
    public void visitContent(Page page, CrawlDatums next) {
        System.out.println("title:" + page.select(".entry-header>.entry-title").first().text());
        System.out.println("content:" + page.select("article .entry-content").first().text());
    }

    @MatchUrl(urlRegex = "https://resource.jiangqiang.xin/halo-blog/.*")
    public void visitImg(Page page, CrawlDatums next) {
        ImgUtils.save(page, "crawl/images");
    }

    public static void main(String[] args) throws Exception {
        // the second constructor argument (autoParse) must be true so links matching the regex rules are extracted automatically
        MyAutoCrawlerAnno crawler = new MyAutoCrawlerAnno("crawl", true);
        // start crawling with a depth of 4
        crawler.start(4);
    }
}
