jsoup – Basic web crawler example

Posted on: 2019-11-05
Tags: crawler, html parser, jsoup, parser, web crawler


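A Web Crawler is a program that navigates the Web and finds new or updated pages for indexing. The crawler starts from seed websites or a wide range of popular URLs (also known as the https://frontera.readthedocs.io/ja/v0.2.0/topics/what-is-a-crawl-frontier.html[frontier]) and searches in depth and width for hyperlinks to extract.

A Web Crawler must be kind and robust. Kindness means that the crawler respects the rules set by robots.txt and avoids visiting a website too often. Robustness refers to the ability to avoid spider traps and other malicious behavior. Other good attributes for a Web Crawler are the ability to be distributed across multiple machines, extensibility, continuous operation, and the ability to prioritize pages by quality.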
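One way to avoid visiting a website too often is to enforce a minimum delay per host. The following is a minimal sketch under assumptions: `PolitenessGate`, `reserve` and the one-second delay are made-up names and values, not part of jsoup, and robots.txt handling is not covered. Time is passed in as a parameter so the logic can be checked without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not part of jsoup): enforces a minimum delay
// between two requests to the same host.
class PolitenessGate {
    private final long minDelayMillis;
    // Earliest time (in millis) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessGate(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    // Returns how long the caller should wait before contacting the host
    // and reserves the following time slot for that host.
    long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long start = Math.max(earliest, nowMillis);
        nextAllowed.put(host, start + minDelayMillis);
        return start - nowMillis;
    }

    public static void main(String[] args) {
        PolitenessGate gate = new PolitenessGate(1000);
        System.out.println(gate.reserve("example.com", 0));   // prints 0
        System.out.println(gate.reserve("example.com", 200)); // prints 800
        System.out.println(gate.reserve("example.org", 200)); // prints 0
    }
}
```

A crawler could call `reserve(host, System.currentTimeMillis())` and sleep for the returned number of milliseconds before fetching the page.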

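1. Steps to create a web crawler

The basic steps to write a Web Crawler are:

1. Pick a URL from the frontier
2. Fetch the HTML code
3. Parse the HTML to extract links to other URLs
4. Check whether you have already crawled the URLs and/or whether you have seen the same content before
   (i) If not, add it to the index
5. For each extracted URL
   (i) Confirm that it agrees to be checked (robots.txt, crawling frequency)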
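The steps above can be sketched as a loop over a frontier queue. This is only an illustration, not the crawler built later in this article: `FrontierLoop`, `crawl` and the in-memory `site` map are made-up names, fetching and parsing are replaced by a function so the example runs without network access, and the robots.txt and frequency check is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class FrontierLoop {

    // Runs the crawl steps with a caller-supplied link extractor and
    // returns the index in the order pages were added to it.
    static List<String> crawl(String seed, Function<String, List<String>> extractLinks) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> index = new LinkedHashSet<>();
        frontier.add(seed);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();                 // 1. pick a URL from the frontier
            if (!index.add(url)) {                        // 4. skip URLs already crawled
                continue;
            }
            List<String> links = extractLinks.apply(url); // 2. and 3. fetch and parse (stubbed)
            for (String link : links) {                   // 5. queue each extracted URL
                if (!index.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return new ArrayList<>(index);
    }

    public static void main(String[] args) {
        // A made-up three-page site standing in for real HTTP requests.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/", Arrays.asList("/a", "/b"));
        site.put("/a", Arrays.asList("/", "/b"));
        site.put("/b", Arrays.asList("/a"));

        List<String> index = crawl("/", url -> site.getOrDefault(url, Collections.emptyList()));
        System.out.println(index); // prints [/, /a, /b]
    }
}
```

Using a queue visits pages level by level; swapping it for a stack would give a depth-first order instead.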

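Truth be told, developing and maintaining one Web Crawler across all the pages on the internet is… difficult, considering that there are about 1 billion websites online.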

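If you are reading this article, chances are you are not looking for a guide to create a Web Crawler but for a guide to create a Web Scraper. Why, then, is the article called "Basic Web Crawler"? Well…

Because it's catchy… Really! Few people know the difference between crawlers and scrapers, so we all tend to use the word "crawling" for everything, even for offline data scraping. Also, because building a Web Scraper requires a crawl agent too. And finally, because this article intends to inform as well as provide a working example.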

2. The skeleton of a crawler

For HTML parsing we will use https://jsoup.org/[jsoup]. The examples below were developed using jsoup version 1.10.2.

pom.xml

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>

So let's start with the basic code for a Web Crawler.

BasicWebCrawler.java

package com.techfou.basicwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class BasicWebCrawler {

    private HashSet<String> links;

    public BasicWebCrawler() {
        links = new HashSet<String>();
    }

    public void getPageLinks(String URL) {
       //4. Check if you have already crawled the URLs
       //(we are intentionally not checking for duplicate content in this example)
        if (!links.contains(URL)) {
            try {
               //4. (i) If not add it to the index
                if (links.add(URL)) {
                    System.out.println(URL);
                }

               //2. Fetch the HTML code
                Document document = Jsoup.connect(URL).get();
               //3. Parse the HTML to extract links to other URLs
                Elements linksOnPage = document.select("a[href]");

               //5. For each extracted URL... go back to Step 4.
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[]args) {
       //1. Pick a URL from the frontier
        new BasicWebCrawler().getPageLinks("//");
    }

}

Note: Don't let this code run for too long. It can take hours without finishing.

Sample Output:

…
//
//tutorials/android-tutorial/
//tutorials/java-io-tutorials/
//tutorials/java-xml-tutorials/
//tutorials/java-json-tutorials/
//tutorials/java-regular-expression-tutorials/
//tutorials/jdbc-tutorials/
… (+ many more links)

Like we mentioned before, a Web Crawler searches in width and depth for
links. If we imagine the links on a web site in a tree-like structure,
the root node or level zero would be the link we start with, the next
level would be all the links that we found on level zero and so on.

3. Taking crawling depth into account

We will modify the previous example to set depth of link extraction.
Notice that the only true difference between this example and the
previous is that the recursive `getPageLinks()` method has an integer
argument that represents the depth of the link which is also added as a
condition in the `if...else` statement.

WebCrawlerWithDepth.java

package com.techfou.depthwebcrawler;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.HashSet;

public class WebCrawlerWithDepth {
    private static final int MAX_DEPTH = 2;
    private HashSet<String> links;

    public WebCrawlerWithDepth() {
        links = new HashSet<>();
    }

    public void getPageLinks(String URL, int depth) {
        //Only crawl URLs that are new and still within the depth limit
        if (!links.contains(URL) && depth < MAX_DEPTH) {
            System.out.println(">> Depth: " + depth + "[" + URL + "]");
            try {
                links.add(URL);

                Document document = Jsoup.connect(URL).get();
                Elements linksOnPage = document.select("a[href]");

                //Links found on this page are one level deeper
                depth++;
                for (Element page : linksOnPage) {
                    getPageLinks(page.attr("abs:href"), depth);
                }
            } catch (IOException e) {
                System.err.println("For '" + URL + "': " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        new WebCrawlerWithDepth().getPageLinks("//", 0);
    }
}

Note: Feel free to run the code above. It took only a few minutes on my laptop with the depth set to 2. Keep in mind that the greater the depth, the longer it takes to finish.

Sample Output:

…
>> Depth: 1[https://docs.gradle.org/current/userguide/userguide.html]
>> Depth: 1[http://hibernate.org/orm/]
>> Depth: 1[https://jax-ws.java.net/]
>> Depth: 1[http://tomcat.apache.org/tomcat-8.0-doc/index.html]
>> Depth: 1[http://www.javacodegeeks.com/]
For 'http://www.javacodegeeks.com/': HTTP error fetching URL
>> Depth: 1[http://beust.com/weblog/]
>> Depth: 1[https://dzone.com]
>> Depth: 1[https://wordpress.org/]
>> Depth: 1[Home]
>> Depth: 1[//privacy-policy/]
…
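To get a feel for why a greater depth takes so much longer, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every page links to the same number of previously unseen pages (50 is a made-up figure); `CrawlGrowth` and `pagesThroughLevel` are hypothetical names. Real sites share many links between pages, so actual numbers would be lower.

```java
class CrawlGrowth {

    // Upper bound on the pages visited when every page links to
    // `branching` unseen pages and levels 0..lastLevel are crawled.
    static long pagesThroughLevel(long branching, int lastLevel) {
        long total = 0;
        long pagesAtLevel = 1; // level 0 is the single seed page
        for (int level = 0; level <= lastLevel; level++) {
            total += pagesAtLevel;
            pagesAtLevel *= branching;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed average of 50 new links per page.
        for (int lastLevel = 1; lastLevel <= 3; lastLevel++) {
            System.out.println("levels 0.." + lastLevel + ": "
                    + pagesThroughLevel(50, lastLevel) + " pages");
        }
        // prints:
        // levels 0..1: 51 pages
        // levels 0..2: 2551 pages
        // levels 0..3: 127551 pages
    }
}
```

Under this assumption each extra level multiplies the work by roughly the number of links per page.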

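4. Data Scraping vs. Data Crawling

So far so good for a theoretical approach to the matter. In practice, you will hardly ever build a generic crawler, and if you need a "real" one, you should use tools that already exist. Most of what the average developer does is extract specific information from specific websites. Even though that includes building a Web Crawler, it is actually called Web Scraping.

There is a very good article by Arpan Jha of PromptCloud on Data Scraping vs. Data Crawling, which personally helped me a lot to understand this distinction. I recommend reading it.

To summarize it with a table taken from that article:

|===
|Data Scraping|Data Crawling
|Can be done at any scale|Mostly done at a large scale
|Deduplication is not necessarily a part|Deduplication is an essential part
|Needs crawl agent and parser|Needs only crawl agent
|===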

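As promised in the intro, it is time to move from theory to a working example. Let's imagine a scenario in which we want to get the URLs of all the articles on mkyong.com that relate to Java 8. Our goal is to retrieve that information in the shortest time possible and therefore avoid crawling the whole website. Besides, crawling everything would waste not only the server's resources but our time as well.

5. Case Study – Extract all articles for "Java 8" on mkyong.com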

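5.1 The first thing to do is look at the code of the website. A quick look at mkyong.com shows the paging on the front page, and that each page follows a `/page/xx` pattern.

[Image: jsoup-web-crawler-example-1]

This brings us to the realization that the information we are looking for is easily reached by retrieving all the links that include `/page/`. So instead of running through the whole website, we limit our search using `document.select("a[href^=\"//page/\"]")`. With this `css selector` we collect only the links that start with `//page/`.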

5.2 Next, we notice that the article titles, which are what we want, are wrapped in `<h2></h2>` and `<a href=""></a>` tags.

[Image: jsoup-web-crawler-example-2]

So, to extract the article titles, we use a `css selector` that restricts the `select` method to exactly that information: `document.select("h2 a[href^=\"//\"]");`
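To see how the two prefix selectors behave, they can be tried on a small inline HTML snippet instead of a live site. This is a sketch under assumptions: the markup and the `example.com` host are invented for illustration, and `SelectorDemo` is a hypothetical class name. It uses `Jsoup.parse(String html, String baseUri)` from the jsoup dependency declared in the pom.xml.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class SelectorDemo {
    // Made-up host and markup, used only for illustration.
    static final String BASE = "http://example.com/";
    static final String HTML =
            "<h2><a href=\"http://example.com/java8/streams/\">Java 8 Streams</a></h2>"
          + "<a href=\"http://example.com/page/2\">2</a>"
          + "<a href=\"http://other.example/page/9\">elsewhere</a>";

    // Returns the absolute URLs of all elements matching the CSS query.
    static List<String> select(String cssQuery) {
        Document document = Jsoup.parse(HTML, BASE);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select(cssQuery)) {
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Paging links only: href must start with the given prefix.
        System.out.println(select("a[href^=\"http://example.com/page/\"]"));
        // Title links only: an <a> inside an <h2>, on the same host.
        System.out.println(select("h2 a[href^=\"http://example.com/\"]"));
    }
}
```

The first query should pick up only the paging link on the assumed host, and the second only the link wrapped in `<h2>`.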

5.3 Finally, we keep only the links whose title contains "Java 8" and save them to a `file`.
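The title check from step 5.3 can be tried in isolation. The sketch below is a variant, not the exact check used in this article: it uses a case-insensitive `Pattern` with `find()` rather than listing individual spellings, and `TitleFilter`, `keepJava8` and the sample titles are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class TitleFilter {
    // Matches "java 8" anywhere in a title, in any letter case.
    private static final Pattern JAVA_8 =
            Pattern.compile("java 8", Pattern.CASE_INSENSITIVE);

    // Returns the titles that mention Java 8, in their original order.
    static List<String> keepJava8(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (JAVA_8.matcher(title).find()) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up titles for illustration.
        List<String> titles = Arrays.asList(
                "Java 8 Streams filter examples",
                "Spring Boot hello world",
                "JAVA 8 forEach examples");
        System.out.println(keepJava8(titles));
        // prints [Java 8 Streams filter examples, JAVA 8 forEach examples]
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regular expression for every title.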

Extractor.java

package com.mkyong.extractor;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Extractor {
    private HashSet<String> links;
    private List<List<String>> articles;

    public Extractor() {
        links = new HashSet<>();
        articles = new ArrayList<>();
    }

   //Find all URLs that start with "//page/" and add them to the HashSet
    public void getPageLinks(String URL) {
        if (!links.contains(URL)) {
            try {
                Document document = Jsoup.connect(URL).get();
                Elements otherLinks = document.select("a[href^=\"//page/\"]");

                for (Element page : otherLinks) {
                    if (links.add(URL)) {
                       //Remove the comment from the line below if you want to see it running on your editor
                        System.out.println(URL);
                    }
                    getPageLinks(page.attr("abs:href"));
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        }
    }

   //Connect to each link saved in the article and find all the articles in the page
    public void getArticles() {
        links.forEach(x -> {
            Document document;
            try {
                document = Jsoup.connect(x).get();
                Elements articleLinks = document.select("h2 a[href^=\"//\"]");
                for (Element article : articleLinks) {
                   //Only retrieve the titles of the articles that contain Java 8
                    if (article.text().matches("^.*?(Java 8|java 8|JAVA 8).*$")) {
                       //Remove the comment from the line below if you want to see it running on your editor,
                       //or wait for the File at the end of the execution
                       //System.out.println(article.attr("abs:href"));

                        ArrayList<String> temporary = new ArrayList<>();
                        temporary.add(article.text());//The title of the article
                        temporary.add(article.attr("abs:href"));//The URL of the article
                        articles.add(temporary);
                    }
                }
            } catch (IOException e) {
                System.err.println(e.getMessage());
            }
        });
    }

    public void writeToFile(String filename) {
        FileWriter writer;
        try {
            writer = new FileWriter(filename);
            articles.forEach(a -> {
                try {
                    String temp = "- Title: " + a.get(0) + " (link: " + a.get(1) + ")\n";
                   //display to console
                    System.out.println(temp);
                   //save to file
                    writer.write(temp);
                } catch (IOException e) {
                    System.err.println(e.getMessage());
                }
            });
            writer.close();
        } catch (IOException e) {
            System.err.println(e.getMessage());
        }
    }

    public static void main(String[]args) {
        Extractor bwc = new Extractor();
        bwc.getPageLinks("/");
        bwc.getArticles();
        bwc.writeToFile("Java 8 Articles");
    }
}

Output:

[Image: jsoup-web-crawler-example-3]
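One detail of `writeToFile` in the Extractor above: the writer is closed only when no exception occurs. A possible variant, shown here as a sketch rather than as the article's code, uses try-with-resources so the file is closed in every case. `ArticleWriter`, `write`, the sample entry and the temporary file are assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class ArticleWriter {

    // Writes one "- Title: ... (link: ...)" line per [title, url] pair.
    static void write(Path file, List<List<String>> articles) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (List<String> article : articles) {
                writer.write("- Title: " + article.get(0) + " (link: " + article.get(1) + ")");
                writer.newLine();
            }
        } // the writer is closed here, even if write() throws
    }

    public static void main(String[] args) throws IOException {
        // Made-up entry and a temporary file, for illustration only.
        Path file = Files.createTempFile("articles", ".txt");
        write(file, Arrays.asList(Arrays.asList("Java 8 example", "http://example.com/java8/")));
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Letting `write` declare `IOException` leaves the decision on how to report a failure to the caller.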