jsoup HTML parser hello world examples


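Jsoup is an HTML parser. Its "jquery-like" and "regex" selector syntax is very easy to use and flexible enough to get whatever you need. The examples below show how to use Jsoup to get links, images, the page title, meta elements, form inputs, the favicon, and the content of a "div" element from an HTML page.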
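To get a feel for the two selector styles before hitting a real site, here is a minimal sketch that runs a few queries against an in-memory page. The class name `SelectorSketch`, the helper `count`, and the sample markup are made up for illustration; only `Jsoup.parse` and `select` come from jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSketch {

    // Hypothetical sample markup, used only for illustration.
    static final String HTML =
            "<html><head><title>Demo</title></head><body>"
          + "<div id=\"main\" class=\"box\"><a href=\"/a.html\">A</a></div>"
          + "<img src=\"logo.PNG\"><img src=\"photo.jpeg\"><img src=\"doc.pdf\">"
          + "</body></html>";

    // Parses the sample page and returns how many elements the query matches.
    static int count(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // "jquery-like" selectors: tag, #id, .class, child combinator, attribute presence
        System.out.println("div#main          : " + count("div#main"));
        System.out.println("div.box > a[href] : " + count("div.box > a[href]"));

        // "regex" selector: [attr~=regex] keeps elements whose attribute value matches
        System.out.println("image files       : " + count("img[src~=(?i)\\.(png|jpe?g|gif)]"));
    }
}
```

The `(?i)` flag makes the regex case-insensitive, so `logo.PNG` is matched while `doc.pdf` is not.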

pom.xml

        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>

1. Get all hyperlinks

This example shows how to use jsoup to get the page title and grab all the links from "google.com".

HTMLParserExample1.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class HTMLParserExample1 {

    public static void main(String[] args) {

        Document doc;
        try {

            // need http protocol
            doc = Jsoup.connect("http://google.com").get();

            // get page title
            String title = doc.title();
            System.out.println("title : " + title);

            // get all links
            Elements links = doc.select("a[href]");
            for (Element link : links) {

                // get the value from href attribute
                System.out.println("\nlink : " + link.attr("href"));
                System.out.println("text : " + link.text());

            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

Output

title : Google

link : http://www.google.com.my/imghp?hl=en&tab=wi
text : Images

link : http://maps.google.com.my/maps?hl=en&tab=wl
text : Maps
//omitted for readability
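Pages often contain relative links such as `/imghp?hl=en`. A minimal sketch of one way to turn them into absolute URLs is to read the attribute with the `abs:` prefix, which resolves the value against the document's base URI. The class name, the sample markup, and the `example.com` addresses below are assumptions for illustration. When a page is fetched with `Jsoup.connect(...).get()`, the base URI is taken from the fetched URL, so no second argument is needed there.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class AbsoluteLinksSketch {

    // Returns every link in the markup as an absolute URL, resolved against baseUri.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative values; absolute ones are returned as they are
            result.add(link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical markup with one relative and one absolute link
        String html = "<a href=\"/imghp?hl=en\">Images</a>"
                    + "<a href=\"http://maps.example.com/\">Maps</a>";

        for (String url : absoluteLinks(html, "http://www.example.com/")) {
            System.out.println("link : " + url);
        }
    }
}
```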

Note: It is recommended to specify a "userAgent" in Jsoup to avoid HTTP 403 error messages.

    Document doc = Jsoup.connect("http://anyurl.com")
    .userAgent("Mozilla")
    .get();
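Building on the note above, here is a hedged sketch that also sets a timeout and reports the status code when the server still answers with an error. The user agent string, the 10 second timeout, and the class and method names are illustrative choices, not recommendations from jsoup.

```java
import org.jsoup.Connection;
import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class PoliteFetchSketch {

    // Builds the request without sending it; user agent and timeout are illustrative values.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)")
                .timeout(10_000); // milliseconds
    }

    public static void main(String[] args) {
        try {
            Document doc = prepare("http://anyurl.com").get();
            System.out.println("title : " + doc.title());
        } catch (HttpStatusException e) {
            // Raised for HTTP error responses such as 403; must be caught before IOException
            System.out.println("HTTP " + e.getStatusCode() + " for " + e.getUrl());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```

`HttpStatusException` is a subclass of `IOException`, which is why its catch block comes first.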

2. Get all images

The second example shows how to use the Jsoup regex selector to grab all image files (png, jpg, gif) from "yahoo.com".

HTMLParserExample2.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class HTMLParserExample2 {

    public static void main(String[] args) {

        Document doc;
        try {

            // get all images
            doc = Jsoup.connect("http://yahoo.com").get();
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
            for (Element image : images) {

                System.out.println("\nsrc : " + image.attr("src"));
                System.out.println("height : " + image.attr("height"));
                System.out.println("width : " + image.attr("width"));
                System.out.println("alt : " + image.attr("alt"));

            }

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

}

Output

src : http://l.yimg.com/a/i/mntl/ww/events/p.gif
height : 50
width : 202
alt : Yahoo!

src : http://l.yimg.com/a/i/ww/met/intl_flag_icons/20111011/my_flag.gif
height :
width :
alt :
//omitted for readability
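The blank `height`, `width` and `alt` values in the output come from `attr()` returning an empty string when an attribute is missing. If you need to tell "missing" from "empty", a small sketch like the following can help. The class name, the helper `attrOr`, the "n/a" fallback, and the sample markup are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ImageAttrSketch {

    // Returns the attribute value, or the fallback when the attribute is absent.
    static String attrOr(Element el, String name, String fallback) {
        return el.hasAttr(name) ? el.attr(name) : fallback;
    }

    // Lists "src height=..." for every png/jpg/gif image in the markup.
    static List<String> describeImages(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element image : doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]")) {
            lines.add(image.attr("src") + " height=" + attrOr(image, "height", "n/a"));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical markup: one image with a height, one without, one non-matching file type
        String html = "<img src=\"p.gif\" height=\"50\">"
                    + "<img src=\"flag.GIF\">"
                    + "<img src=\"x.svg\">";

        for (String line : describeImages(html)) {
            System.out.println(line);
        }
    }
}
```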

3. Get meta elements

The last example simulates an offline HTML page and uses jsoup to parse the content. It grabs the "meta" keywords and description, and also the div element with the id of "color".

HTMLParserExample3.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HTMLParserExample3 {

    public static void main(String[] args) {

        StringBuilder html = new StringBuilder();

        html.append("<!DOCTYPE html>");
        html.append("<html lang=\"en\">");
        html.append("<head>");
        html.append("<meta charset=\"UTF-8\"/>");
        html.append("<title>Hollywood Life</title>");
        html.append("<meta name=\"description\" content=\"The latest entertainment news\"/>");
        html.append("<meta name=\"keywords\" content=\"hollywood gossip, hollywood news\"/>");
        html.append("</head>");
        html.append("<body>");
        html.append("<div id='color'>This is red</div>");
        html.append("</body>");
        html.append("</html>");

        Document doc = Jsoup.parse(html.toString());

        // get meta description content
        String description = doc.select("meta[name=description]").get(0).attr("content");
        System.out.println("Meta description : " + description);

        // get meta keyword content
        String keywords = doc.select("meta[name=keywords]").first().attr("content");
        System.out.println("Meta keyword : " + keywords);

        String color1 = doc.getElementById("color").text();
        String color2 = doc.select("div#color").get(0).text();

        System.out.println(color1);
        System.out.println(color2);

    }

}

Output

Meta description : The latest entertainment news
Meta keyword : hollywood gossip, hollywood news
This is red
This is red
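The example works because both meta tags exist. On a page without them, `get(0)` throws an `IndexOutOfBoundsException` and `first()` returns null. A minimal null-safe sketch could look like this; the class name, the helper `metaContent`, and the choice to return an empty string are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetaSketch {

    // Returns the content of <meta name="..."> or an empty string when the tag is missing.
    static String metaContent(String html, String name) {
        Document doc = Jsoup.parse(html);
        Element meta = doc.select("meta[name=" + name + "]").first();
        return meta == null ? "" : meta.attr("content");
    }

    public static void main(String[] args) {
        // Hypothetical page that has a description but no keywords
        String html = "<html><head>"
                    + "<meta name=\"description\" content=\"The latest entertainment news\"/>"
                    + "</head><body></body></html>";

        System.out.println("Meta description : " + metaContent(html, "description"));
        System.out.println("Meta keyword : " + metaContent(html, "keywords"));
    }
}
```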

4. Get form inputs

This code snippet shows how to use Jsoup to get HTML form inputs (names and values). For detailed usage, see the article "Automate login to a website in Java" at /java/how-to-automate-login-a-website-java-example/

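// needs java.util.ArrayList and java.util.List in addition to the jsoup imports
public List<String> getFormParams(String html) {

    Document doc = Jsoup.parse(html);

    // HTML form id
    Element loginform = doc.getElementById("your_form_id");

    List<String> paramList = new ArrayList<String>();
    if (loginform == null) {
        return paramList; // no form with that id
    }

    Elements inputElements = loginform.getElementsByTag("input");
    for (Element inputElement : inputElements) {
        String key = inputElement.attr("name");
        String value = inputElement.attr("value");
        paramList.add(key + "=" + value);
    }

    return paramList;

}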
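To try the idea without a real login page, here is a self-contained sketch that reads the inputs of a small in-memory form into a map. The class name, the form id `login`, the field names, and the decision to skip inputs without a name are all assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class FormParamsSketch {

    // Collects name/value pairs of the <input> elements inside the form with the given id.
    static Map<String, String> formParams(String html, String formId) {
        Document doc = Jsoup.parse(html);
        Map<String, String> params = new LinkedHashMap<>(); // keeps document order

        Element form = doc.getElementById(formId);
        if (form == null) {
            return params; // no such form
        }

        for (Element input : form.getElementsByTag("input")) {
            String key = input.attr("name");
            if (!key.isEmpty()) { // skip unnamed inputs such as a plain submit button
                params.put(key, input.attr("value"));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical login form
        String html = "<form id=\"login\">"
                    + "<input type=\"text\" name=\"user\" value=\"alice\">"
                    + "<input type=\"hidden\" name=\"token\" value=\"abc123\">"
                    + "<input type=\"submit\" value=\"Go\">"
                    + "</form>";

        System.out.println(formParams(html, "login"));
    }
}
```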

5. Get the favicon

This code shows how to use Jsoup to get the favicon of a page.

jSoupExample.java

package com.mkyong;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class jSoupExample {

    public static void main(String[] args) {

    StringBuilder html = new StringBuilder();

    html.append("<html lang=\"en\">");
    html.append("<head>");
    html.append("<link rel=\"icon\" href=\"http://example.com/image.ico\"/>");
    // html.append("<meta content=\"/images/google_favicon_128.png\" itemprop=\"image\">");
    html.append("</head>");
    html.append("<body>");
    html.append("something");
    html.append("</body>");
    html.append("</html>");

    Document doc = Jsoup.parse(html.toString());

    String fav = "";

    Element element = doc.head().select("link[href~=.*\\.(ico|png)]").first();
    if (element == null) {

        element = doc.head().select("meta[itemprop=image]").first();
        if (element != null) {
            fav = element.attr("content");
        }
    } else {
        fav = element.attr("href");
    }
    System.out.println(fav);
  }

}

Output

http://example.com/image.ico
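Real pages often declare the favicon with a relative path such as `/favicon.ico`. The following sketch wraps the same link-then-meta lookup in a reusable method and resolves the result against a base URI with the `abs:` attribute prefix. The class name, the method `findFavicon`, and the `example.com` addresses are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FaviconSketch {

    // Returns the absolute favicon URL, or an empty string when none is declared.
    static String findFavicon(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);

        // First choice: a <link> whose href ends in .ico or .png
        Element link = doc.head().select("link[href~=.*\\.(ico|png)]").first();
        if (link != null) {
            return link.attr("abs:href");
        }

        // Fallback: <meta itemprop="image" content="...">
        Element meta = doc.head().select("meta[itemprop=image]").first();
        if (meta != null) {
            return meta.attr("abs:content");
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical page with a relative favicon path
        String html = "<html><head><link rel=\"icon\" href=\"/favicon.ico\"/></head>"
                    + "<body>something</body></html>";

        System.out.println(findFavicon(html, "http://example.com/"));
    }
}
```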