This tutorial shows how to extract hyperlinks from an HTML page. For example, to get the link from the following content:

this is text1 <a href='mkyong.com' target='_blank'>hello</a> this is text2...

  1. First, get the "value" from the `a` tag. Result:


a href='mkyong.com' target='_blank'

  2. Then, get the "link" from the extracted value. Result:


mkyong.com
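
The two steps above can be sketched in a few lines of Java. This is a minimal illustration that assumes the two patterns described in the next section; the class and method names (`TwoStepLinkDemo`, `extractTagValue`, `extractLink`) are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoStepLinkDemo {

    // Step 1: capture everything between "<a" and ">" (the tag's attributes).
    static String extractTagValue(String html) {
        Matcher m = Pattern.compile("(?i)<a([^>]+)>(.+?)</a>").matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    // Step 2: find the href value inside the attributes and strip the quotes.
    static String extractLink(String tagValue) {
        Matcher m = Pattern
            .compile("\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))")
            .matcher(tagValue);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replace("'", "").replace("\"", "");
    }

    public static void main(String[] args) {
        String html =
            "this is text1 <a href='mkyong.com' target='_blank'>hello</a> this is text2...";
        String tagValue = extractTagValue(html);
        System.out.println(tagValue);              // href='mkyong.com' target='_blank'
        System.out.println(extractLink(tagValue)); // mkyong.com
    }
}
```

Note that the captured value starts after the `a` of the tag name, so the sketch trims the leading space before printing it.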

1. Regular Expression Pattern

Regular expression pattern to extract the `a` tag

(?i)<a([^>]+)>(.+?)</a>

Regular expression pattern to extract the link from the `a` tag

\s*(?i)href\s*=\s*("([^"]*")|'[^']*'|([^'">\s]+))

Explanation

(?i)        # inline flag, not a capturing group: all matching is case-insensitive
<a          # start with "<a"
  (         #   start of group #1
    [^>]+   #     anything except ">", at least one character
  )         #   end of group #1
>           # followed by ">"
  (.+?)     #   group #2: match anything, as few characters as possible (the link text)
</a>        # end with "</a>"
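
A quick way to check how Java numbers the groups in the tag pattern is to ask the compiled `Pattern` directly. The sketch below is illustrative only; the class name `TagPatternGroups` and its helper methods are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagPatternGroups {

    static final Pattern TAG = Pattern.compile("(?i)<a([^>]+)>(.+?)</a>");

    // Returns {group 1, group 2} for the first anchor found, or an empty array.
    static String[] groups(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Counts how many separate anchors the pattern finds in the input.
    static int countMatches(String html) {
        Matcher m = TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The inline flag (?i) is not counted as a capturing group.
        System.out.println(TAG.matcher("").groupCount()); // 2

        String[] g = groups("<A HREF='x.html'>Go</A>");
        System.out.println("[" + g[0] + "]"); // [ HREF='x.html']
        System.out.println("[" + g[1] + "]"); // [Go]

        // The reluctant (.+?) stops at the first "</a>", so two anchors
        // on one line are reported as two matches rather than one.
        System.out.println(countMatches("<a href='a'>one</a> and <a href='b'>two</a>")); // 2
    }
}
```

Group 1 holds the attributes and group 2 holds the link text, which matches how the extractor reads `group(1)` and `group(2)`.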

\s*                   # can start with whitespace
  (?i)                # all matching is case-insensitive
    href              #   followed by the word "href"
      \s*=\s*         #     allow spaces on either side of the equals sign
        (             #       start of group #1
          "([^"]*")   #         a string enclosed in double quotes - "string"
          |           #         ..or
          '[^']*'     #         a string enclosed in single quotes - 'string'
          |           #         ..or
          ([^'">\s]+) #         an unquoted value: no single quote, double quote, ">" or whitespace
        )             #       end of group #1
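
To see what group #1 of the href pattern captures for each quoting style, the following sketch runs it against three attribute strings. The class name `HrefPatternDemo`, the method `rawHref`, and the sample URLs are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefPatternDemo {

    static final Pattern HREF = Pattern.compile(
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");

    // Returns group 1 exactly as matched, surrounding quotes included.
    static String rawHref(String attributes) {
        Matcher m = HREF.matcher(attributes);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            " href=\"http://a.com\" target='_blank'", // double-quoted value
            " HREF = 'http://a.com'",                 // single-quoted, spaces around "="
            " target='_blank' href=http://a.com"      // unquoted value
        };
        for (String s : samples) {
            System.out.println(rawHref(s));
        }
        // Prints:
        // "http://a.com"
        // 'http://a.com'
        // http://a.com
    }
}
```

Because group #1 keeps the surrounding quotes for the first two forms, the caller still has to strip them to obtain the bare link.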

Here is a simple Java link extractor example. It uses the first pattern to extract the value of each `a` tag, then applies the second pattern to that value to extract the link.

HTMLLinkExtractor.java

package com.mkyong.crawler.core;

import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLLinkExtractor {

    private Pattern patternTag, patternLink;
    private Matcher matcherTag, matcherLink;

    private static final String HTML_A_TAG_PATTERN = "(?i)<a([^>]+)>(.+?)</a>";
    private static final String HTML_A_HREF_TAG_PATTERN =
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))";


    public HTMLLinkExtractor() {
        patternTag = Pattern.compile(HTML_A_TAG_PATTERN);
        patternLink = Pattern.compile(HTML_A_HREF_TAG_PATTERN);
    }

    /**
     * Extract links from HTML with regular expressions
     *
     * @param html
     *            html content to scan for links
     * @return Vector of links and link text
     */
    public Vector<HtmlLink> grabHTMLLinks(final String html) {

        Vector<HtmlLink> result = new Vector<HtmlLink>();

        matcherTag = patternTag.matcher(html);

        while (matcherTag.find()) {

            String href = matcherTag.group(1); // attributes of the a tag, including href
            String linkText = matcherTag.group(2);//link text

            matcherLink = patternLink.matcher(href);

            while (matcherLink.find()) {

                String link = matcherLink.group(1);//link
                HtmlLink obj = new HtmlLink();
                obj.setLink(link);
                obj.setLinkText(linkText);

                result.add(obj);

            }

        }

        return result;

    }

    class HtmlLink {

        String link;
        String linkText;

        HtmlLink(){};

        @Override
        public String toString() {
            return new StringBuffer("Link : ").append(this.link)
            .append(" Link Text : ").append(this.linkText).toString();
        }

        public String getLink() {
            return link;
        }

        public void setLink(String link) {
            this.link = replaceInvalidChar(link);
        }

        public String getLinkText() {
            return linkText;
        }

        public void setLinkText(String linkText) {
            this.linkText = linkText;
        }

        private String replaceInvalidChar(String link){
            link = link.replaceAll("'", "");
            link = link.replaceAll("\"", "");
            return link;
        }

    }
}
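
For comparison, here is a compact variant of the same idea. It is a sketch rather than a replacement for the class above: it swaps `Vector` for `ArrayList`, keeps the `Matcher` objects local to the method so no matching state is shared between calls, and returns plain `{link, linkText}` string pairs. The class name `CompactLinkExtractor` and the sample HTML are assumptions for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompactLinkExtractor {

    private static final Pattern TAG = Pattern.compile("(?i)<a([^>]+)>(.+?)</a>");
    private static final Pattern HREF = Pattern.compile(
        "\\s*(?i)href\\s*=\\s*(\"([^\"]*\")|'[^']*'|([^'\">\\s]+))");

    // Returns one {link, linkText} pair for every href found in the HTML.
    static List<String[]> grabLinks(String html) {
        List<String[]> result = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        while (tag.find()) {
            // group(1) holds the tag attributes, group(2) the link text.
            Matcher href = HREF.matcher(tag.group(1));
            while (href.find()) {
                // Strip the quotes that group(1) of the href pattern keeps.
                String link = href.group(1).replace("'", "").replace("\"", "");
                result.add(new String[] { link, tag.group(2) });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href='http://www.google.com'>google</a> and "
            + "<A target='_blank' HREF=\"http://testng.org\">TestNG</A>";
        for (String[] pair : grabLinks(html)) {
            System.out.println("Link : " + pair[0] + " Link Text : " + pair[1]);
        }
        // Prints:
        // Link : http://www.google.com Link Text : google
        // Link : http://testng.org Link Text : TestNG
    }
}
```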

Unit Test

A unit test with TestNG. The HTML content is simulated via `@DataProvider`.

TestHTMLLinkExtractor.java

package com.mkyong.crawler.core;

import java.util.Vector;

import org.testng.Assert;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import com.mkyong.crawler.core.HTMLLinkExtractor.HtmlLink;
/**
 * HTML link extractor testing
 *
 * @author mkyong
 *
 */
public class TestHTMLLinkExtractor {

    private HTMLLinkExtractor htmlLinkExtractor;
    String TEST_LINK = "http://www.google.com";

    @BeforeClass
    public void initData() {
        htmlLinkExtractor = new HTMLLinkExtractor();
    }

    @DataProvider
    public Object[][] HTMLContentProvider() {
      return new Object[][]{
        new Object[]{ "abc hahaha <a href='" + TEST_LINK + "'>google</a>" },
        new Object[]{ "abc hahaha <a HREF='" + TEST_LINK + "'>google</a>" },

        new Object[]{ "abc hahaha <A HREF='" + TEST_LINK + "'>google</A> , "
        + "abc hahaha <A HREF='" + TEST_LINK + "' target='_blank'>google</A>" },

        new Object[]{ "abc hahaha <A HREF='" + TEST_LINK + "' target='_blank'>google</A>" },
        new Object[]{ "abc hahaha <A target='_blank' HREF='" + TEST_LINK + "'>google</A>" },
        new Object[]{ "abc hahaha <A target='_blank' HREF=\"" + TEST_LINK + "\">google</A>" },
        new Object[]{ "abc hahaha <a HREF=" + TEST_LINK + ">google</a>" }, };
    }

    @Test(dataProvider = "HTMLContentProvider")
    public void ValidHTMLLinkTest(String html) {

        Vector<HtmlLink> links = htmlLinkExtractor.grabHTMLLinks(html);

        // at least one link must have been extracted
        Assert.assertTrue(links.size() != 0);

        for (int i = 0; i < links.size(); i++) {
            HtmlLink htmlLinks = links.get(i);
            // System.out.println(htmlLinks);
            Assert.assertEquals(htmlLinks.getLink(), TEST_LINK);
        }

    }
}


Result

[TestNG] Running:
  /private/var/folders/w8/jxyz5pf51lz7nmqm_hv5z5br0000gn/T/testng-eclipse--530204890/testng-customsuite.xml

PASSED: ValidHTMLLinkTest("abc hahaha <a href='http://www.google.com'>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF='http://www.google.com'>google</a>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='http://www.google.com'>google</A> , abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A HREF='http://www.google.com' target='_blank'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF='http://www.google.com'>google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <A target='_blank' HREF="http://www.google.com">google</A>")
PASSED: ValidHTMLLinkTest("abc hahaha <a HREF=http://www.google.com>google</a>")

