Guide to Lucene Analyzers

Published: 2019-10-18 2022-10-30
Tags: Data, Lucene, NoSQL

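1. Overview

Lucene analyzers are used to analyze text while indexing and searching documents.

We mentioned analyzers briefly in our introductory tutorial.

In this tutorial, we'll discuss commonly used analyzers, how to build a custom analyzer, and how to assign different analyzers to different document fields.

2. Maven Dependencies

First, we need to add these dependencies to our pom.xml: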

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>7.4.0</version>
</dependency>

The latest Lucene version can be found on Maven Central.

3. Lucene Analyzer

Lucene analyzers split text into tokens.

An analyzer mainly consists of a tokenizer and filters. Different analyzers use different combinations of tokenizers and filters.
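To make the tokenizer-plus-filters idea concrete, here is a minimal plain-Java sketch. It does not use the Lucene API: the class `PipelineSketch`, its methods, and the short stop-word list are all hypothetical stand-ins, chosen only to show how a tokenizer feeds a chain of filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Plain-Java illustration of the tokenizer + filter idea; this is NOT Lucene code.
final class PipelineSketch {

    // Hypothetical stop-word list, much shorter than a real English stop set.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("this", "is", "a", "an", "the", "to", "of"));

    // "Tokenizer": split the text wherever a non-letter character occurs.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // "Filter": lowercase every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // "Filter": drop every token that appears in the stop-word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": one tokenizer followed by the given filters, applied in order.
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<List<String>>... filters) {
        List<String> tokens = letterTokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This is baeldung.com Lucene Analyzers test";
        // prints [this, is, baeldung, com, lucene, analyzers, test]
        System.out.println(analyze(text, PipelineSketch::lowerCase));
        // prints [baeldung, com, lucene, analyzers, test]
        System.out.println(analyze(text, PipelineSketch::lowerCase, PipelineSketch::removeStopWords));
    }
}
```

Swapping the tokenizer or changing the filter list changes the output, which is the only difference between many of the analyzers that ship with Lucene.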

To demonstrate the difference between commonly used analyzers, we'll use the following method:

public List<String> analyze(String text, Analyzer analyzer) throws IOException {
    List<String> result = new ArrayList<>();
    // try-with-resources closes the stream so the analyzer can be reused afterwards
    try (TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, text)) {
        CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            result.add(attr.toString());
        }
        // signal end of stream, as the TokenStream workflow requires
        tokenStream.end();
    }
    return result;
}

This method converts the given text into a list of tokens using the given analyzer.

4. Common Lucene Analyzers

Now, let's have a look at some commonly used Lucene analyzers.

4.1. StandardAnalyzer

We'll start with the StandardAnalyzer, which is the most commonly used analyzer:

private static final String SAMPLE_TEXT
  = "This is baeldung.com Lucene Analyzers test";

@Test
public void whenUseStandardAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new StandardAnalyzer());

    assertThat(result, 
      contains("baeldung.com", "lucene", "analyzers","test"));
}

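Note that the StandardAnalyzer keeps baeldung.com as a single token: its tokenizer follows the Unicode text segmentation rules (UAX #29), which don't break a word at a period between letters. For full URL and email recognition, Lucene provides UAX29URLEmailAnalyzer.

With the 7.4.0 version used here, the StandardAnalyzer also removes English stop words and lowercases the generated tokens.

4.2. StopAnalyzer

The StopAnalyzer consists of LetterTokenizer, LowerCaseFilter, and StopFilter: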

@Test
public void whenUseStopAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new StopAnalyzer());

    assertThat(result, 
      contains("baeldung", "com", "lucene", "analyzers", "test"));
}

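In this example, the LetterTokenizer splits the text at non-letter characters, while the StopFilter removes stop words from the token list.

However, unlike the StandardAnalyzer, the StopAnalyzer splits baeldung.com into baeldung and com, because the period is not a letter.

4.3. SimpleAnalyzer

The SimpleAnalyzer consists of LetterTokenizer and a LowerCaseFilter: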

@Test
public void whenUseSimpleAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new SimpleAnalyzer());

    assertThat(result, 
      contains("this", "is", "baeldung", "com", "lucene", "analyzers", "test"));
}

Here, the SimpleAnalyzer didn't remove stop words. It also splits baeldung.com at the period.

4.4. WhitespaceAnalyzer

The WhitespaceAnalyzer uses only a WhitespaceTokenizer, which splits the text at whitespace characters:

@Test
public void whenUseWhiteSpaceAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new WhitespaceAnalyzer());

    assertThat(result, 
      contains("This", "is", "baeldung.com", "Lucene", "Analyzers", "test"));
}

4.5. KeywordAnalyzer

The KeywordAnalyzer tokenizes the input into a single token:

@Test
public void whenUseKeywordAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new KeywordAnalyzer());

    assertThat(result, contains("This is baeldung.com Lucene Analyzers test"));
}

The KeywordAnalyzer is useful for fields like IDs and zip codes.

4.6. Language Analyzers

There are also special analyzers for different languages, such as EnglishAnalyzer, FrenchAnalyzer, and SpanishAnalyzer:

@Test
public void whenUseEnglishAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new EnglishAnalyzer());

    assertThat(result, contains("baeldung.com", "lucen", "analyz", "test"));
}

Here, we're using the EnglishAnalyzer, which consists of StandardTokenizer, StandardFilter, EnglishPossessiveFilter, LowerCaseFilter, StopFilter, and PorterStemFilter.
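Most filters in this chain appear elsewhere in the article; the possessive filter is the exception. As a rough plain-Java illustration of what such a step does, the sketch below strips a trailing 's from a token. The class `PossessiveSketch` is hypothetical and is not Lucene's implementation; it only assumes that possessive handling amounts to removing an apostrophe followed by s at the end of a token.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a possessive-stripping filter; this is NOT Lucene code.
final class PossessiveSketch {

    // Remove a trailing 's or 'S (ASCII or typographic apostrophe) from a token.
    static String stripPossessive(String token) {
        int n = token.length();
        if (n >= 2) {
            char apostrophe = token.charAt(n - 2);
            char last = token.charAt(n - 1);
            boolean isApostrophe = apostrophe == '\'' || apostrophe == '\u2019';
            if (isApostrophe && (last == 's' || last == 'S')) {
                return token.substring(0, n - 2);
            }
        }
        return token;
    }

    // Apply the same rule to every token in a list.
    static List<String> stripPossessives(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(stripPossessive(token));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(stripPossessive("Lucene's"));   // prints Lucene
        System.out.println(stripPossessive("analyzers"));  // prints analyzers
    }
}
```

Running this step before lowercasing and stemming means that a possessive form and its base word end up as the same indexed term.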

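5. Custom Analyzer

Next, let's see how to build our custom analyzer. We'll build the same custom analyzer in two different ways.

In the first example, we'll use the CustomAnalyzer builder to construct our analyzer from predefined tokenizers and filters: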

@Test
public void whenUseCustomAnalyzerBuilder_thenAnalyzed() throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
      .withTokenizer("standard")
      .addTokenFilter("lowercase")
      .addTokenFilter("stop")
      .addTokenFilter("porterstem")
      .addTokenFilter("capitalization")
      .build();
    List<String> result = analyze(SAMPLE_TEXT, analyzer);

    assertThat(result, contains("Baeldung.com", "Lucen", "Analyz", "Test"));
}

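Our analyzer is very similar to the EnglishAnalyzer, but it capitalizes the tokens instead.

In the second example, we'll build the same analyzer by extending the Analyzer abstract class and overriding the createComponents() method: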

public class MyCustomAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        TokenStream result = new StandardFilter(src);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result,  StandardAnalyzer.STOP_WORDS_SET);
        result = new PorterStemFilter(result);
        result = new CapitalizationFilter(result);
        return new TokenStreamComponents(src, result);
    }
}

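We can also create our own tokenizer or filter and add it to our custom analyzer if needed.

Now, let's see our custom analyzer in action; we'll use InMemoryLuceneIndex in this example. Since a TermQuery is not analyzed, we search for Introduct, the stemmed and capitalized form in which the analyzer indexes the word introduction: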

@Test
public void givenTermQuery_whenUseCustomAnalyzer_thenCorrect() {
    InMemoryLuceneIndex luceneIndex = new InMemoryLuceneIndex(
      new RAMDirectory(), new MyCustomAnalyzer());
    luceneIndex.indexDocument("introduction", "introduction to lucene");
    luceneIndex.indexDocument("analyzers", "guide to lucene analyzers");
    Query query = new TermQuery(new Term("body", "Introduct"));

    List<Document> documents = luceneIndex.searchIndex(query);
    assertEquals(1, documents.size());
}

6. PerFieldAnalyzerWrapper

Finally, we can assign different analyzers to different fields using PerFieldAnalyzerWrapper.

First, we need to define our analyzerMap to map each analyzer to a specific field:

Map<String,Analyzer> analyzerMap = new HashMap<>();
analyzerMap.put("title", new MyCustomAnalyzer());
analyzerMap.put("body", new EnglishAnalyzer());

We mapped the "title" field to our custom analyzer and the "body" field to the EnglishAnalyzer.

Next, let's create our PerFieldAnalyzerWrapper by providing the analyzerMap and a default Analyzer:

PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
  new StandardAnalyzer(), analyzerMap);

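Now, let's test it. The body field is indexed by the EnglishAnalyzer, so its term is the lowercase stem introduct, while the title field goes through our custom analyzer and yields Introduct: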

@Test
public void givenTermQuery_whenUsePerFieldAnalyzerWrapper_thenCorrect() {
    InMemoryLuceneIndex luceneIndex = new InMemoryLuceneIndex(new RAMDirectory(), wrapper);
    luceneIndex.indexDocument("introduction", "introduction to lucene");
    luceneIndex.indexDocument("analyzers", "guide to lucene analyzers");
    
    Query query = new TermQuery(new Term("body", "introduct"));
    List<Document> documents = luceneIndex.searchIndex(query);
    assertEquals(1, documents.size());
    
    query = new TermQuery(new Term("title", "Introduct"));
    documents = luceneIndex.searchIndex(query);
    assertEquals(1, documents.size());
}
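The selection logic behind a per-field wrapper can be sketched in a few lines of plain Java. This is an illustration only, not Lucene code: `PerFieldSketch` and its three toy "analyzers" are hypothetical, and the sketch assumes nothing more than a lookup by field name with a fallback to the default.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Plain-Java illustration of per-field analyzer selection; this is NOT Lucene code.
final class PerFieldSketch {

    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> fieldAnalyzers;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> fieldAnalyzers) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.fieldAnalyzers = new HashMap<>(fieldAnalyzers);
    }

    // Use the analyzer registered for the field, or fall back to the default one.
    List<String> analyze(String fieldName, String text) {
        return fieldAnalyzers.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    // Hypothetical stand-ins for real analyzers.
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    static List<String> lowerCaseWhitespace(String text) {
        return whitespace(text.toLowerCase(Locale.ROOT));
    }

    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Example wiring: "title" and "body" get their own analyzers, everything else the default.
    static PerFieldSketch demo() {
        Map<String, Function<String, List<String>>> map = new HashMap<>();
        map.put("title", PerFieldSketch::keyword);
        map.put("body", PerFieldSketch::lowerCaseWhitespace);
        return new PerFieldSketch(PerFieldSketch::whitespace, map);
    }

    public static void main(String[] args) {
        PerFieldSketch sketch = demo();
        System.out.println(sketch.analyze("title", "Guide to Lucene")); // prints [Guide to Lucene]
        System.out.println(sketch.analyze("body", "Guide to Lucene"));  // prints [guide, to, lucene]
        System.out.println(sketch.analyze("tags", "Data Lucene"));      // prints [Data, Lucene]
    }
}
```

The same text therefore produces different terms depending on the field, which is why a query has to use the term form of the field it targets.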

7. Conclusion

We discussed popular Lucene analyzers, how to build a custom analyzer, and how to use different analyzers per field.

The full source code can be found on GitHub.
