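Removing Stopwords from a String in Java

1. Overview

In this tutorial, we'll discuss different ways to remove stopwords from a String in Java. This is a useful operation when we want to strip unwanted or disallowed words from a text, such as comments or reviews added by users of an online site.

We'll use a simple loop, Collection.removeAll(), and regular expressions.

Finally, we'll compare their performance using the Java Microbenchmark Harness (JMH).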

2. Loading Stopwords

First, we'll load our stopwords from a text file.

Here we have the file english_stopwords.txt, which contains a list of words we consider stopwords, such as I, he, she, and the.

We'll load the stopwords into a List of String using Files.readAllLines():

@BeforeClass
public static void loadStopwords() throws IOException {
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
}
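
As a rough, self-contained sketch of the loading step, the class below writes a tiny stopword file to a temporary location and reads it back with Files.readAllLines(). The class name StopwordLoaderSketch, the helper demo(), and the four sample words are illustrative assumptions, not part of the original test class. Trimming and skipping blank lines is an extra precaution added here, in case the file contains stray whitespace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StopwordLoaderSketch {

    // Reads one stopword per line, trimming whitespace and skipping blank lines.
    static List<String> load(Path file) {
        try {
            return Files.readAllLines(file).stream()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical demo: writes a small sample file, loads it, then cleans up.
    static List<String> demo() {
        try {
            Path tmp = Files.createTempFile("stopwords", ".txt");
            try {
                Files.write(tmp, Arrays.asList("i", "he", "", " she ", "the"));
                return load(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [i, he, she, the]
    }
}
```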

3. Removing Stopwords Manually

For our first solution, we'll remove stopwords manually by iterating over each word and checking whether it's a stopword:

@Test
public void whenRemoveStopwordsManually_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog"; 
    String target = "quick brown fox jumps lazy dog";
    String[] allWords = original.toLowerCase().split(" ");

    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    
    String result = builder.toString().trim();
    assertEquals(target, result);
}
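
The same loop can be sketched as a standalone method that doesn't depend on the stopword file. The class name ManualStopwordRemoval and the small inline stopword set are assumptions made for illustration. This version also differs from the test above in two ways: it uses a HashSet for lookups and splits on runs of whitespace, so repeated spaces don't produce empty words.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ManualStopwordRemoval {

    // Hypothetical inline stopwords; the article loads english_stopwords.txt instead.
    static final Set<String> STOPWORDS =
        new HashSet<>(Arrays.asList("i", "he", "she", "the", "over"));

    static String removeStopwords(String text) {
        StringBuilder builder = new StringBuilder();
        for (String word : text.toLowerCase().split("\\s+")) {
            // Skip empty tokens (from leading whitespace) and stopwords.
            if (word.isEmpty() || STOPWORDS.contains(word)) {
                continue;
            }
            // Add a separator only between kept words, so no trim() is needed.
            if (builder.length() > 0) {
                builder.append(' ');
            }
            builder.append(word);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // prints: quick brown fox jumps lazy dog
        System.out.println(removeStopwords("The quick brown fox jumps over the lazy dog"));
    }
}
```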

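4. Using Collection.removeAll()

Next, instead of iterating over each word in our String, we can use Collection.removeAll() to remove all stopwords at once. The test reuses the original and target strings from the previous example: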

@Test
public void whenRemoveStopwordsUsingRemoveAll_thenSuccess() {
    ArrayList<String> allWords = 
      Stream.of(original.toLowerCase().split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);

    String result = allWords.stream().collect(Collectors.joining(" "));
    assertEquals(target, result);
}

In this example, after splitting our String into an array of words, we convert it into an ArrayList so that we can apply the removeAll() method.
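
To make the conversion step concrete, here is a minimal standalone sketch. The class name RemoveAllSketch and the inline stopword list are hypothetical. It copies the result of Arrays.asList() into a new ArrayList because the list returned by Arrays.asList() is fixed-size and would throw UnsupportedOperationException when removeAll() tries to delete an element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveAllSketch {

    // Hypothetical inline stopwords; the article loads english_stopwords.txt instead.
    static final List<String> STOPWORDS = Arrays.asList("i", "he", "she", "the", "over");

    static String removeStopwords(String text) {
        // Copy into a resizable ArrayList so that elements can be removed.
        List<String> words = new ArrayList<>(Arrays.asList(text.toLowerCase().split(" ")));
        // Removes every element that is contained in STOPWORDS.
        words.removeAll(STOPWORDS);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // prints: quick brown fox jumps lazy dog
        System.out.println(removeStopwords("The quick brown fox jumps over the lazy dog"));
    }
}
```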

5. Using Regular Expressions

Finally, we can create a regular expression from our stopwords list, then use it to replace the stopwords in our String:

@Test
public void whenRemoveStopwordsUsingRegex_thenSuccess() {
    String stopwordsRegex = stopwords.stream()
      .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));

    String result = original.toLowerCase().replaceAll(stopwordsRegex, "");
    assertEquals(target, result);
}

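The resulting stopwordsRegex will have the format "\\b(he|she|the|…)\\b\\s?". In this regex, "\b" refers to a word boundary, which avoids replacing the "he" in "heat", for example. The "\s?" refers to zero or one whitespace character, which removes the extra space left behind after a stopword is replaced.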
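
The effect of the word boundary can be checked with a small standalone sketch. The class name RegexStopwordSketch and the inline stopword list are assumptions for illustration. Two details go beyond the test above: the pattern is compiled once and reused, and each word is passed through Pattern.quote() in case a stopword contains regex metacharacters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexStopwordSketch {

    // Hypothetical inline stopwords; the article loads english_stopwords.txt instead.
    static final List<String> STOPWORDS = Arrays.asList("he", "she", "the", "over");

    // Builds \b(he|she|the|over)\b\s? with each word quoted as a literal.
    static final Pattern STOPWORD_PATTERN = Pattern.compile(
        STOPWORDS.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|", "\\b(", ")\\b\\s?")));

    static String removeStopwords(String text) {
        return STOPWORD_PATTERN.matcher(text.toLowerCase()).replaceAll("");
    }

    public static void main(String[] args) {
        // "he" and "the" are removed, but the "he" inside "heat" is kept.
        System.out.println(removeStopwords("He likes the heat")); // prints: likes heat
    }
}
```

Note that a stopword at the very end of the input leaves the preceding space in place, so a final trim() may be needed in that case.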

6. Performance Comparison

Now, let's see which method has the best performance.

First, let's set up our benchmark. We'll use a fairly large text file, shakespeare-hamlet.txt, as the source of our String:

@Setup
public void setup() throws IOException {
    data = new String(Files.readAllBytes(Paths.get("shakespeare-hamlet.txt")));
    data = data.toLowerCase();
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
    stopwordsRegex = stopwords.stream().collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
}

Then we'll add our benchmark methods, starting with removeManually():

@Benchmark
public String removeManually() {
    String[] allWords = data.split(" ");
    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    return builder.toString().trim();
}

Next, we have the removeAll() benchmark:

@Benchmark
public String removeAll() {
    ArrayList<String> allWords = 
      Stream.of(data.split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);
    return allWords.stream().collect(Collectors.joining(" "));
}

Finally, we'll add the benchmark for replaceRegex():

@Benchmark
public String replaceRegex() {
    return data.replaceAll(stopwordsRegex, "");
}

And here are the results of our benchmark:

Benchmark                           Mode  Cnt   Score    Error  Units
removeAll                           avgt   60   7.782 ±  0.076  ms/op
removeManually                      avgt   60   8.186 ±  0.348  ms/op
replaceRegex                        avgt   60  42.035 ±  1.098  ms/op

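It seems that using Collection.removeAll() gives the fastest execution time, with the manual loop close behind, while the regular expression approach is the slowest, at roughly five times the cost of either.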