There are many kinds of analyzers, and Lucene ships with several implementations out of the box. Running the same content through different analyzers produces different indexes, and therefore different search results, so choosing a suitable analyzer is critical for full-text search. For Chinese text you can use lucene-analyzers-smartcn, or implement your own analyzer to fit your needs.
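
On that last point, a custom analyzer is just an Analyzer subclass that wires a Tokenizer to optional TokenFilters. The following is only a minimal sketch; the class name MyAnalyzer and the choice of StandardTokenizer plus LowerCaseFilter are illustrative assumptions, not anything from smartcn.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Hypothetical custom analyzer: split with StandardTokenizer, then lower-case every token.
public class MyAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream filtered = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filtered);
    }
}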

If this article is hard to follow on a first read, skim it now and come back after the next few posts. Unless otherwise noted, the later articles all use the dependencies below.

1. Basic usage of the lucene-analyzers-smartcn analyzer

First, the basic Lucene dependencies:

        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>8.7.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>8.7.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>8.7.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-highlighter -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>8.7.0</version>
        </dependency>

The dependency below provides the Chinese analyzer. The dependencies above are generally always needed, while this one can be swapped for a different analyzer as appropriate.

        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-smartcn</artifactId>
            <version>8.7.0</version>
        </dependency>

Test code 1:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class Test {
    public static void main(String[] args) {
        try {
            String text = "通过docker容器部署halo博客系统,方便快捷,全程只要1分钟";
            Analyzer analyzer = new SmartChineseAnalyzer();
            TokenStream tokenStream = analyzer.tokenStream("testField", text);
            // The attribute instance returned here is backed by the stream's packed
            // attribute implementation, whose toString() prints the current term text.
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            tokenStream.reset();
            List<String> tokens = new ArrayList<>();
            while (tokenStream.incrementToken()) {
                tokens.add(offsetAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
            System.out.println(tokens);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The output is:

[通过, docker, 容器, 部署, halo, 博, 客, 系统, 方便, 快捷, 全程, 只要, 1, 分钟]
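
If you would rather not rely on that toString() behavior, the more explicit way is to read the term text through CharTermAttribute. A minimal variant of the test above (the class name TermTextTest is just for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TermTextTest {
    public static void main(String[] args) throws IOException {
        String text = "通过docker容器部署halo博客系统,方便快捷,全程只要1分钟";
        try (Analyzer analyzer = new SmartChineseAnalyzer();
             TokenStream ts = analyzer.tokenStream("testField", text)) {
            CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            List<String> tokens = new ArrayList<>();
            while (ts.incrementToken()) {
                tokens.add(termAttr.toString()); // term text of the current token
            }
            ts.end();
            System.out.println(tokens);
        }
    }
}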

Test code 2:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class TestAnalyzer {

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("name", "护眼带光源");
        ts.reset();
        while (ts.incrementToken()) {
            // Dump every attribute of the current token (term, offsets, position, type, ...).
            System.out.println(ts.reflectAsString(false));
        }
        ts.end();
        ts.close();
    }
}

The output is:

term=护,bytes=[e6 8a a4],startOffset=0,endOffset=1,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
term=眼,bytes=[e7 9c bc],startOffset=1,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
term=带,bytes=[e5 b8 a6],startOffset=2,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
term=光源,bytes=[e5 85 89 e6 ba 90],startOffset=3,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false

2. Customizing the analyzer's stopwords

Stopwords are the words an analyzer filters out during tokenization. Because they never make it into the index, they cannot be matched by a search.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.IOUtils;

public class Test {
    public static void main(String[] args) {
        try {
            String text = "通过docker容器部署halo博客系统,方便快捷,全程只要1分钟";
            // Load stopwords from /stopwords.txt on the classpath; lines starting with "//" are comments.
            CharArraySet stopWords = CharArraySet.unmodifiableSet(WordlistLoader.getWordSet(IOUtils.getDecodingReader(
                    Test.class, "/stopwords.txt", StandardCharsets.UTF_8), "//"));
            Analyzer analyzer = new SmartChineseAnalyzer(stopWords);
            TokenStream tokenStream = analyzer.tokenStream("testField", text);
            OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
            tokenStream.reset();
            List<String> tokens = new ArrayList<>();
            while (tokenStream.incrementToken()) {
                tokens.add(offsetAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
            System.out.println(tokens);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Then create a stopwords.txt file under the resources directory and put your stopwords in it, one per line.
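
For illustration, a stopwords.txt containing only the entry below would produce the result that follows; the single entry 通过 is an assumption chosen to match that output, and lines starting with // are treated as comments because "//" is the comment marker passed to getWordSet.

// project-specific stopwords, one per line
通过
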
The output is:

[docker, 容器, 部署, halo, 博, 客, 系统, 方便, 快捷, 全程, 只要, 1, 分钟]

3. A look at the analyzer's stopword-loading source

By default, SmartChineseAnalyzer loads its stopwords as follows:

  private static final String DEFAULT_STOPWORD_FILE = "stopwords.txt";

  private static final String STOPWORD_FILE_COMMENT = "//";

  static CharArraySet loadDefaultStopWordSet() throws IOException {
    // make sure it is unmodifiable as we expose it in the outer class
    return CharArraySet.unmodifiableSet(WordlistLoader.getWordSet(IOUtils
        .getDecodingReader(SmartChineseAnalyzer.class, DEFAULT_STOPWORD_FILE,
            StandardCharsets.UTF_8), STOPWORD_FILE_COMMENT));
  }

IOUtils.getDecodingReader then opens the file as a classpath resource and wraps it in a Reader for the given charset. Note that getResourceAsStream is called with a name that has no leading slash, so the default stopwords.txt is resolved relative to SmartChineseAnalyzer's package inside the smartcn jar, whereas in section 2 we passed "/stopwords.txt" to load a file from the root of our own classpath.

  public static Reader getDecodingReader(Class<?> clazz, String resource, Charset charSet) throws IOException {
    InputStream stream = null;
    boolean success = false;
    try {
      stream = clazz.getResourceAsStream(resource);
      final Reader reader = getDecodingReader(stream, charSet);
      success = true;
      return reader;
    } finally {
      if (!success) {
        IOUtils.close(stream);
      }
    }
  }
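
If you want to keep the built-in stopwords and only add your own on top of them, one option is to copy the default set exposed by SmartChineseAnalyzer.getDefaultStopSet() into a modifiable CharArraySet. A minimal sketch follows; the class name and the extra entry 通过 are purely illustrative assumptions.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class ExtendDefaultStopwords {
    public static void main(String[] args) {
        // Start from a modifiable copy of the built-in stopword set...
        CharArraySet stopWords = CharArraySet.copy(SmartChineseAnalyzer.getDefaultStopSet());
        // ...and add our own entries (this one is purely illustrative).
        stopWords.add("通过");
        Analyzer analyzer = new SmartChineseAnalyzer(stopWords);
        System.out.println("stopword count: " + stopWords.size());
        analyzer.close();
    }
}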
