There are many kinds of analyzers (tokenizers), and Lucene ships with several implementations out of the box. Running the same content through different analyzers produces a different index, and therefore different search results, so choosing a suitable analyzer is critical for full-text search. For Chinese, you can use lucene-analyzers-smartcn, or implement an analyzer yourself to fit your own needs (a minimal sketch of a custom analyzer appears at the end of this post).
If this post is hard to follow, skim it now and come back after reading the next few posts in this series. Unless otherwise noted, later posts all use the dependencies below.
1. Basic usage of the lucene-analyzers-smartcn analyzer
First, the basic Lucene dependencies:
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.7.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>8.7.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.7.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-highlighter -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>8.7.0</version>
</dependency>
Next is the dependency that provides the Chinese analyzer. The dependencies above are generally always required, while this one can be swapped for a different analyzer depending on your situation.
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-smartcn</artifactId>
    <version>8.7.0</version>
</dependency>
Test code 1:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Test {
    public static void main(String[] args) {
        try {
            String text = "通过docker容器部署halo博客系统,方便快捷,全程只要1分钟";
            Analyzer analyzer = new SmartChineseAnalyzer();
            TokenStream tokenStream = analyzer.tokenStream("testField", text);
            // CharTermAttribute holds the text of the current token
            CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            List<String> tokens = new ArrayList<>();
            while (tokenStream.incrementToken()) {
                tokens.add(termAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
            System.out.println(tokens);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Output:
[通过, docker, 容器, 部署, halo, 博, 客, 系统, 方便, 快捷, 全程, 只要, 1, 分钟]
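For contrast, here is a minimal sketch (my own illustration, not part of this series' code) that runs the same text through Lucene's built-in StandardAnalyzer, which splits CJK text into single-character tokens rather than words:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardTest {
    public static void main(String[] args) throws IOException {
        String text = "通过docker容器部署halo博客系统,方便快捷,全程只要1分钟";
        Analyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("testField", text);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        List<String> tokens = new ArrayList<>();
        while (ts.incrementToken()) {
            tokens.add(term.toString());
        }
        ts.end();
        ts.close();
        // expect single-character tokens such as 容, 器 instead of 容器
        System.out.println(tokens);
    }
}

This is exactly the difference the introduction pointed at: the index you build, and therefore what a search can match, depends on the analyzer you choose.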
Test code 2:
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class TestAnalyzer {
    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SmartChineseAnalyzer();
        TokenStream ts = analyzer.tokenStream("name", "护眼带光源");
        ts.reset();
        while (ts.incrementToken()) {
            // reflectAsString dumps every attribute of the current token
            System.out.println(ts.reflectAsString(false));
        }
        ts.end();
        ts.close();
    }
}
Output (each line shows the token's term text, UTF-8 bytes, offsets, and position attributes):
term=护,bytes=[e6 8a a4],startOffset=0,endOffset=1,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
term=眼,bytes=[e7 9c bc],startOffset=1,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
term=带,bytes=[e5 b8 a6],startOffset=2,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
term=光源,bytes=[e5 85 89 e6 ba 90],startOffset=3,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false
2. Customizing the analyzer's stopwords
Stopwords are the words an analyzer filters out during tokenization. Because they never make it into the index, they cannot be matched by a search.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WordlistLoader;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.IOUtils;

public class Test {
    public static void main(String[] args) {
        try {
            String text = "通过docker容器部署halo博客系统,方便快捷,全程只要1分钟";
            // load the stopword list from the classpath; lines starting with "//" are comments
            CharArraySet stopWords = CharArraySet.unmodifiableSet(WordlistLoader.getWordSet(IOUtils.getDecodingReader(
                    Test.class, "/stopwords.txt", StandardCharsets.UTF_8), "//"));
            Analyzer analyzer = new SmartChineseAnalyzer(stopWords);
            TokenStream tokenStream = analyzer.tokenStream("testField", text);
            CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
            tokenStream.reset();
            List<String> tokens = new ArrayList<>();
            while (tokenStream.incrementToken()) {
                tokens.add(termAttribute.toString());
            }
            tokenStream.end();
            tokenStream.close();
            System.out.println(tokens);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Then create a stopwords.txt file in the resources directory and write your stopwords into it, one per line.
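For example, the file might look like this (the "//" argument passed to getWordSet above makes lines starting with // comments; 通过 is listed here, which matches its disappearance from the output below):

// custom stopwords, one per line
通过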
Output:
[docker, 容器, 部署, halo, 博, 客, 系统, 方便, 快捷, 全程, 只要, 1, 分钟]
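If an external file is more than you need, the stopword set can also be built directly in code. A minimal sketch, assuming the same single stopword 通过:

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class InlineStopwords {
    public static void main(String[] args) {
        // second argument: ignore case when matching stopwords
        CharArraySet stopWords = new CharArraySet(Arrays.asList("通过"), true);
        Analyzer analyzer = new SmartChineseAnalyzer(stopWords);
        // this analyzer filters 通过 just like the file-based version above
    }
}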
3. Source-code walkthrough: the analyzer and its stopwords
By default, SmartChineseAnalyzer loads its stopwords as shown below (the file ships inside the smartcn jar, resolved relative to SmartChineseAnalyzer's package):
private static final String DEFAULT_STOPWORD_FILE = "stopwords.txt";
private static final String STOPWORD_FILE_COMMENT = "//";

static CharArraySet loadDefaultStopWordSet() throws IOException {
    // make sure it is unmodifiable as we expose it in the outer class
    return CharArraySet.unmodifiableSet(WordlistLoader.getWordSet(IOUtils
        .getDecodingReader(SmartChineseAnalyzer.class, DEFAULT_STOPWORD_FILE,
            StandardCharsets.UTF_8), STOPWORD_FILE_COMMENT));
}
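Note that the default set is exposed as unmodifiable. If you want to keep the built-in stopwords and add your own on top, copy it first; a minimal sketch:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;

public class ExtendedStopwords {
    public static void main(String[] args) {
        // copy the unmodifiable built-in set, then extend the copy
        CharArraySet stopWords = CharArraySet.copy(SmartChineseAnalyzer.getDefaultStopSet());
        stopWords.add("通过");
        Analyzer analyzer = new SmartChineseAnalyzer(stopWords);
    }
}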
IOUtils.getDecodingReader, in turn, opens the resource from the classpath relative to the given class, and closes the stream if building the reader fails:

public static Reader getDecodingReader(Class<?> clazz, String resource, Charset charSet) throws IOException {
    InputStream stream = null;
    boolean success = false;
    try {
        stream = clazz.getResourceAsStream(resource);
        final Reader reader = getDecodingReader(stream, charSet);
        success = true;
        return reader;
    } finally {
        if (!success) {
            IOUtils.close(stream);
        }
    }
}
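Finally, as mentioned in the introduction, you can implement an analyzer yourself when none of the stock ones fit. Below is a minimal sketch (the class name and filter chain are my own illustration, not smartcn's internals) of plugging a tokenizer and a stopword filter into Lucene's Analyzer framework:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class MyAnalyzer extends Analyzer {
    private final CharArraySet stopWords;

    public MyAnalyzer(CharArraySet stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // the tokenizer splits raw text into tokens; the filters
        // then lowercase them and drop anything in the stopword set
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        result = new StopFilter(result, stopWords);
        return new TokenStreamComponents(source, result);
    }
}

StandardTokenizer is used here only for simplicity; for Chinese you would swap in a word-level tokenizer, which is exactly what smartcn provides.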