
Elasticsearch analyzers

xmjava · 2021-05-25

analyzer  

Analyzers are used in two situations:  
1. Index time analysis: when a document is created or updated, its fields are analyzed.
2. Search time analysis: when a query is executed, the query string is analyzed.

- Specify the analyzer at query time via the analyzer parameter

GET test_index/_search 
{ 
  "query": { 
    "match": { 
      "name": { 
        "query": "lin", 
        "analyzer": "standard" 
      } 
    } 
  } 
}

- Specify a search_analyzer when creating the index mapping

PUT test2 
{ 
  "mappings": { 
    "properties": { 
        "title":{ 
          "type": "text", 
          "analyzer": "whitespace", 
          "search_analyzer": "standard" 
        } 
      } 
  } 
}
# If no analyzer is specified, the default standard analyzer is used

Note:

  •  Decide explicitly whether each field needs to be analyzed. Setting type to keyword for fields that do not saves space and improves write performance (a minimal mapping sketch follows).
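
A minimal sketch of such a mapping (the index and field names are only illustrative):

PUT logs_index 
{ 
  "mappings": { 
    "properties": { 
        "status":{ 
          "type": "keyword"        # not analyzed; exact-match filtering only 
        }, 
        "message":{ 
          "type": "text",          # analyzed for full-text search 
          "analyzer": "standard" 
        } 
      } 
  } 
}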

_analyze API

GET _analyze 
{ 
  "analyzer": "standard", 
  "text": "this is a test" 
}
# Shows how the text is tokenized by the standard analyzer
{ 
  "tokens" : [ 
    { 
      "token" : "this", 
      "start_offset" : 0, 
      "end_offset" : 4, 
      "type" : "<ALPHANUM>", 
      "position" : 0 
    }, 
    { 
      "token" : "is", 
      "start_offset" : 5, 
      "end_offset" : 7, 
      "type" : "<ALPHANUM>", 
      "position" : 1 
    }, 
    { 
      "token" : "a", 
      "start_offset" : 8, 
      "end_offset" : 9, 
      "type" : "<ALPHANUM>", 
      "position" : 2 
    }, 
    { 
      "token" : "test", 
      "start_offset" : 10, 
      "end_offset" : 14, 
      "type" : "<ALPHANUM>", 
      "position" : 3 
    } 
  ] 
}

Configuring an analyzer

PUT test3 
{ 
  "settings": { 
    "analysis": {    
      "analyzer": {      
        "my_analyzer":{   
          "type":"standard",    
          "stopwords":"_english_" 
        } 
      } 
    } 
  }, 
  "mappings": { 
    "properties": { 
        "my_text":{ 
          "type": "text", 
          "analyzer": "standard", 
          "fields": { 
            "english":{ 
              "type": "text", 
              "analyzer": "my_analyzer" 
            } 
          } 
        } 
    } 
  } 
}

Test results:

POST test3/_analyze 
{ 
  "field": "my_text", 
  "text": ["The test message."] 
} 
 
{ 
  "tokens" : [ 
    { 
      "token" : "the", 
      "start_offset" : 0, 
      "end_offset" : 3, 
      "type" : "<ALPHANUM>", 
      "position" : 0 
    }, 
    { 
      "token" : "test", 
      "start_offset" : 4, 
      "end_offset" : 8, 
      "type" : "<ALPHANUM>", 
      "position" : 1 
    }, 
    { 
      "token" : "message", 
      "start_offset" : 9, 
      "end_offset" : 16, 
      "type" : "<ALPHANUM>", 
      "position" : 2 
    } 
  ] 
} 
 
 
POST test3/_analyze 
{ 
  "field": "my_text.english",  
  "text": ["The test message."] 
} 
{ 
  "tokens" : [ 
    { 
      "token" : "test", 
      "start_offset" : 4, 
      "end_offset" : 8, 
      "type" : "<ALPHANUM>", 
      "position" : 1 
    }, 
    { 
      "token" : "message", 
      "start_offset" : 9, 
      "end_offset" : 16, 
      "type" : "<ALPHANUM>", 
      "position" : 2 
    } 
  ] 
}

ES ships with many built-in analyzers, for example:

  • standard  composed of:
    • tokenizer: Standard Tokenizer
    • token filters: Standard Token Filter, Lower Case Token Filter, Stop Token Filter (stop words are disabled by default) 
      Testing with the _analyze API: 
      POST _analyze 
      { 
        "analyzer": "standard", 
        "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." 
      }

      Result:      

{ 
  "tokens" : [ 
    { 
      "token" : "the", 
      "start_offset" : 0, 
      "end_offset" : 3, 
      "type" : "<ALPHANUM>", 
      "position" : 0 
    }, 
    { 
      "token" : "2", 
      "start_offset" : 4, 
      "end_offset" : 5, 
      "type" : "<NUM>", 
      "position" : 1 
    }, 
    { 
      "token" : "quick", 
      "start_offset" : 6, 
      "end_offset" : 11, 
      "type" : "<ALPHANUM>", 
      "position" : 2 
    }, 
    { 
      "token" : "brown", 
      "start_offset" : 12, 
      "end_offset" : 17, 
      "type" : "<ALPHANUM>", 
      "position" : 3 
    }, 
    { 
      "token" : "foxes", 
      "start_offset" : 18, 
      "end_offset" : 23, 
      "type" : "<ALPHANUM>", 
      "position" : 4 
    }, 
    { 
      "token" : "jumped", 
      "start_offset" : 24, 
      "end_offset" : 30, 
      "type" : "<ALPHANUM>", 
      "position" : 5 
    }, 
    { 
      "token" : "over", 
      "start_offset" : 31, 
      "end_offset" : 35, 
      "type" : "<ALPHANUM>", 
      "position" : 6 
    }, 
    { 
      "token" : "the", 
      "start_offset" : 36, 
      "end_offset" : 39, 
      "type" : "<ALPHANUM>", 
      "position" : 7 
    }, 
    { 
      "token" : "lazy", 
      "start_offset" : 40, 
      "end_offset" : 44, 
      "type" : "<ALPHANUM>", 
      "position" : 8 
    }, 
    { 
      "token" : "dog's", 
      "start_offset" : 45, 
      "end_offset" : 50, 
      "type" : "<ALPHANUM>", 
      "position" : 9 
    }, 
    { 
      "token" : "bone", 
      "start_offset" : 51, 
      "end_offset" : 55, 
      "type" : "<ALPHANUM>", 
      "position" : 10 
    } 
  ] 
}

 

  • whitespace  splits on whitespace only
    POST _analyze 
    { 
      "analyzer": "whitespace", 
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." 
    } 
    -->  [ The,2,QUICK,Brown-Foxes,jumped,over,the,lazy,dog's,bone. ]
  • simple  splits at non-letter characters and lowercases the terms     
    POST _analyze 
    { 
      "analyzer": "simple", 
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." 
    } 
    ---> [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

  • stop   removes stop words; stopwords defaults to _english_ (a configuration sketch follows this list) 
    POST _analyze 
    { 
      "analyzer": "stop", 
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." 
    } 
    -->[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ] 
    Optional parameters: 
    # stopwords 
    # stopwords_path
  • keyword  does not tokenize; outputs the whole input as a single token
    POST _analyze 
    { 
      "analyzer": "keyword", 
      "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."] 
    } 
    Result:  "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." (the complete sentence as one token)
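
As referenced in the stop bullet above, the stop analyzer can be configured with a custom stop word list. A minimal sketch (the index name my_stop_index and the word list are only illustrative):

PUT my_stop_index 
{ 
  "settings": { 
    "analysis": { 
      "analyzer": { 
        "my_stop_analyzer":{ 
          "type":"stop", 
          "stopwords":["the", "over"]    # custom list instead of the default _english_ 
        } 
      } 
    } 
  } 
} 
POST my_stop_index/_analyze 
{ 
  "analyzer": "my_stop_analyzer", 
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."] 
} 
--> roughly [ quick, brown, foxes, jumped, lazy, dog, s, bone ]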

 

Third-party analyzer plugins: Chinese analysis (the ik analyzer)

ES has many built-in analyzers, but none of them handle Chinese well; the standard analyzer, for example, splits a Chinese sentence into single characters. In that case you can use a third-party analyzer plugin such as ik or pinyin. The following uses ik as an example.

1. First install the plugin, then restart ES:

# bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip 
# /etc/init.d/elasticsearch restart

2. Usage examples:

GET _analyze 
{ 
  "analyzer": "ik_max_word", 
  "text": "你好吗?我有一句话要对你说呀。" 
}

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "你好吗?我有一句话要对你说呀。"
}
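
The ik analyzers are commonly combined with the analyzer / search_analyzer mapping options shown earlier: the fine-grained ik_max_word at index time and the coarser ik_smart at search time. A minimal sketch (the index and field names are only illustrative):

PUT news_index 
{ 
  "mappings": { 
    "properties": { 
        "content":{ 
          "type": "text", 
          "analyzer": "ik_max_word",       # fine-grained tokenization when indexing 
          "search_analyzer": "ik_smart"    # coarser tokenization when searching 
        } 
      } 
  } 
}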

Reference: https://github.com/medcl/elasticsearch-analysis-ik

You can also assemble your own analyzer (a custom analyzer) from the built-in character filters, tokenizers, and token filters

  • custom  a custom analyzer is composed of:
    • zero or more character filters
    • exactly one tokenizer
    • zero or more token filters

    

PUT t_index 
{ 
  "settings": { 
    "analysis": { 
      "analyzer": { 
        "my_analyzer":{ 
          "type":"custom", 
          "tokenizer":"standard", 
          "char_filter":["html_strip"], 
          "filter":["lowercase"] 
        } 
      } 
    } 
  } 
} 
POST t_index/_analyze 
{ 
  "analyzer": "my_analyzer", 
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's <b> bone.</b>"] 
} 
Result: [the,2,quick,brown,foxes,jumped,over,the,lazy,dog's,bone]

Custom analysis components

Custom analysis components are defined in the index settings, as shown below:

PUT test_index 
{ 
  "settings": { 
    "analysis": {    # analysis settings, customizable 
      "char_filter": {},   # char_filter, fixed key 
      "tokenizer": {},    # tokenizer, fixed key 
      "filter": {},     # filter, fixed key 
      "analyzer": {}    # analyzer, fixed key 
    } 
  } 
}

character filter  processes the raw text before the tokenizer, for example by adding, deleting, or replacing characters

It affects the position and offset information later produced by the tokenizer

  • html_strip  strips HTML tags and decodes HTML entities
    • Parameter: escaped_tags, tags that should not be stripped

  

POST _analyze 
{ 
  "tokenizer": "keyword", 
  "char_filter": ["html_strip"], 
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"] 
} 
Result: 
      "token": """ 
 
I'm so happy! 
 
""" 
# Configuration example 
PUT t_index 
{ 
  "settings": { 
    "analysis": { 
      "analyzer": {  # fixed key 
        "my_analyzer":{   # custom analyzer 
          "tokenizer":"keyword", 
          "char_filter":["my_char_filter"] 
        } 
      }, 
      "char_filter": {  # fixed key 
        "my_char_filter":{   # custom char_filter 
          "type":"html_strip", 
          "escaped_tags":["b"]  # array of HTML tags that will not be stripped from the text 
        } 
      }}}} 
POST t_index/_analyze 
{ 
  "analyzer": "my_analyzer", 
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"] 
} 
Result: 
      "token": """ 
 
I'm so <b>happy</b>! 
 
""",
  • mapping    the mapping char filter; exactly one of the following parameters is required
    • mappings  a set of mappings, each in the form key => value
    • mappings_path  an absolute path, or a path relative to the config directory, pointing to a file of key => value mappings
    PUT t_index 
    { 
      "settings": { 
        "analysis": { 
          "analyzer": {     # fixed key 
            "my_analyzer":{   # custom analyzer 
              "tokenizer":"standard", 
              "char_filter":"my_char_filter"   
            } 
          }, 
          "char_filter": {    # fixed key 
            "my_char_filter":{  # custom char_filter 
              "type":"mapping",  
              "mappings":[       # the mapping rules 
                ":)=>happy", 
                ":(=>sad" 
              ] 
            }}}}} 
    POST t_index/_analyze 
    { 
      "analyzer": "my_analyzer", 
      "text": ["i am so :)"] 
    }
    Result: [i,am,so,happy]
  • pattern replace  (a sketch follows this list)
    • pattern  a regular expression
    • replacement  the replacement string; may reference capture groups via $1..$9
    • flags  regex flags
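
As referenced in the pattern replace bullet above, a minimal sketch of the pattern_replace char filter (the index name p_index and the example text are only illustrative); it rewrites dash-separated digit groups into underscore-separated ones before tokenization, so the standard tokenizer keeps them as one term:

PUT p_index 
{ 
  "settings": { 
    "analysis": { 
      "analyzer": { 
        "my_analyzer":{ 
          "tokenizer":"standard", 
          "char_filter":["my_char_filter"] 
        } 
      }, 
      "char_filter": { 
        "my_char_filter":{ 
          "type":"pattern_replace", 
          "pattern":"(\\d+)-(?=\\d)",   # a digit group followed by a dash and another digit 
          "replacement":"$1_" 
        } 
      } 
    } 
  } 
} 
POST p_index/_analyze 
{ 
  "analyzer": "my_analyzer", 
  "text": ["My credit card is 123-456-789"] 
} 
--> roughly [ My, credit, card, is, 123_456_789 ]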

tokenizer  splits the raw text into terms according to certain rules

  • standard
    • Parameter: max_token_length, the maximum token length, default 255

    

PUT t_index 
{ 
  "settings": { 
    "analysis": { 
      "analyzer": { 
        "my_analyzer":{ 
          "tokenizer":"my_tokenizer" 
        } 
      }, 
      "tokenizer": {  
        "my_tokenizer":{ 
          "type":"standard", 
          "max_token_length":5       
        }}}}} 
POST t_index/_analyze 
{ 
  "analyzer": "my_analyzer", 
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."] 
} 
Result:   [ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ] 
# jumped has length 6, so it is split at position 5
  • letter    splits into terms at non-letter characters

  

POST _analyze 
{ 
  "tokenizer": "letter", 
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."] 
} 
Result: [ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
  • lowercase  same as the letter tokenizer, but also lowercases the letters

  

POST _analyze 
{ 
  "tokenizer": "lowercase", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." 
} 
Result:  [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]
  • whitespace   splits into terms at whitespace characters
    • Parameter: max_token_length
POST _analyze 
{ 
  "tokenizer": "whitespace", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." 
} 
Result: [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
  • keyword   a no-op tokenizer that outputs the exact text it receives as a single term
    • Parameter: buffer_size, the number of characters read into the term buffer at a time, default 256
POST _analyze 
{ 
  "tokenizer": "keyword", 
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."] 
} 
Result: "token": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone." (the full text as one token)

token filter   adds, removes, or modifies the tokens produced by the tokenizer

  • lowercase  converts the output tokens to lowercase
POST _analyze 
{ 
  "filter": ["lowercase"], 
  "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's  bone"] 
} 
---> 
"token": "the 2 quick brown-foxes jumped over the lazy dog's  bone" 
 
PUT t_index 
{ 
  "settings": { 
    "analysis": { 
      "analyzer": { 
        "my_analyzer":{ 
          "type":"custom",  
          "tokenizer":"standard",  
          "filter":"lowercase" 
        } 
      } 
    } 
  } 
} 
POST t_index/_analyze 
{ 
  "analyzer": "my_analyzer", 
    "text": ["The 2 QUICK Brown-Foxes jumped over the lazy dog's  bone"] 
}
  • stop  removes stop words from the token stream. 
    Parameters: 
    # stopwords   the stop words to use, default _english_ 
    # stopwords_path 
    # ignore_case   set to true to match stop words case-insensitively, default false 
    # remove_trailing
    PUT t_index 
    { 
      "settings": { 
        "analysis": { 
          "analyzer": { 
            "my_analyzer":{ 
              "type":"custom", 
              "tokenizer":"standard", 
              "filter":"my_filter" 
            } 
          }, 
          "filter": { 
            "my_filter":{ 
              "type":"stop", 
              "stopwords":["and","or","not"] 
            } 
          } 
        } 
      } 
    } 
    POST t_index/_analyze 
    { 
      "analyzer": "my_analyzer", 
      "text": ["lucky and happy not sad"] 
    }
    --> [ lucky, happy, sad ]

