Aggregations
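Elasticsearch (ES) is a search and analytics engine, and aggregations are its core tool for data analysis. ES provides three types of aggregation:
- Bucket aggregations: comparable to a SQL group by clause; they group documents into buckets and count the documents in each bucket.
- Metric aggregations: compute a metric over a set of documents, such as the maximum, minimum, or average; they can be combined with bucket aggregations.
- Pipeline aggregations: post-process the output of other aggregations (typically metric aggregations), for example to take sums, averages, or percentiles of their results.
Prepare the test data: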
json
PUT goods/_doc/1
{
"brand" : "小米",
"name":"红米手机",
"level": "低端",
"price": 1099,
"description":"红米手机,老百姓的平价手机"
}
PUT goods/_doc/2
{
"brand" : "vivo",
"name":"IQ 手机",
"level": "低端",
"price": 1299,
"description":"低价拍照手机"
}
PUT goods/_doc/3
{
"brand" : "vivo",
"name":"VIVO 拍照手机",
"level": "高端",
"price": 3299,
"description":"买拍照手机,我选 vivo"
}
PUT goods/_doc/4
{
"brand" : "小米",
"name":"小米手机",
"level": "中端",
"price": 2499,
"description":"小米手机,性价比的神"
}
PUT goods/_doc/5
{
"brand" : "华为",
"name":"华为 mate 手机",
"level": "高端",
"price": 5099,
"description":"mate手机,旗舰高端大气"
}
PUT goods/_doc/6
{
"brand" : "华为",
"name":"华为保时捷手机",
"level": "旗舰",
"price": 10099,
"description":"保时捷设计,奢华之选"
}
PUT goods/_doc/7
{
"brand" : "华为",
"name":"freebuds pro",
"level": "旗舰",
"price": 1299,
"description":"华为降噪耳机"
}
Bucket aggregations
Bucket by product brand: terms aggregation
json
GET goods/_search
{
"aggs": {
"brand_agg": {
"terms": {
"field": "brand.keyword"
}
}
}
}
// Result:
"buckets": [
{
"key": "华为",
"doc_count": 3
},
{
"key": "vivo",
"doc_count": 2
},
{
"key": "小米",
"doc_count": 2
}
]
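To make the group by analogy concrete, here is a minimal plain-Python sketch of what the terms aggregation above computes over the sample data. It illustrates the semantics only and is not how Elasticsearch implements it; the `GOODS` list and the `terms_agg` helper are names made up for this example, and the tie-break on ascending key is an assumption chosen to reproduce the bucket order shown above.

```python
from collections import Counter

# (brand, price) pairs copied from the seven sample documents.
GOODS = [
    ("小米", 1099), ("vivo", 1299), ("vivo", 3299), ("小米", 2499),
    ("华为", 5099), ("华为", 10099), ("华为", 1299),
]


def terms_agg(docs):
    """Group documents by brand and count each group, like SQL GROUP BY."""
    counts = Counter(brand for brand, _ in docs)
    # Largest bucket first; ties broken by ascending key (an assumption).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


buckets = terms_agg(GOODS)
for key, doc_count in buckets:
    print(key, doc_count)
```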
Bucket by product brand and market tier (level): multi_terms aggregation
json
GET goods/_search
{
"size": 0,
"aggs": {
"terms_agg": {
"multi_terms": {
"terms":[
{
"field":"brand.keyword"
},
{
"field" : "level.keyword"
}
]
}
}
}
}
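The request above buckets on the combination of two fields. As a rough sketch of that idea in plain Python (not the Elasticsearch response format), each document contributes one composite key made of its brand and level. `GOODS` and `multi_terms_agg` are hypothetical names for this illustration.

```python
from collections import Counter

# (brand, level) pairs copied from the seven sample documents.
GOODS = [
    ("小米", "低端"), ("vivo", "低端"), ("vivo", "高端"), ("小米", "中端"),
    ("华为", "高端"), ("华为", "旗舰"), ("华为", "旗舰"),
]


def multi_terms_agg(docs):
    """Count documents per distinct (brand, level) combination."""
    return Counter(docs)


counts = multi_terms_agg(GOODS)
for key, doc_count in counts.most_common():
    print(key, doc_count)
```

With this data only the (华为, 旗舰) combination occurs more than once; every other combination forms a bucket of one document.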
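Metric aggregations
Compute three metrics over the products: the highest price, the lowest price, and the average price. The aggregation names 最贵, 最便宜, and 平均价格 mean "most expensive", "cheapest", and "average price".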
json
GET goods/_search
{
"size": 0,
"aggs": {
"最贵": {
"max": {
"field": "price"
}
},
"最便宜" : {
"min": {
"field": "price"
}
},
"平均价格" : {
"avg": {
"field": "price"
}
}
}
}
// Result
"aggregations": {
"最贵": {
"value": 10099
},
"最便宜": {
"value": 1099
},
"平均价格": {
"value": 3527.5714285714284
}
}
All of these price metrics can also be computed in a single aggregation: stats
json
GET goods/_search
{
"size": 0,
"aggs": {
"price_stats": {
"stats": {
"field": "price"
}
}
}
}
// Result
"price_stats": {
"count": 7, // number of documents
"min": 1099, // lowest price
"max": 10099, // highest price
"avg": 3527.5714285714284, // average price
"sum": 24693 // sum of all prices
}
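The five numbers returned by stats can be checked by hand. The following sketch recomputes them in plain Python from the sample prices; `PRICES` and `stats_agg` are names invented for this example, and the code only mirrors the arithmetic, not Elasticsearch internals.

```python
# Prices of the seven sample documents, in document order.
PRICES = [1099, 1299, 3299, 2499, 5099, 10099, 1299]


def stats_agg(values):
    """Return the same five figures that the stats aggregation reports."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


stats = stats_agg(PRICES)
print(stats)
```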
Count values only: value_count
json
GET goods/_search
{
"size": 0,
"aggs": {
"price_value_count": {
"value_count": {
"field": "price"
}
}
}
}
// Result
"price_value_count": {
"value": 7
}
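Pipeline aggregations
Find the brand with the lowest average price:
- 1. Bucket the documents by brand
- 2. Compute the average price within each bucket
- 3. Pick the bucket with the lowest average price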
json
GET goods/_search
{
"size": 0,
"aggs": {
"agg_pip": {
"terms": {
"field": "brand.keyword"
},
"aggs": {
"price_avg": {
"avg": {
"field": "price"
}
}
}
},
"min_avg_brand":{
"min_bucket": {
// `buckets_path` is relative to the position of the pipeline aggregation, not an absolute path
"buckets_path": "agg_pip>price_avg"
}
}
}
}
// Result
"aggregations": {
"agg_pip": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "华为",
"doc_count": 3,
"price_avg": {
"value": 5499
}
},
{
"key": "vivo",
"doc_count": 2,
"price_avg": {
"value": 2299
}
},
{
"key": "小米",
"doc_count": 2,
"price_avg": {
"value": 1799
}
}
]
},
"min_avg_brand": {
"value": 1799,
"keys": [
"小米"
]
}
}
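The three steps (bucket, average per bucket, pick the minimum) can be sketched in plain Python as follows. This is an illustration under the assumption that min_bucket simply selects the smallest sibling bucket value and reports every key that reaches it; `avg_price_by_brand` and `min_bucket` are hypothetical helper names, not Elasticsearch APIs.

```python
from collections import defaultdict

# (brand, price) pairs copied from the seven sample documents.
GOODS = [
    ("小米", 1099), ("vivo", 1299), ("vivo", 3299), ("小米", 2499),
    ("华为", 5099), ("华为", 10099), ("华为", 1299),
]


def avg_price_by_brand(docs):
    """Steps 1 and 2: bucket by brand, then average the prices per bucket."""
    grouped = defaultdict(list)
    for brand, price in docs:
        grouped[brand].append(price)
    return {brand: sum(prices) / len(prices) for brand, prices in grouped.items()}


def min_bucket(bucket_values):
    """Step 3: find the lowest value and the bucket keys that hold it."""
    lowest = min(bucket_values.values())
    keys = [key for key, value in bucket_values.items() if value == lowest]
    return {"value": lowest, "keys": keys}


averages = avg_price_by_brand(GOODS)
result = min_bucket(averages)
print(averages)
print(result)
```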
Aggregation data types
doc values
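doc values are one of the basic forward-index data structures. They exist to make sorting and aggregations efficient, and they are enabled by default on all non-analyzed fields. They are not supported on text and annotated_text fields, which rely on the inverted index.
If you are sure you will never sort or aggregate on a field, and never access its value from a script, you can disable doc values to save disk space.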
fielddata
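fielddata is an in-memory data structure built at query time. When a field without doc values, such as a text field, needs to be aggregated, you can enable fielddata on it.
Note: do not enable it unless you have to. It builds a forward index on the fly in the JVM heap, which consumes a lot of memory.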
json
PUT goods/_mapping
{
"properties": {
"name": {
"type": "text",
"analyzer": "ik_max_word",
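// Update the mapping of the name field to enable fielddata; name can then be used in aggregations and sorting.
// Note: the analyzer of an existing field cannot be changed, so ik_max_word (from the IK analysis plugin) must match the analyzer the field was created with.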
"fielddata" : true
},
"price": {
"type": "long"
}
}
}
Pagination and sorting in aggregation queries
- Sort by brand and show the product counts for only two brands
json
GET goods/_search
{
"size": 0,
"aggs": {
"agg_brand": {
"terms": {
"field": "brand.keyword",
"size": 2,
"order": {
"_key": "asc"
}
}
}
}
}
- Sorting on multiple criteria: sort by document count, then by brand
json
GET goods/_search
{
"size": 0,
"aggs": {
"agg_brand": {
"terms": {
"field": "brand.keyword",
"order": [
{
"_count" : "desc"
},
{
"_key" : "desc"
}
]
}
}
}
}
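The order array above reads as "primary key first, next key as tie-breaker". A small plain-Python sketch of that precedence on the sample brands follows. It is only an illustration: `BRANDS` and `ordered_buckets` are made-up names, and the key comparison uses Python's code-point string ordering, which is assumed here to agree with how the keyword values compare.

```python
from collections import Counter

# Brand of each of the seven sample documents.
BRANDS = ["小米", "vivo", "vivo", "小米", "华为", "华为", "华为"]


def ordered_buckets(brands):
    """Order buckets by count descending, then by key descending."""
    counts = Counter(brands)
    # The tuple (count, key) is compared left to right, so the key only
    # matters when two buckets have the same count.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)


buckets = ordered_buckets(BRANDS)
print(buckets)
```

Here 华为 comes first because it has the most documents, and the two brands tied at two documents are then ordered by key.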
- Sorting by a nested sub-aggregation: sort phone brands by average price, from lowest to highest
json
GET goods/_search
{
"size": 0,
"aggs": {
"agg_term_brand": {
"terms": {
"field": "brand.keyword",
"order": {
"agg_avg_price": "asc"
}
},
"aggs": {
"agg_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
// Result
"buckets": [
{
"key": "小米",
"doc_count": 2,
"agg_avg_price": {
"value": 1799
}
},
{
"key": "vivo",
"doc_count": 2,
"agg_avg_price": {
"value": 2299
}
},
{
"key": "华为",
"doc_count": 3,
"agg_avg_price": {
"value": 5499
}
}
]
Filter aggregations
Filter
- Compute the average price of all products and, separately, the average price of high-end (高端) products
json
GET goods/_search
{
"size": 0,
"aggs": {
"all_avg_price": {
"avg": {
"field": "price"
}
},
"high_level" : {
"filter": {
"term": {
"level.keyword": "高端"
}
},
"aggs": {
"高端平均价格": {
"avg": {
"field": "price"
}
}
}
}
}
}
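The filter aggregation narrows the document set for its sub-aggregations only, while sibling aggregations still see everything. A minimal plain-Python sketch of the two averages, computed from the sample data (the `GOODS` list and `avg` helper are names assumed for this example):

```python
# (level, price) pairs copied from the seven sample documents.
GOODS = [
    ("低端", 1099), ("低端", 1299), ("高端", 3299), ("中端", 2499),
    ("高端", 5099), ("旗舰", 10099), ("旗舰", 1299),
]


def avg(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Sibling aggregation: sees every document.
all_avg_price = avg([price for _, price in GOODS])
# Sub-aggregation under the filter: sees only the 高端 documents.
high_end_avg_price = avg([price for level, price in GOODS if level == "高端"])

print(all_avg_price, high_end_avg_price)
```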
Filters
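- Count the products whose name matches 华为, the products whose name matches 小米, and all remaining products. Note that the filters match on the name field rather than brand, so the 华为 freebuds pro document, whose name does not contain 华为, lands in the other bucket.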
json
GET goods/_search?filter_path=aggregations
{
"size": 0,
"aggs": {
"test_filters": {
"filters": {
"other_bucket_key": "other",
"filters": {
"huawei":{
"match" : {
"name" : "华为"
}
},
"agg_xiaomi":{
"match" : {
"name" : "小米"
}
}
}
}
}
}
}
// Result
"buckets": {
"agg_xiaomi": {
"doc_count": 1
},
"huawei": {
"doc_count": 2
},
"other": {
"doc_count": 4
}
}
Global aggregation: global
- Compute the average price of products priced between 2000 and 8000, together with the average price of all products
json
GET goods/_search
{
"query": {
"range": {
"price": {
"gte": 2000,
"lte": 8000
}
}
},
"aggs": {
// This aggregation is restricted by the range in query
"avg_price": {
"avg": {
"field": "price"
}
},
// This aggregation is wrapped in global, so it ignores the query and covers all products
"all_avg_price": {
"global": {},
"aggs": {
"agg_avg": {
"avg": {
"field": "price"
}
}
}
}
}
}
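As a rough plain-Python sketch of the difference in scope, the first average below is taken over the documents that match the range query and the second over every document, which is what wrapping an aggregation in global achieves. `PRICES` and `avg` are names made up for this illustration.

```python
# Prices of the seven sample documents, in document order.
PRICES = [1099, 1299, 3299, 2499, 5099, 10099, 1299]


def avg(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Documents matched by the range query (gte 2000, lte 8000).
matched = [price for price in PRICES if 2000 <= price <= 8000]

avg_price = avg(matched)      # scoped to the query
all_avg_price = avg(PRICES)   # global: ignores the query

print(matched, avg_price, all_avg_price)
```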
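Post Filter
The difference between post_filter and query: query affects the aggregation results, whereas post_filter does not. post_filter is applied only to the search hits, after the aggregations have been computed.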
json
GET goods/_search
{
"aggs": {
"brand_agg" : {
"terms": {
"field": "brand.keyword"
}
}
},
"post_filter": {
"range": {
"price": {
"gte": 5000
}
}
}
}
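The ordering of the two phases can be sketched in plain Python: the aggregation runs over everything the query matched (here, all documents, since there is no query), and the filter then trims only the returned hits. `search_with_post_filter` is a hypothetical helper for this illustration, not an Elasticsearch API.

```python
from collections import Counter

# (brand, price) pairs copied from the seven sample documents.
GOODS = [
    ("小米", 1099), ("vivo", 1299), ("vivo", 3299), ("小米", 2499),
    ("华为", 5099), ("华为", 10099), ("华为", 1299),
]


def search_with_post_filter(docs, min_price):
    """Aggregate first, then filter the hits, mimicking post_filter."""
    # The aggregation sees every document matched by the query.
    brand_agg = Counter(brand for brand, _ in docs)
    # post_filter narrows only the hits that are returned.
    hits = [doc for doc in docs if doc[1] >= min_price]
    return brand_agg, hits


brand_agg, hits = search_with_post_filter(GOODS, 5000)
print(dict(brand_agg))
print(hits)
```

The brand counts still cover all seven documents, while only the two products priced at 5000 or more come back as hits.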
Querying within aggregation results: Top Hits
- Aggregate by brand and return the top document from each bucket
json
GET goods/_search
{
"size": 0,
"aggs": {
"brand_agg": {
"terms": {
"field": "brand.keyword"
},
"aggs": {
"top_agg": {
"top_hits": {
"size": 1
}
}
}
}
}
}
// Result (excerpt)
"buckets": [
{
"key": "华为",
"doc_count": 3,
"top_agg": {
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "goods",
"_id": "7",
"_score": 1,
"_source": {
"brand": "华为",
"name": "freebuds pro",
"level": "旗舰",
"price": 1299,
"description": "华为降噪耳机"
}
}
]