Aggregations

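Elasticsearch (ES) is a search and analytics engine, and aggregations are its core tool for data analysis. ES provides three families of aggregations:

  • Bucket aggregations: comparable to GROUP BY in SQL; they group documents into buckets and count the documents in each bucket.
  • Metric aggregations: compute a metric over a field, such as the maximum, minimum, or average; they can be nested inside bucket aggregations.
  • Pipeline aggregations: take the output of other aggregations as input and process it further, for example a sum, an average, or percentiles across buckets.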

Prepare the test data:

json
PUT goods/_doc/1
{
  "brand" : "小米",
  "name":"红米手机",
  "level": "低端",
  "price": 1099,
  "description":"红米手机,老百姓的平价手机"
}

PUT goods/_doc/2
{
  "brand" : "vivo",
  "name":"IQ 手机",
  "level": "低端",
  "price": 1299,
  "description":"低价拍照手机"
}

PUT goods/_doc/3
{
  "brand" : "vivo",
  "name":"VIVO 拍照手机",
  "level": "高端",
  "price": 3299,
  "description":"买拍照手机,我选 vivo"
}

PUT goods/_doc/4
{
  "brand" : "小米",
  "name":"小米手机",
  "level": "中端",
  "price": 2499,
  "description":"小米手机,性价比的神"
}

PUT goods/_doc/5
{
  "brand" : "华为",
  "name":"华为 mate 手机",
  "level": "高端",
  "price": 5099,
  "description":"mate手机,旗舰高端大气"
}

PUT goods/_doc/6
{
  "brand" : "华为",
  "name":"华为保时捷手机",
  "level": "旗舰",
  "price": 10099,
  "description":"保时捷设计,奢华之选"
}


PUT goods/_doc/7
{
  "brand" : "华为",
  "name":"freebuds pro",
  "level": "旗舰",
  "price": 1299,
  "description":"华为降噪耳机"
}

Bucket aggregations

Bucket products by brand with the terms aggregation:

json
GET goods/_search
{
  "aggs": {
    "brand_agg": {
      "terms": {
        "field": "brand.keyword"
      }
    }
  }
}

// Result:

"buckets": [
    {
        "key": "华为",
        "doc_count": 3
    },
    {
        "key": "vivo",
        "doc_count": 2
    },
    {
        "key": "小米",
        "doc_count": 2
    }
]
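
To make the bucketing concrete, here is a minimal Python sketch of what the terms aggregation above computes for the seven test documents. It is a local model, not Elasticsearch code: the `BRANDS` list and the `terms_agg` helper are names invented for this illustration, and the tie-breaking rule (count descending, then key ascending) is assumed to mirror the default ordering.

```python
from collections import Counter

# Brand of each of the seven test documents, in _id order.
BRANDS = ["小米", "vivo", "vivo", "小米", "华为", "华为", "华为"]

def terms_agg(values, size=10):
    """Count documents per distinct value, like a terms aggregation."""
    counts = Counter(values)
    # Assumed default order: doc_count descending, ties broken by key ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": n} for key, n in ordered[:size]]

buckets = terms_agg(BRANDS)
for bucket in buckets:
    print(bucket)
```

The three buckets come out in the same order as the result shown above: 华为 with 3 documents, then vivo and 小米 with 2 each.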

Bucket by brand and market tier (level) together with the multi_terms aggregation:

json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "terms_agg": {
      "multi_terms": {
        "terms":[
          {
            "field":"brand.keyword"
          },
          {
            "field" : "level.keyword"
          }
        ]
      }
    }
  }
}
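
The source does not show the result of this request, so the following Python sketch works out which composite buckets the seven test documents would fall into. It is a local model under stated assumptions: `DOCS` and `multi_terms_agg` are illustrative names, and the order of buckets with equal counts is an assumption of the sketch.

```python
from collections import Counter

# (brand, level) of each test document, in _id order.
DOCS = [
    ("小米", "低端"), ("vivo", "低端"), ("vivo", "高端"), ("小米", "中端"),
    ("华为", "高端"), ("华为", "旗舰"), ("华为", "旗舰"),
]

def multi_terms_agg(docs):
    """Group by the tuple of field values, which acts as a composite key."""
    counts = Counter(docs)
    # Largest buckets first; the order among equal counts is an assumption.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": list(key), "doc_count": n} for key, n in ordered]

buckets = multi_terms_agg(DOCS)
for bucket in buckets:
    print(bucket)
```

Only the (华为, 旗舰) combination holds two documents; the other five combinations hold one each.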

Metric aggregations

Compute three metrics over all products: the highest price (最贵), the lowest price (最便宜), and the average price (平均价格).

json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "最贵": {
      "max": {
        "field": "price"
      }
    },
    "最便宜" : {
      "min": {
        "field": "price"
      }
    },
    "平均价格" : {
      "avg": {
        "field": "price"
      }
    }
  }
}

// Result

"aggregations": {
    "最贵": {
      "value": 10099
    },
    "最便宜": {
      "value": 1099
    },
    "平均价格": {
      "value": 3527.5714285714284
    }
}

The stats aggregation returns count, min, max, avg, and sum for the price field in a single request:

json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "price_stats": {
      "stats": {
        "field": "price"
      }
    }
  }
}

// Result
"price_stats": {
    "count": 7, // number of values (one per document here)
    "min": 1099, // lowest price
    "max": 10099, // highest price
    "avg": 3527.5714285714284, // average price
    "sum": 24693 // sum of all prices
}
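
The numbers above can be checked by hand. This short Python sketch recomputes the same five metrics from the prices of the test documents; `PRICES` and `stats_agg` are names made up for the illustration.

```python
# Prices of the seven test documents, in _id order.
PRICES = [1099, 1299, 3299, 2499, 5099, 10099, 1299]

def stats_agg(values):
    """Return the five metrics that a stats aggregation reports."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

price_stats = stats_agg(PRICES)
print(price_stats)
```

The sum is 24693 over 7 values, which gives the average of roughly 3527.57 reported by both the avg and the stats aggregation.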

To count only the number of values in a field, use value_count:

json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "price_value_count": {
      "value_count": {
        "field": "price"
      }
    }
  }
}

// Result

"price_value_count": {
  "value": 7
}

Pipeline aggregations

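Find the brand with the lowest average price:

  • 1. Bucket the products by brand.
  • 2. Compute the average price within each bucket.
  • 3. Pick the bucket with the lowest average price.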
json

GET goods/_search
{
  "size": 0,
  "aggs": {
    "agg_pip": {
      "terms": {
        "field": "brand.keyword"
      },
      "aggs": {
        "price_avg": {
          "avg": {
            "field": "price"
          }
        }
      }
    },
    "min_avg_brand":{
      "min_bucket": {
        // `buckets_path` is resolved relative to the position of the pipeline aggregation; it is not an absolute path
        "buckets_path": "agg_pip>price_avg"
      }
    }
  }
}

// Result
"aggregations": {
    "agg_pip": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {
                "key": "华为",
                "doc_count": 3,
                "price_avg": {
                    "value": 5499
                }
            },
            {
                "key": "vivo",
                "doc_count": 2,
                "price_avg": {
                    "value": 2299
                }
            },
            {
                "key": "小米",
                "doc_count": 2,
                "price_avg": {
                    "value": 1799
                }
            }
        ]
    },
    "min_avg_brand": {
        "value": 1799,
        "keys": [
            "小米"
        ]
    }
}
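
The three steps of this pipeline (bucket by brand, average each bucket, take the minimum across buckets) can be sketched in plain Python. This is a local model of the computation, not Elasticsearch code; `GOODS`, `avg_price_by_brand`, and `min_bucket` are illustrative names.

```python
from collections import defaultdict

# (brand, price) of each test document, in _id order.
GOODS = [
    ("小米", 1099), ("vivo", 1299), ("vivo", 3299), ("小米", 2499),
    ("华为", 5099), ("华为", 10099), ("华为", 1299),
]

def avg_price_by_brand(goods):
    """Steps 1 and 2: bucket by brand, then average the prices per bucket."""
    prices = defaultdict(list)
    for brand, price in goods:
        prices[brand].append(price)
    return {brand: sum(p) / len(p) for brand, p in prices.items()}

def min_bucket(bucket_values):
    """Step 3: find the lowest metric value and every bucket key that has it."""
    lowest = min(bucket_values.values())
    keys = [key for key, value in bucket_values.items() if value == lowest]
    return {"value": lowest, "keys": keys}

averages = avg_price_by_brand(GOODS)
min_avg_brand = min_bucket(averages)
print(averages)
print(min_avg_brand)
```

The second function only ever sees the per-bucket averages, never the documents. That is the defining trait of a pipeline aggregation: it works on the output of another aggregation.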

Data structures behind aggregations

doc values

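doc values are an on-disk, column-oriented (forward index) data structure built at index time; they exist to make sorting and aggregations efficient. They are enabled by default on all non-analyzed fields and are not supported on text and annotated_text fields, which are served by the inverted index.

If you are certain that a field will never be sorted or aggregated on and never read from a script, you can disable doc values on it to save disk space.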

fielddata

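fielddata is an in-memory data structure built at query time. When a field without doc values, such as a text field, has to be aggregated on, you can enable fielddata for it.

Note: do not enable it unless you have to. It builds a forward index on the fly in the JVM heap, which consumes a large amount of memory.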

json
PUT goods/_mapping
{
  "properties": {
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
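          // Enable fielddata on the name field so that it can be aggregated and sorted on.
          // The analyzer must match the field's existing mapping (it cannot be changed in place);
          // ik_max_word requires the IK analysis plugin.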
          "fielddata" : true
        },
        "price": {
          "type": "long"
        }
      }
}
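
Why is fielddata expensive? An inverted index maps terms to documents, while aggregations need the opposite direction, from each document to its terms. The toy Python sketch below shows the idea of "un-inverting" an index in memory. It is purely conceptual and does not reflect Lucene's actual data layout; the terms in `INVERTED` are invented for the example.

```python
from collections import defaultdict

# Toy inverted index for an analyzed text field: term -> ids of matching docs.
# The terms and postings are hypothetical, chosen only for illustration.
INVERTED = {
    "华为": [5, 6],
    "手机": [1, 2, 3, 4, 5, 6],
    "小米": [4],
}

def uninvert(inverted):
    """Build a doc id -> terms mapping by walking every posting list."""
    forward = defaultdict(list)
    for term, doc_ids in inverted.items():
        for doc_id in doc_ids:
            forward[doc_id].append(term)
    return dict(forward)

forward = uninvert(INVERTED)
print(forward)
```

Every posting of every term has to be visited and the whole result held in memory. Doc values avoid this because the per-document column is written to disk once, at index time.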

Paging and sorting in aggregations

  • Order by brand and return only two brand buckets with their document counts
json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "agg_brand": {
      "terms": {
        "field": "brand.keyword",
        "size": 2,
        "order": {
          "_key": "asc"
        }
      }
    }
  }
}
  • Multiple sort criteria: order by document count, then by brand
json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "agg_brand": {
      "terms": {
        "field": "brand.keyword",
        "order": [
          {
            "_count" : "desc"
          },
          {
            "_key" : "desc"
          }
        ]
      }
    }
  }
}
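
As a rough illustration of how two sort criteria combine, the Python sketch below orders the brand buckets by count descending and then by key descending. It is a local model; `BRANDS` and `order_buckets` are names made up for the example, and Python's string comparison stands in for the key ordering.

```python
from collections import Counter

# Brand of each of the seven test documents, in _id order.
BRANDS = ["小米", "vivo", "vivo", "小米", "华为", "华为", "华为"]

def order_buckets(counts):
    """Apply _count desc as the primary and _key desc as the secondary criterion."""
    # Python's sort is stable, so sort by the secondary criterion first,
    # then by the primary one; ties keep the order of the first pass.
    by_key = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)
    return sorted(by_key, key=lambda kv: kv[1], reverse=True)

ordered = order_buckets(Counter(BRANDS))
print(ordered)
```

华为 comes first because of its count of 3. 小米 and vivo tie on 2, so the descending key order puts 小米 ahead of vivo.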
  • Order by a nested sub-aggregation: sort brands by average price, lowest first
json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "agg_term_brand": {
      "terms": {
        "field": "brand.keyword",
        "order": {
          "agg_avg_price": "asc"
        }
      },
      "aggs": {
        "agg_avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

// Result

"buckets": [
    {
        "key": "小米",
        "doc_count": 2,
        "agg_avg_price": {
          "value": 1799
        }
    },
    {
        "key": "vivo",
        "doc_count": 2,
        "agg_avg_price": {
          "value": 2299
        }
    },
    {
        "key": "华为",
        "doc_count": 3,
        "agg_avg_price": {
          "value": 5499
        }
    }
]

Filter aggregations

Filter

  • Compute the average price of all products and the average price of high-end (高端) products separately
json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "all_avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "high_level" : {
      "filter": {
        "term": {
          "level.keyword": "高端"
        }
      },
      "aggs": {
        "高端平均价格": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
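
The source does not list the result of this request. The following Python sketch computes the two averages locally from the test data so the effect of the filter is visible; `GOODS` and `avg` are illustrative names.

```python
# (level, price) of each test document, in _id order.
GOODS = [
    ("低端", 1099), ("低端", 1299), ("高端", 3299), ("中端", 2499),
    ("高端", 5099), ("旗舰", 10099), ("旗舰", 1299),
]

def avg(values):
    return sum(values) / len(values)

# Unfiltered metric: every document contributes.
all_avg_price = avg([price for _, price in GOODS])

# The filter narrows the document set first; the sub-aggregation runs on the rest.
high_end_prices = [price for level, price in GOODS if level == "高端"]
high_end_avg = avg(high_end_prices)

print(round(all_avg_price, 2), len(high_end_prices), high_end_avg)
```

Two documents are tagged 高端 (3299 and 5099), so their average is 4199, while the unfiltered average stays at about 3527.57.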

Filters

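  • Count Huawei products, Xiaomi products, and everything else in separate buckets. The filters match on the name field rather than brand, so the Huawei earbuds (freebuds pro) land in the other bucket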
json
GET goods/_search?filter_path=aggregations
{
  "size": 0,
  "aggs": {
    "test_filters": {
      "filters": {
        "other_bucket_key": "other",
        "filters": {
          "huawei":{
            "match" : {
              "name" : "华为"
            }
          },
          "agg_xiaomi":{
            "match" : {
              "name" : "小米"
            }
          }
        }
      }
    }
  }
}

// Result

"buckets": {
    "agg_xiaomi": {
      "doc_count": 1
    },
    "huawei": {
      "doc_count": 2
    },
    "other": {
      "doc_count": 4
    }
}
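
The bucket assignment can be modelled in a few lines of Python. Note the simplification: a plain substring test stands in for the analyzed match query, which is an assumption of this sketch and happens to agree with the result above for this data. `NAMES` and `filters_agg` are illustrative names.

```python
# The name field of each test document, in _id order.
NAMES = [
    "红米手机", "IQ 手机", "VIVO 拍照手机", "小米手机",
    "华为 mate 手机", "华为保时捷手机", "freebuds pro",
]

def filters_agg(names, named_filters, other_bucket_key="other"):
    """Count documents per named filter; unmatched documents go to 'other'."""
    buckets = {key: 0 for key in named_filters}
    buckets[other_bucket_key] = 0
    for name in names:
        matched = [key for key, pred in named_filters.items() if pred(name)]
        for key in matched:  # a document may fall into several buckets
            buckets[key] += 1
        if not matched:
            buckets[other_bucket_key] += 1
    return buckets

buckets = filters_agg(NAMES, {
    "huawei": lambda name: "华为" in name,      # stand-in for match on 华为
    "agg_xiaomi": lambda name: "小米" in name,  # stand-in for match on 小米
})
print(buckets)
```

Unlike terms buckets, filter buckets are not mutually exclusive: a document that satisfies two filters is counted in both, and only documents that satisfy none reach the other bucket.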

Global aggregation (global)

  • Compute the average price of products priced between 2000 and 8000, together with the average price of all products
json
GET goods/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 2000,
        "lte": 8000
      }
    }
  },
  "aggs": {
    // This aggregation is scoped by the range query
    "avg_price": {
      "avg": {
        "field": "price"
      }
    },
    // This aggregation is wrapped in global, so it averages the price of all products
    "all_avg_price": {
      "global": {},
      "aggs": {
        "agg_avg": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
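
Since no result is shown for this request, here is a small Python sketch that works out both averages from the test data. It models the scoping only; `PRICES` and `avg` are names chosen for the illustration.

```python
# Prices of the seven test documents, in _id order.
PRICES = [1099, 1299, 3299, 2499, 5099, 10099, 1299]

def avg(values):
    return sum(values) / len(values)

# The query decides which documents ordinary aggregations see.
matched = [price for price in PRICES if 2000 <= price <= 8000]
avg_price = avg(matched)

# A global aggregation ignores the query and sees every document.
all_avg_price = avg(PRICES)

print(matched, round(avg_price, 2), round(all_avg_price, 2))
```

Three products fall in the range (3299, 2499, 5099), giving an average of about 3632.33, while the global average over all seven stays at about 3527.57.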

Post filter (post_filter)

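The difference between post_filter and query: query narrows the set of documents that aggregations run on, whereas post_filter is applied to the hits only after the aggregations have been computed, so it does not change the aggregation results.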

json
GET goods/_search
{
  "aggs": {
    "brand_agg" : {
      "terms": {
        "field": "brand.keyword"
      }
    }
  },
  "post_filter": {
    "range": {
      "price": {
        "gte": 5000
      }
    }
  }
}
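
The order of operations can be shown with a small Python model of the request above: the aggregation is computed over everything the query matched (all documents here, since there is no query), and only afterwards are the hits filtered. `GOODS`, `brand_agg`, and `hits` are illustrative names.

```python
from collections import Counter

# (_id, brand, price) of each test document.
GOODS = [
    (1, "小米", 1099), (2, "vivo", 1299), (3, "vivo", 3299), (4, "小米", 2499),
    (5, "华为", 5099), (6, "华为", 10099), (7, "华为", 1299),
]

# Step 1: aggregations run on the documents matched by the query (all of them).
brand_agg = Counter(brand for _, brand, _ in GOODS)

# Step 2: post_filter trims the returned hits, leaving the aggregation untouched.
hits = [doc_id for doc_id, _, price in GOODS if price >= 5000]

print(dict(brand_agg), hits)
```

The brand counts still cover all seven documents, yet only documents 5 and 6 are returned as hits. This pattern suits faceted navigation, where the facet counts should not shrink when the user narrows the result list.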

Top hits within each bucket (top_hits)

  • Aggregate by brand and return the top document of each bucket (with no sort specified, top_hits ranks by score)
json
GET goods/_search
{
  "size": 0,
  "aggs": {
    "brand_agg": {
      "terms": {
        "field": "brand.keyword"
      },
      "aggs": {
        "top_agg": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

// Result (excerpt)
"buckets": [
    {
        "key": "华为",
        "doc_count": 3,
        "top_agg": {
            "hits": {
                "total": {
                    "value": 3,
                    "relation": "eq"
                },
                "max_score": 1,
                "hits": [
                    {
                        "_index": "goods",
                        "_id": "7",
                        "_score": 1,
                        "_source": {
                            "brand": "华为",
                            "name": "freebuds pro",
                            "level": "旗舰",
                            "price": 1299,
                            "description": "华为降噪耳机"
                        }
                    }
                ]
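
In the request above every document has the same score, so which document comes first is effectively arbitrary. The Python sketch below models the "top N per bucket" idea with an explicit ordering instead. Sorting by price descending is an assumption introduced for this illustration and is not part of the original request; `GOODS` and `top_hits_by_brand` are illustrative names.

```python
from collections import defaultdict

# (brand, name, price) of each test document, in _id order.
GOODS = [
    ("小米", "红米手机", 1099), ("vivo", "IQ 手机", 1299),
    ("vivo", "VIVO 拍照手机", 3299), ("小米", "小米手机", 2499),
    ("华为", "华为 mate 手机", 5099), ("华为", "华为保时捷手机", 10099),
    ("华为", "freebuds pro", 1299),
]

def top_hits_by_brand(goods, size=1):
    """Group by brand, then keep the `size` most expensive documents per group."""
    groups = defaultdict(list)
    for brand, name, price in goods:
        groups[brand].append((name, price))
    return {
        # Hypothetical ordering: price descending, to make the pick deterministic.
        brand: sorted(docs, key=lambda doc: doc[1], reverse=True)[:size]
        for brand, docs in groups.items()
    }

top = top_hits_by_brand(GOODS)
print(top)
```

With this ordering each brand bucket returns its most expensive product rather than an arbitrary one.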

Other commonly used aggregation functions