0. 前言

本文主要介绍了es聚合操作中的pipeline aggregations,pipeline aggregations又称为管道聚合，其作用是产生的输出结果而不是文档集，用于向输出树添加信息。

pipeline聚合的例子
通过cardinality进行去重 es中的cardinality metric是对每个bucket中的指定的field进行去重，得到去重后的count，类似于MySQL 中的count(distinct) 按照sold_date按照每年分组，然后去重计算品牌
GET /television/_search
{
  "size": 0,
  "aggs": {
 "months": {
   "date_histogram": {
     "field": "sold_date",
     "calendar_interval": "1y"
   },
   "aggs": {
     "distinct_brand": {
       "cardinality": {
         "field": "brand"
       }
     }
   }
 }
  }
}

结果如下：通过cardinality进行去重.png

cardinality的参数优化 cardinality类似于count(distinct)，不是完全准确的，可能还有5%的错误率，性能在100ms左右
- precision_threshold优化准确率和内存开销 precision_threshold表示在多少个unique value以内，cardinality，几乎保证100%准确。例如：利用brand进行去重，如果brand的unique value，在100个以内，小米，长虹等，它可以保证几乎保证100%准确。 cardinality算法，会占用precision_threshold * 8 byte 内存消耗，如precision_threshold=100，则占用字节数为：100 * 8 = 800个字节它占用内存很小。而且unique value如果的确在值以内，那么可以确保100%准确。precision_threshold=100，也表示数百万的unique value，错误率在5%以内。 precision_threshold，值设置的越大，占用内存越大，1000 * 8 = 8000 / 1000 = 8KB，可以确保更多unique value的场景下，100%的准确。在使用 precision_threshold参数的时候结合使用场景、内存考虑脚本示例：
```
#cartinality
GET /television/_search
{
"size": 0,
"aggs": {
 "distinct_brand": {
 "cardinality": {
   "field": "brand",
   "precision_threshold": 100
 }
 }
}
}
```

HyperLogLog++ (HLL)算法性能优化 cardinality的底层算法使用了HLL，如果要去优化cardinality需要优化HLL算法的性能底层会对所有的unique value取hash值，通过hash值近似去求distinct count，所以还有一些误差。如果要优化它的性能，默认情况下，发送一个 cardinality的时候，它会动态的对所有的field value取hash值。如果要优化性能，可以将取hash值的操作前移到建立索引的时候，例如：索引映射脚本示例：
```
PUT /television/
{
"mappings": {
  "sales": {
    "properties": {
      "brand": {
        "type": "text",
        "fields": {
          "hash": {
            "type": "murmur3" 
          }
        }
      }
    }
  }
}
}
```

优化后的distinct使用聚合示例

GET /television/_search
{
    "size" : 0,
    "aggs" : {
        "distinct_brand" : {
            "cardinality" : {
              "field" : "brand.hash",
              "precision_threshold" : 100 
            }
        }
    }
}

percentiles百分比算法以及网站访问延时统计记录网站每次请求的访问的耗时，需要统计tp50，tp90，tp99 tp50，tp90，tp99代表的意义分别如下：

tp50：50%的请求的耗时最长在多长时间
tp90：90%的请求的耗时最长在多长时间
tp99：99%的请求的耗时最长在多长时间

例子如下：

tp50，tp90，tp99数据批量插入一批数据，如下：

POST /website/_bulk
{"index":{}}
{"latency":105,"province":"江苏","timestamp":"2020-10-28"}
{"index":{}}
{"latency":83,"province":"江苏","timestamp":"2020-10-29"}
{"index":{}}
{"latency":92,"province":"江苏","timestamp":"2020-10-29"}
{"index":{}}
{"latency":112,"province":"江苏","timestamp":"2020-10-28"}
{"index":{}}
{"latency":68,"province":"江苏","timestamp":"2020-10-28"}
{"index":{}}
{"latency":76,"province":"江苏","timestamp":"2020-10-29"}
{"index":{}}
{"latency":101,"province":"新疆","timestamp":"2020-10-28"}
{"index":{}}
{"latency":275,"province":"新疆","timestamp":"2020-10-29"}
{"index":{}}
{"latency":166,"province":"新疆","timestamp":"2020-10-29"}
{"index":{}}
{"latency":654,"province":"新疆","timestamp":"2020-10-28"}
{"index":{}}
{"latency":389,"province":"新疆","timestamp":"2020-10-28"}
{"index":{}}
{"latency":302,"province":"新疆","timestamp":"2020-10-29"}

查看tp50,tp90,tp99的数据：

GET /website/_search
{
"size": 0,
"aggs": {
 "lantency_percentiles": {
 "percentiles": {
   "field": "latency",
   "percents": [
     50,
     90,
     99
   ]
 }
 },
 "lantency_avg": {
 "avg": {
   "field": "latency"
 }
 }
}
}

结果数据如下：

{
"took" : 0,
"timed_out" : false,
"_shards" : {
 "total" : 1,
 "successful" : 1,
 "skipped" : 0,
 "failed" : 0
},
"hits" : {
 "total" : {
 "value" : 12,
 "relation" : "eq"
 },
 "max_score" : null,
 "hits" : [ ]
},
"aggregations" : {
 "lantency_percentiles" : {
 "values" : {
   "50.0" : 108.5,
   "90.0" : 468.5000000000002,
   "99.0" : 654.0
 }
 }
}
}

tp50，tp90，tp99数据，

percentiles rank metric以及网站访问时延SLA统计 SLA: 提供的网络服务标准网站的提供的访问延时的SLA，要确保所有的请求100%都必须在200ms以内，如果超过1s，代表网站的访问性能和用户体验急剧下降, percentile ranks metric比percentile还要常用，下面按照品牌分组，计算，电视机，售价在1000占比，2000占比，3000占比
- 例子：统计在在200ms以内的，有百分之多少，在1000毫秒以内的有百分之多少
```
GET /website/_search
{
"size": 0,
"aggs": {
 "group_by_province": {
 "terms": {
   "field": "province.keyword"
 },
 "aggs": {
   "latency_percentile_ranks": {
     "percentile_ranks": {
       "field": "latency",
       "values": [
         200,
         1000
       ]
     }
   }
 }
 }
}
}
```
  结果如下： ``` { “took” : 76, “timed_out” : false, “_shards” : { “total” : 1, “successful” : 1, “skipped” : 0, “failed” : 0 }, “hits” : { “total” : { “value” : 12, “relation” : “eq” }, “max_score” : null, “hits” : [ ] }, “aggregations” : { “group_by_province” : { “doc_count_error_upper_bound” : 0, “sum_other_doc_count” : 0, “buckets” : [ { “key” : “新疆”, “doc_count” : 6, “latency_percentile_ranks” : { “values” : { “200.0” : 29.40613026819923, “1000.0” : 100.0 } } }, { “key” : “江苏”, “doc_count” : 6, “latency_percentile_ranks” : { “values” : { “200.0” : 100.0, “1000.0” : 100.0 } } } ] } } }

```

ES聚合操作之Pipeline aggregations

Apr 19, 2022 • KeWeiZhou

0. 前言

pipeline聚合的例子