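Our Newtouch Cloud (新致云) runs an ELK cluster deployed with Docker. After a recent adjustment, log collection (filebeat), metrics collection (metricbeat), and network collection (packetbeat) from 56 servers all feed into a single Elasticsearch cluster.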

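From what we have observed, ES disk writes run at roughly 4~12 MB/s, so the data grows at quite a pace. The ES server currently has only a 1 TB disk, which we estimate will fill up in about 1~2 weeks.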
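A back-of-the-envelope way to check this kind of estimate is sketched below in Python. It assumes a constant average write rate and decimal units (1 GB = 1000 MB); the function name `days_until_full` and the sample rates are hypothetical illustrations, not measurements from this cluster.

```python
SECONDS_PER_DAY = 86400


def days_until_full(capacity_gb, used_gb, avg_write_mb_per_s):
    """Days until the disk fills, assuming a constant average write rate."""
    # Convert MB/s into GB/day using decimal units (1 GB = 1000 MB).
    daily_growth_gb = avg_write_mb_per_s * SECONDS_PER_DAY / 1000
    return (capacity_gb - used_gb) / daily_growth_gb


if __name__ == "__main__":
    # Hypothetical average rates; the sustained average is usually lower than observed peaks.
    for rate in (1.0, 4.0):
        print(rate, "MB/s ->", round(days_until_full(1000, 0, rate), 1), "days")
```

Under these assumptions an empty 1 TB disk lasts about 11.6 days at an average of 1 MB/s and about 2.9 days at a sustained 4 MB/s, so a 1~2 week estimate corresponds to an average rate below the observed 4~12 MB/s range.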

So we need a cleanup job, built from curator, the official ES tool, combined with the monit monitoring service.

First, install curator and pyOpenSSL on the server where ES runs:
sudo -H pip install -i https://pypi.doubanio.com/simple/ elasticsearch-curator pyOpenSSL

Then, in the home directory of the current user (say, newtouch), create a .curator directory and add a new file named curator.yml in it with the following content:

---
# Remember, leave a key empty if there is no value.  None will be a string,
# not a Python "NoneType"
client:
  hosts:
    - 127.0.0.1
  port: 9200
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False

logging:
  loglevel: INFO
  logfile:
  logformat: default
  blacklist: ['elasticsearch', 'urllib3']

Next, create a curator_cleanup.yml file. It can live anywhere, as long as the monit service can read it.

actions:
  1:
    action: delete_indices
    description: >-
      Delete indices older than 14 days. Ignore the error if the filter does not result in an actionable list of indices (ignore_empty_list) and exit cleanly.
    options:
      ignore_empty_list: True
      disable_action: False
    filters:
    - filtertype: pattern
      kind: prefix
      value: '.*-'
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 14

unit_count is the number of days of indices to keep; adjust it to suit your situation.
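To make the age filter concrete, here is a minimal Python sketch of the selection rule. It is not curator's actual implementation: `select_older_than` and its parameters are hypothetical names, and it assumes that the date embedded in the index name is read as midnight of that day and compared against "now minus unit_count days".

```python
import re
from datetime import datetime, timedelta

# Matches a date written in the '%Y.%m.%d' form, e.g. 2017.08.06
DATE_RE = re.compile(r"\d{4}\.\d{2}\.\d{2}")


def select_older_than(index_names, unit_count, now):
    """Return the index names whose embedded date is older than unit_count days."""
    cutoff = now - timedelta(days=unit_count)
    selected = []
    for name in index_names:
        match = DATE_RE.search(name)
        if match is None:
            continue  # no date in the name (e.g. ".kibana"), so never selected
        stamp = datetime.strptime(match.group(0), "%Y.%m.%d")
        if stamp < cutoff:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = ["filebeat-2017.08.07", "filebeat-2017.08.08",
             "filebeat-2017.08.09", ".kibana"]
    print(select_older_than(names, 3, datetime(2017, 8, 11, 15, 3)))
```

With the clock at 2017-08-11 15:03 and unit_count set to 3, this sketch selects the 08.07 and 08.08 indices and keeps 08.09, because the cutoff falls on 2017-08-08 15:03.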

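Next we install monit (https://www.mmonit.com/monit/). I like it a lot: it is lightweight, and its syntax is clear and flexible. Here I chose to build it from the source in its git repository; I will skip the installation and basic setup and focus on the monitoring configuration.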

check filesystem docker with path /var/lib/docker for every 5 cycles
  if space usage > 80% then exec "curator /home/newtouch/docker/elasticsearch/curator_cleanup.yml"

Then start the monit service and let it do its job.

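Our ELK cluster is entirely Docker-based, so what we monitor here is Docker's data directory. Once disk usage exceeds 80%, the command is triggered automatically and the outdated indices are removed. Let's see the result.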
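The threshold logic of that rule can be sketched in Python as follows. This only illustrates the idea and is not how monit works internally; `usage_percent` and `cleanup_if_needed` are hypothetical helper names, and the percentage here is used/total, which may differ slightly from the Use% that df reports.

```python
import shutil
import subprocess


def usage_percent(path):
    """Percentage of the filesystem holding `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


def cleanup_if_needed(path, threshold, command, run=subprocess.run):
    """Run `command` when usage of `path` exceeds `threshold` percent.

    `run` is injectable so the logic can be exercised without really
    invoking curator. Returns True if the command was run.
    """
    if usage_percent(path) > threshold:
        run(command, check=True)
        return True
    return False


if __name__ == "__main__":
    # Dry run against the current directory: print instead of executing.
    cleanup_if_needed(".", 80, ["curator", "curator_cleanup.yml"],
                      run=lambda cmd, check: print("would run:", cmd))
```

Calling `cleanup_if_needed("/var/lib/docker", 80, ["curator", "/home/newtouch/docker/elasticsearch/curator_cleanup.yml"])` would mirror the monit rule above, minus the "every 5 cycles" scheduling that monit provides.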

Disk usage before cleanup:

Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/docker  985G  151G  824G  16% /var/lib/docker

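Run the command by hand to try it. To make the difference visible, I set unit_count to 3; the log still says "older than 14 days" because the description text is printed verbatim from the action file.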

curator curator_cleanup.yml 
2017-08-11 15:03:15,917 INFO      Preparing Action ID: 1, "delete_indices"
2017-08-11 15:03:15,919 INFO      Trying Action ID: 1, "delete_indices": Delete indices older than 14 days. Ignore the error if the filter does not result in an actionable list of indices (ignore_empty_list) and exit cleanly.
2017-08-11 15:03:16,676 INFO      Deleting selected indices: [u'metricbeat-2017.08.06', u'metricbeat-2017.08.08', u'metricbeat-2017.08.07', u'metricbeat-2017.08.05', u'packetbeat-2017.08.07', u'packetbeat-2017.08.06', u'packetbeat-2017.08.08', u'packetbeat-2017.08.05', u'filebeat-2017.08.06', u'filebeat-2017.08.07', u'filebeat-2017.08.05', u'filebeat-2017.08.08']
2017-08-11 15:03:16,676 INFO      ---deleting index metricbeat-2017.08.06
2017-08-11 15:03:16,676 INFO      ---deleting index metricbeat-2017.08.08
2017-08-11 15:03:16,676 INFO      ---deleting index metricbeat-2017.08.07
2017-08-11 15:03:16,676 INFO      ---deleting index metricbeat-2017.08.05
2017-08-11 15:03:16,676 INFO      ---deleting index packetbeat-2017.08.07
2017-08-11 15:03:16,676 INFO      ---deleting index packetbeat-2017.08.06
2017-08-11 15:03:16,676 INFO      ---deleting index packetbeat-2017.08.08
2017-08-11 15:03:16,676 INFO      ---deleting index packetbeat-2017.08.05
2017-08-11 15:03:16,676 INFO      ---deleting index filebeat-2017.08.06
2017-08-11 15:03:16,676 INFO      ---deleting index filebeat-2017.08.07
2017-08-11 15:03:16,676 INFO      ---deleting index filebeat-2017.08.05
2017-08-11 15:03:16,676 INFO      ---deleting index filebeat-2017.08.08
2017-08-11 15:03:24,569 INFO      Action ID: 1, "delete_indices" completed.
2017-08-11 15:03:24,570 INFO      Job completed.

Disk usage after cleanup:

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/nsc-data  985G   48G  927G   5% /var/lib/docker

Happy~