0x00 Introduction

This article walks through deploying a complete Prometheus monitoring stack in a Kubernetes environment. It covers a lot of ground and the deployment is noticeably more involved than in the previous article, so if you are not running Kubernetes feel free to skip ahead to the next post rather than be put off by the extra complexity. The article covers the following:

  • Deploy prometheus
  • Deploy grafana
  • Deploy node-exporter
  • Deploy kube-state-metrics
  • Deploy alertmanager

0x01 Create the Namespace

01.namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitor
  labels:
    name: monitor

0x02 Create the ServiceAccount

A ServiceAccount is the credential a Pod presents when it calls the Kubernetes API; binding the ServiceAccount to a ClusterRole grants the account the listed permissions on the specified resources.
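
Once 02.serviceaccount.prometheus.yaml below has been applied, kubectl can impersonate the ServiceAccount to confirm the binding works as expected, for example:

# both should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list pods --as=system:serviceaccount:monitor:prometheus
kubectl auth can-i get nodes/proxy --as=system:serviceaccount:monitor:prometheus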

02.serviceaccount.prometheus.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitor

0x03 Create the ConfigMaps

03.configmap.prometheus.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitor
  labels:
    app: prometheus
data:
  cpu-usage.rule: |
    groups:
      - name: NodeCPUUsage
        rules:
          - alert: NodeCPUUsage
            expr: (100 - (avg by (instance) (irate(node_cpu{name="node-exporter",mode="idle"}[5m])) * 100)) > 75
            for: 2m
            labels:
              severity: "page"
            annotations:
              summary: "{{$labels.instance}}: High CPU usage detected"
              description: "{{$labels.instance}}: CPU usage is above 75% (current value is: {{ $value }})"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-conf
  namespace: monitor
  labels:
    app: prometheus
data:
  prometheus.yml: |-
    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
    rule_files:
    scrape_configs:
      - job_name: 'kubernetes-machine'
        static_configs:
          - targets: ['10.112.22.14:9100']
            labels:
              node: 'k8s-master-dev-1'
          - targets: ['10.112.22.16:9100']
            labels:
              node: 'k8s-node-dev-1'
      - job_name: 'kubernetes-jvm'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: http
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_service_annotation_prometheus_io_jvm_scrape]
            action: keep
            regex: true;true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_app_metrics_patn]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_service_annotation_prometheus_io_app_metrics_port]
            action: replace
            target_label: __address__
            regex: (.+);(.+)
            replacement: $1:$2
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: instance
          - source_labels: [__meta_kubernetes_endpoint_node_name]
            target_label: node
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'k8s-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: kube-state-metrics
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - monitor
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
            regex: kube-state-metrics
            replacement: $1
            action: keep
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: k8s_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: k8s_sname
  • prometheus-rules: the alerting rule file
  • prometheus-conf: the prometheus configuration file

The configuration file defines the following jobs:

  • kubernetes-jvm: monitors Java containers
  • kubernetes-machine: monitors the hosts
  • k8s-cadvisor: collects container metrics
  • kube-state-metrics: monitors Kubernetes cluster state

Blog posts you find on Baidu tend to configure a long list of jobs; pick only the ones you actually need instead of copying them blindly.

In the kubernetes-machine job, the targets are the host IPs and the node label is the host name; adjust both to your environment. The node label is added to make it easier to build and correlate dashboards later.
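
One gap worth closing here: rule_files is left empty in prometheus.yml, while the Deployment in section 0x05 mounts the prometheus-rules ConfigMap at /etc/prometheus/rules, so as written the NodeCPUUsage rule is never loaded. A minimal fix, assuming that mount path, is:

rule_files:
  - /etc/prometheus/rules/*.rule

Also note that node_cpu in the alert expression is the metric name used by node-exporter releases before 0.16; with prom/node-exporter:latest the equivalent metric is node_cpu_seconds_total.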

0x04 Create the PV

The PV defines where Prometheus stores its data. Shared storage such as NFS is recommended; if the backing storage is not shared, you have to pin the Prometheus Pod to a specific node with node affinity.
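
If you do end up on non-shared storage, a rough sketch of pinning the Prometheus Pod to one node via node affinity (the hostname k8s-node-dev-1 is only a placeholder) would be added under spec.template.spec of the Deployment in section 0x05:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - k8s-node-dev-1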

04.pv.prometheus.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: "prometheus-data-pv"
  labels:
    name: prometheus-data-pv
    release: stable
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /u02/prometheus/data
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data-pvc
  namespace: monitor
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      name: prometheus-data-pv
      release: stable
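
After applying the manifest, a quick way to confirm the claim bound to the volume:

kubectl get pv prometheus-data-pv
kubectl get pvc prometheus-data-pvc -n monitor   # STATUS should show Bound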

0x05 Deploy prometheus

05.deployment.prometheus.yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitor
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - mountPath: /prometheus
              name: prometheus-data-volume
            - mountPath: /etc/prometheus/prometheus.yml
              name: prometheus-conf-volume
              subPath: prometheus.yml
            - mountPath: /etc/prometheus/rules
              name: prometheus-rules-volume
          ports:
            - containerPort: 9090
              protocol: TCP
      volumes:
        - name: prometheus-data-volume
          persistentVolumeClaim:
            claimName: prometheus-data-pvc
        - name: prometheus-conf-volume
          configMap:
            name: prometheus-conf
        - name: prometheus-rules-volume
          configMap:
            name: prometheus-rules
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
---
kind: Service
apiVersion: v1
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: prometheus
  name: prometheus-service
  namespace: monitor
spec:
  ports:
    - port: 9090
      targetPort: 9090
      nodePort: 30001
  selector:
    app: prometheus
  type: NodePort
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: 500m
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  name: prometheus-ui
  namespace: monitor
spec:
  rules:
    - host: prometheus.company.com
      http:
        paths:
          - backend:
              serviceName: prometheus-service
              servicePort: 9090

This manifest uses the resources created earlier, including:

  • ServiceAccount: serviceAccountName: prometheus
  • Data storage: claimName: prometheus-data-pvc
  • Configuration file: name: prometheus-conf
  • Alerting rules: name: prometheus-rules

It also creates a Service and an Ingress for external access.

The Ingress host is set to prometheus.company.com; change it to whatever domain you actually use.
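
Even before the DNS or hosts mapping from section 0x0B is in place, the NodePort can be used to confirm Prometheus is up ({node-ip} is a placeholder for any node's IP):

kubectl get pods,svc,ingress -n monitor
curl -s http://{node-ip}:30001/-/healthy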

0x06 Deploy node-exporter

node-exporter collects host metrics. As mentioned in the previous article, node-exporter is not well suited to running inside a container, so deploy it directly on the hosts if you can. If you do run it inside Kubernetes, every node must run exactly one instance, which is what a Kubernetes DaemonSet provides.

06.daemonset.node-exporter.yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9100
              protocol: TCP
              name: http
      hostNetwork: true
      hostPID: true
      tolerations:
        - operator: Exists
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: node-exporter
  name: node-exporter-service
  namespace: monitor
spec:
  ports:
    - name: http
      port: 9100
      nodePort: 30011
      protocol: TCP
  type: NodePort
  selector:
    app: node-exporter

Pay particular attention to these three settings:

hostNetwork: true
hostPID: true
tolerations:
  - operator: Exists

hostNetwork and hostPID let the container see the host's network and process information, and the tolerations entry allows node-exporter to run on master nodes as well, since by default masters do not schedule ordinary workloads.
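
A quick check that the DaemonSet really landed on every node, master included, and that host metrics are being exposed ({node-ip} is a placeholder):

kubectl get pods -n monitor -o wide -l app=node-exporter
curl -s http://{node-ip}:9100/metrics | head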

0x07 Deploy grafana

  • Create the grafana data PV
07.pv.grafana.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: "grafana-data-pv"
  labels:
    name: grafana-data-pv
    release: stable
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /u02/grafana/data
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data-pvc
  namespace: monitor
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      name: grafana-data-pv
      release: stable
Create the PV according to your own storage setup.
  • Deploy grafana-server
08.deployment.grafana.yaml
---
kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitor
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: GF_AUTH_BASIC_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "false"
          readinessProbe:
            httpGet:
              path: /login
              port: 3000
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: grafana-data-volume
          ports:
            - containerPort: 3000
              protocol: TCP
      volumes:
        - name: grafana-data-volume
          persistentVolumeClaim:
            claimName: grafana-data-pvc
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: grafana
  name: grafana-service
  namespace: monitor
spec:
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30000
  selector:
    app: grafana
  type: NodePort
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 500m
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  name: grafana-ingress
  namespace: monitor
spec:
  rules:
    - host: grafana.company.com
      http:
        paths:
          - backend:
              serviceName: grafana-service
              servicePort: 3000
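
Once Grafana is running you still need to add Prometheus as a data source. That can be done in the web UI, or scripted against Grafana's HTTP API; a sketch assuming the default admin/admin credentials and the in-cluster Service name:

curl -u admin:admin -X POST http://grafana.company.com/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","access":"proxy","url":"http://prometheus-service.monitor.svc:9090","isDefault":true}'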

0x08 Deploy kube-state-metrics

kube-state-metrics exposes the state of Kubernetes objects such as nodes, deployments, and pods.

09.deployment.kube-state-metric.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
rules:
  - apiGroups:
      - ""
    resources:
      - configmaps
      - secrets
      - nodes
      - pods
      - services
      - resourcequotas
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - endpoints
    verbs:
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - statefulsets
      - daemonsets
      - deployments
      - replicasets
    verbs:
      - list
      - watch
  - apiGroups:
      - batch
    resources:
      - cronjobs
      - jobs
    verbs:
      - list
      - watch
  - apiGroups:
      - autoscaling
    resources:
      - horizontalpodautoscalers
    verbs:
      - list
      - watch
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
  - apiGroups:
      - policy
    resources:
      - poddisruptionbudgets
    verbs:
      - list
      - watch
  - apiGroups:
      - certificates.k8s.io
    resources:
      - certificatesigningrequests
    verbs:
      - list
      - watch
  - apiGroups:
      - storage.k8s.io
    resources:
      - storageclasses
      - volumeattachments
    verbs:
      - list
      - watch
  - apiGroups:
      - admissionregistration.k8s.io
    resources:
      - mutatingwebhookconfigurations
      - validatingwebhookconfigurations
    verbs:
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - daemonsets
      - deployments
      - replicasets
      - ingresses
    verbs:
      - list
      - watch
  - apiGroups:
      - networking.k8s.io
    resources:
      - networkpolicies
      - ingresses
    verbs:
      - list
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: monitor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.9.7
    spec:
      containers:
        - image: quay.mirrors.ustc.edu.cn/coreos/kube-state-metrics:v1.9.7
          imagePullPolicy: IfNotPresent
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            timeoutSeconds: 5
          name: kube-state-metrics
          ports:
            - containerPort: 8080
              name: http-metrics
            - containerPort: 8081
              name: telemetry
          readinessProbe:
            httpGet:
              path: /
              port: 8081
            initialDelaySeconds: 5
            timeoutSeconds: 5
      nodeSelector:
        beta.kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
  namespace: monitor
spec:
  clusterIP: None
  ports:
    - name: http-metrics
      port: 8080
      targetPort: http-metrics
    - name: telemetry
      port: 8081
      targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics

Like Prometheus, kube-state-metrics uses its own ServiceAccount to call the Kubernetes API and collect resource information.
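
The Service is headless (clusterIP: None), so the simplest local check is a port-forward, for example:

kubectl -n monitor port-forward svc/kube-state-metrics 8080:8080
# in another terminal
curl -s http://localhost:8080/metrics | grep kube_pod_status_phase | head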

0x09 Deploy AlertManager

10.deployment.alertmanager.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'foo@bar.com'
      smtp_auth_username: 'foo@bar.com'
      smtp_auth_password: 'barfoo'
      slack_api_url: 'https://hooks.slack.com/services/abc123'
    templates:
    - '/etc/alertmanager-templates/*.tmpl'
    # The root route on which each incoming alert enters.
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      #repeat_interval: 1m
      repeat_interval: 15m
      # A default receiver
      # If an alert isn't caught by a route, send it to default.
      receiver: default
      # All the above attributes are inherited by all child routes and can
      # overwritten on each.
      # The child route trees.
      routes:
      # Send severity=slack alerts to slack.
      - match:
          severity: slack
        receiver: slack_alert
      - match:
          severity: email
        receiver: slack_alert
    #   receiver: email_alert
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#devops'
        text: '<!channel>{{ template "slack.devops.text" . }}'
        send_resolved: true
    - name: 'slack_alert'
      slack_configs:
      - channel: '#devops'
        send_resolved: true
    - name: 'email_alert'
      email_configs:
      - to: 'foo@bar.com'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
        - name: alertmanager
          image: prom/alertmanager:latest
          args:
            - '--config.file=/etc/alertmanager/config.yml'
            - '--storage.path=/alertmanager'
          ports:
            - name: alertmanager
              containerPort: 9093
          volumeMounts:
            - name: config-volume
              mountPath: /etc/alertmanager
            - name: templates-volume
              mountPath: /etc/alertmanager-templates
            - name: alertmanager
              mountPath: /alertmanager
      volumes:
        - name: config-volume
          configMap:
            name: alertmanager
        - name: templates-volume
          configMap:
            name: alertmanager-templates
        - name: alertmanager
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/alertmanager/metrics'
  labels:
    name: alertmanager
  name: alertmanager-service
  namespace: monitor
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
    - name: alertmanager
      protocol: TCP
      port: 9093
      targetPort: 9093
      nodePort: 30028
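
Two small gaps to close before alerts actually flow. First, the Deployment mounts a ConfigMap named alertmanager-templates that is not defined anywhere in this post; an empty one is enough to let the Pod start:

kubectl create configmap alertmanager-templates -n monitor

Second, the alerting block in prometheus-conf has no targets, so Prometheus never forwards alerts to Alertmanager. Pointing it at the Service created here (a sketch using the in-cluster DNS name) looks like:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager-service.monitor.svc:9093']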

0x0A Run the Installation

  • The manifests at a glance
# ll
-rw-r--r--   1 asan  staff     84 Apr 29 15:20 01.namespace.yaml
-rwxr-xr-x@  1 asan  staff    745 Apr 29 16:29 02.serviceaccount.prometheus.yaml*
-rwxr-xr-x@  1 asan  staff   6604 Jun  1 15:00 03.configmap.prometheus.yaml*
-rw-r--r--   1 asan  staff    591 Apr 29 15:49 04.pv.prometheus.yaml
-rwxr-xr-x@  1 asan  staff   2206 Jun  1 16:04 05.deployment.prometheus.yaml*
-rw-r--r--   1 asan  staff    881 May 27 23:57 06.daemonset.node-exporter.yaml
-rw-r--r--   1 asan  staff    576 Jun  1 16:30 07.pv.grafana.yaml
-rw-r--r--   1 asan  staff   1808 Jun  1 16:31 08.deployment.grafana.yaml
-rw-r--r--   1 asan  staff   4230 Apr 29 18:23 09.deployment.kube-state-metric.yaml
-rw-r--r--   1 asan  staff   2953 Jun  1 16:25 10.deployment.alertmanager.yaml
  • Apply the manifests
kubectl apply -f 01.namespace.yaml --record
kubectl apply -f 02.serviceaccount.prometheus.yaml --record
kubectl apply -f 03.configmap.prometheus.yaml --record
kubectl apply -f 04.pv.prometheus.yaml --record
kubectl apply -f 05.deployment.prometheus.yaml --record
kubectl apply -f 06.daemonset.node-exporter.yaml --record
kubectl apply -f 07.pv.grafana.yaml --record
kubectl apply -f 08.deployment.grafana.yaml --record
kubectl apply -f 09.deployment.kube-state-metric.yaml --record
kubectl apply -f 10.deployment.alertmanager.yaml --record
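
Once everything has been applied, a couple of commands give an overview of the whole namespace:

kubectl get all -n monitor
kubectl get pvc,ingress -n monitor
kubectl get pv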

0x0B Access

We configured an Ingress for both prometheus and grafana, so each can be reached directly by domain name:

  • prometheus.company.com
  • grafana.company.com
Map these domain names to the IP of a Kubernetes node. This can be done in DNS, or for local testing simply in your local hosts file:
{any non-master node IP} prometheus.company.com
{any non-master node IP} grafana.company.com
If there are multiple nodes, put a load balancer such as F5 or Nginx in front to forward the traffic.