0x00 Introduction
This article walks through deploying a complete Prometheus monitoring stack in a Kubernetes environment. It covers a lot of material, and the deployment is considerably more involved than in the previous article, so if you are not running Kubernetes you may skip ahead to the next article rather than be put off by the complexity. The article covers:
- Deploying Prometheus
- Deploying Grafana
- Deploying node-exporter
- Deploying kube-state-metrics
- Deploying Alertmanager
0x01 Create the Namespace
01.namespace.yaml
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitor
  labels:
    name: monitor
```
0x02 Create the ServiceAccount
A ServiceAccount is the credential a Pod presents when calling the Kubernetes API. Binding the ServiceAccount to a ClusterRole grants the account the listed permissions on the specified resources.
02.serviceaccount.prometheus.yaml
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitor
  labels:
    app: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitor
```
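Once applied, the binding can be sanity-checked with `kubectl auth can-i`; the service account name comes from the manifest above, and the resources queried here are just examples:
```sh
# Both commands should print "yes" if the ClusterRoleBinding is in effect
kubectl auth can-i list pods --as=system:serviceaccount:monitor:prometheus
kubectl auth can-i get nodes --as=system:serviceaccount:monitor:prometheus
```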
0x03 Create the ConfigMaps
03.configmap.prometheus.yaml
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitor
  labels:
    app: prometheus
data:
  cpu-usage.rule: |
    groups:
      - name: NodeCPUUsage
        rules:
          - alert: NodeCPUUsage
            expr: (100 - (avg by (instance) (irate(node_cpu{name="node-exporter",mode="idle"}[5m])) * 100)) > 75
            for: 2m
            labels:
              severity: "page"
            annotations:
              summary: "{{$labels.instance}}: High CPU usage detected"
              description: "{{$labels.instance}}: CPU usage is above 75% (current value is: {{ $value }})"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-conf
  namespace: monitor
  labels:
    app: prometheus
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    alerting:
      alertmanagers:
      - static_configs:
        - targets:
    rule_files:
    scrape_configs:
      - job_name: 'kubernetes-machine'
        static_configs:
        - targets: ['10.112.22.14:9100']
          labels:
            node: 'k8s-master-dev-1'
        - targets: ['10.112.22.16:9100']
          labels:
            node: 'k8s-node-dev-1'
      - job_name: 'kubernetes-jvm'
        kubernetes_sd_configs:
        - role: endpoints
        scheme: http
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape, __meta_kubernetes_service_annotation_prometheus_io_jvm_scrape]
          action: keep
          regex: true;true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_app_metrics_patn]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_service_annotation_prometheus_io_app_metrics_port]
          action: replace
          target_label: __address__
          regex: (.+);(.+)
          replacement: $1:$2
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: instance
        - source_labels: [__meta_kubernetes_endpoint_node_name]
          target_label: node
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
      - job_name: 'kubernetes-nodes'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics
      - job_name: 'k8s-cadvisor'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        kubernetes_sd_configs:
        - role: node
        relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      - job_name: kube-state-metrics
        kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
            - monitor
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
          regex: kube-state-metrics
          replacement: $1
          action: keep
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: k8s_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: k8s_sname
```
- prometheus-rules: the alerting rule file
- prometheus-conf: the Prometheus configuration file
The configuration file defines the following jobs:
- kubernetes-jvm: monitors Java containers (see the example Service after this list)
- kubernetes-machine: monitors the hosts
- k8s-cadvisor: collects container metrics via cAdvisor
- kube-state-metrics: monitors Kubernetes cluster state
Blog posts found through a quick search tend to configure a long list of jobs; pick only what you actually need instead of copying them blindly.
In the kubernetes-machine job the targets are the host IPs and the node label holds the hostname; adjust both to your environment. The node label is added to make it easier to build dashboards and correlate metrics later.
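For the kubernetes-jvm job, a Service is only scraped if it carries the annotations referenced by the relabel rules above. The following is a minimal sketch; the annotation keys are inferred from the relabel configuration, while the Service name, metrics path, and port are made-up examples:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-java-app              # hypothetical application Service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"   # both flags must be "true" to pass the keep rule (regex: true;true)
    prometheus.io/jvm_scrape: "true"
    prometheus.io/app-metrics-patn: "/actuator/prometheus"  # rewritten into __metrics_path__
    prometheus.io/app-metrics-port: "8080"                  # joined with the Pod IP into __address__
spec:
  selector:
    app: demo-java-app
  ports:
  - port: 8080
    targetPort: 8080
```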
0x04 Create the PV
The PV defines the storage backing the Prometheus data. Shared storage such as NFS is recommended; if the storage is not shared, you must pin Prometheus to a specific host with node affinity (see the sketch after the manifest).
04.pv.yaml
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: "prometheus-data-pv"
  labels:
    name: prometheus-data-pv
    release: stable
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /u02/prometheus/data
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data-pvc
  namespace: monitor
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      name: prometheus-data-pv
      release: stable
```
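Because the hostPath above lives on a single node when no shared storage is used, the Prometheus Pod has to be scheduled onto that node or it would lose its data on a reschedule. A minimal sketch, assuming the node's kubernetes.io/hostname label is k8s-node-dev-1 (replace with your own); it would go under spec.template.spec of the Deployment in the next section:
```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - k8s-node-dev-1   # example hostname: the node that holds /u02/prometheus/data
```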
0x05 Deploy Prometheus
05.deployment.prometheus.yaml
```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: monitor
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
      containers:
      - name: prometheus
        image: prom/prometheus:latest
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: /prometheus
          name: prometheus-data-volume
        - mountPath: /etc/prometheus/prometheus.yml
          name: prometheus-conf-volume
          subPath: prometheus.yml
        - mountPath: /etc/prometheus/rules
          name: prometheus-rules-volume
        ports:
        - containerPort: 9090
          protocol: TCP
      volumes:
      - name: prometheus-data-volume
        persistentVolumeClaim:
          claimName: prometheus-data-pvc
      - name: prometheus-conf-volume
        configMap:
          name: prometheus-conf
      - name: prometheus-rules-volume
        configMap:
          name: prometheus-rules
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
---
kind: Service
apiVersion: v1
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: prometheus
  name: prometheus-service
  namespace: monitor
spec:
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30001
  selector:
    app: prometheus
  type: NodePort
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/proxy-body-size: 500m
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  name: prometheus-ui
  namespace: monitor
spec:
  rules:
  - host: prometheus.company.com
    http:
      paths:
      - backend:
          serviceName: prometheus-service
          servicePort: 9090
```
This manifest references the resources created earlier:
- service account: serviceAccountName: prometheus
- data storage: claimName: prometheus-data-pvc
- configuration file: name: prometheus-conf
- alerting rule file: name: prometheus-rules
It also creates a Service and an Ingress for external access.
The Ingress host is configured as prometheus.company.com here; change it to match your environment.
0x06 Deploy node-exporter
node-exporter collects host-level metrics. As mentioned in the previous article, node-exporter is not well suited to running inside a container, so deploy it directly on the hosts if you can. If you do run it in Kubernetes, every host must run exactly one node-exporter instance, which is what a Kubernetes DaemonSet provides.
06.daemonset.node-exporter.yaml
```yaml
kind: DaemonSet
apiVersion: apps/v1
metadata:
  labels:
    app: node-exporter
  name: node-exporter
  namespace: monitor
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: prom/node-exporter:latest
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9100
          protocol: TCP
          name: http
      hostNetwork: true
      hostPID: true
      tolerations:
      - operator: Exists
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: node-exporter
  name: node-exporter-service
  namespace: monitor
spec:
  ports:
  - name: http
    port: 9100
    nodePort: 30011
    protocol: TCP
  type: NodePort
  selector:
    app: node-exporter
```
Pay particular attention to these three settings:
```yaml
hostNetwork: true
hostPID: true
tolerations:
- operator: Exists
```
hostNetwork and hostPID give the container access to the host's network and process information, and the tolerations entry allows node-exporter to run on master nodes as well; by default no workloads are scheduled onto masters.
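A quick way to confirm this is to check that the DaemonSet has one Pod per node, masters included (the output will of course differ per cluster):
```sh
kubectl get daemonset node-exporter -n monitor
kubectl get pods -n monitor -l app=node-exporter -o wide
```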
0x07 Deploy Grafana
- Create the PV backing Grafana storage
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: "grafana-data-pv"
  labels:
    name: grafana-data-pv
    release: stable
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /u02/grafana/data
    type: DirectoryOrCreate
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data-pvc
  namespace: monitor
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      name: grafana-data-pv
      release: stable
```
Adjust the PV to your environment.
- Deploy grafana-server
```yaml
---
kind: Deployment
apiVersion: apps/v1
metadata:
  labels:
    app: grafana
  name: grafana
  namespace: monitor
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:latest
        imagePullPolicy: IfNotPresent
        env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "false"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-data-volume
        ports:
        - containerPort: 3000
          protocol: TCP
      volumes:
      - name: grafana-data-volume
        persistentVolumeClaim:
          claimName: grafana-data-pvc
---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: grafana
  name: grafana-service
  namespace: monitor
spec:
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 30000
  selector:
    app: grafana
  type: NodePort
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: 500m
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  creationTimestamp: null
  generation: 1
  name: grafana-ingress
  namespace: monitor
spec:
  rules:
  - host: grafana.company.com
    http:
      paths:
      - backend:
          serviceName: grafana-service
          servicePort: 3000
status:
  loadBalancer: {}
```
0x08 Deploy kube-state-metrics
kube-state-metrics exposes the state of Kubernetes resources such as nodes, deployments, and pods.
09.deployment.kube-state-metrics.yaml
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
  namespace: monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs:
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - daemonsets
  - deployments
  - replicasets
  verbs:
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - jobs
  verbs:
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  verbs:
  - list
  - watch
- apiGroups:
  - authentication.k8s.io
  resources:
  - tokenreviews
  verbs:
  - create
- apiGroups:
  - authorization.k8s.io
  resources:
  - subjectaccessreviews
  verbs:
  - create
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  verbs:
  - list
  - watch
- apiGroups:
  - certificates.k8s.io
  resources:
  - certificatesigningrequests
  verbs:
  - list
  - watch
- apiGroups:
  - storage.k8s.io
  resources:
  - storageclasses
  - volumeattachments
  verbs:
  - list
  - watch
- apiGroups:
  - admissionregistration.k8s.io
  resources:
  - mutatingwebhookconfigurations
  - validatingwebhookconfigurations
  verbs:
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - deployments
  - replicasets
  - ingresses
  verbs:
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - networkpolicies
  - ingresses
  verbs:
  - list
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/version: v1.9.7
    spec:
      containers:
      - image: quay.mirrors.ustc.edu.cn/coreos/kube-state-metrics:v1.9.7
        imagePullPolicy: IfNotPresent
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
        name: kube-state-metrics
        ports:
        - containerPort: 8080
          name: http-metrics
        - containerPort: 8081
          name: telemetry
        readinessProbe:
          httpGet:
            path: /
            port: 8081
          initialDelaySeconds: 5
          timeoutSeconds: 5
      nodeSelector:
        beta.kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/version: v1.9.7
  name: kube-state-metrics
  namespace: monitor
spec:
  clusterIP: None
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
  - name: telemetry
    port: 8081
    targetPort: telemetry
  selector:
    app.kubernetes.io/name: kube-state-metrics
```
It likewise uses a ServiceAccount to call the Kubernetes API and read resource state.
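Because the Service is headless (clusterIP: None), a simple spot check is a temporary port-forward to the Deployment; the curl just prints the first few exposed series:
```sh
kubectl -n monitor port-forward deploy/kube-state-metrics 8080:8080 &
curl -s http://localhost:8080/metrics | head -n 20
```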
0x09 Deploy Alertmanager
10.deployment.alertmanager.yaml
```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitor
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_from: 'foo@bar.com'
      smtp_auth_username: 'foo@bar.com'
      smtp_auth_password: 'barfoo'
      slack_api_url: 'https://hooks.slack.com/services/abc123'
    templates:
    - '/etc/alertmanager-templates/*.tmpl'
    # The root route on which each incoming alert enters.
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      #repeat_interval: 1m
      repeat_interval: 15m
      # A default receiver
      # If an alert isn't caught by a route, send it to default.
      receiver: default
      # All the above attributes are inherited by all child routes and can
      # overwritten on each.
      # The child route trees.
      routes:
      # Send severity=slack alerts to slack.
      - match:
          severity: slack
        receiver: slack_alert
      - match:
          severity: email
        receiver: slack_alert
        # receiver: email_alert
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#devops'
        text: '<!channel>{{ template "slack.devops.text" . }}'
        send_resolved: true
    - name: 'slack_alert'
      slack_configs:
      - channel: '#devops'
        send_resolved: true
    - name: 'email_alert'
      email_configs:
      - to: 'foo@bar.com'
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitor
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:latest
        args:
        - '-config.file=/etc/alertmanager/config.yml'
        - '-storage.path=/alertmanager'
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: templates-volume
          mountPath: /etc/alertmanager-templates
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager
      - name: templates-volume
        configMap:
          name: alertmanager-templates
      - name: alertmanager
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/alertmanager/metrics'
  labels:
    name: alertmanager
  name: alertmanager-service
  namespace: monitor
spec:
  selector:
    app: alertmanager
  type: NodePort
  ports:
  - name: alertmanager
    protocol: TCP
    port: 9093
    targetPort: 9093
    nodePort: 30028
```
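Note that the prometheus.yml from 0x03 leaves both the alerting targets and rule_files empty. A minimal sketch of how they might be filled in so that Prometheus loads the mounted rule file and forwards alerts to the Alertmanager Service above (the target uses the standard service.namespace DNS form; adjust if yours differs):
```yaml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'alertmanager-service.monitor:9093'   # the Service defined above, in the monitor namespace
rule_files:
- /etc/prometheus/rules/*.rule                # the prometheus-rules ConfigMap is mounted at this path
```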
0x0A Run the Installation
- The manifest files
```
# ll
-rw-r--r--  1 asan  staff    84 Apr 29 15:20 01.namespace.yaml
-rwxr-xr-x@ 1 asan  staff   745 Apr 29 16:29 02.serviceaccount.prometheus.yaml*
-rwxr-xr-x@ 1 asan  staff  6604 Jun  1 15:00 03.configmap.prometheus.yaml*
-rw-r--r--  1 asan  staff   591 Apr 29 15:49 04.pv.prometheus.yaml
-rwxr-xr-x@ 1 asan  staff  2206 Jun  1 16:04 05.deployment.prometheus.yaml*
-rw-r--r--  1 asan  staff   881 May 27 23:57 06.daemonset.node-exporter.yaml
-rw-r--r--  1 asan  staff   576 Jun  1 16:30 07.pv.grafana.yaml
-rw-r--r--  1 asan  staff  1808 Jun  1 16:31 08.deployment.grafana.yaml
-rw-r--r--  1 asan  staff  4230 Apr 29 18:23 09.deployment.kube-state-metric.yaml
-rw-r--r--  1 asan  staff  2953 Jun  1 16:25 10.deployment.alertmanager.yaml
```
- Apply the manifests
```sh
kubectl apply -f 01.namespace.yaml --record
kubectl apply -f 02.serviceaccount.prometheus.yaml --record
kubectl apply -f 03.configmap.prometheus.yaml --record
kubectl apply -f 04.pv.prometheus.yaml --record
kubectl apply -f 05.deployment.prometheus.yaml --record
kubectl apply -f 06.daemonset.node-exporter.yaml --record
kubectl apply -f 07.pv.grafana.yaml --record
kubectl apply -f 08.deployment.grafana.yaml --record
kubectl apply -f 09.deployment.kube-state-metric.yaml --record
kubectl apply -f 10.deployment.alertmanager.yaml --record
```
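Once everything is applied, a quick way to confirm that the workloads came up (names and ages will differ in your cluster):
```sh
kubectl get pods,svc,ingress -n monitor
```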
0x0B Access
Prometheus and Grafana each have an Ingress, so both can be reached directly by domain name:
- prometheus.company.com
- grafana.company.com
Map the domain names to the Kubernetes node IPs, either in DNS or, for quick local testing, in your local hosts file:
```
{any non-master node IP} prometheus.company.com
{any non-master node IP} grafana.company.com
```
If there are multiple nodes, place a load balancer such as F5 or Nginx in front to forward the traffic.