Reinforcement Learning and LLM Alignment: A Practical Guide from RLHF to DPO
2026/5/11 21:12:32
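Kubernetes resource management and cost optimization are central concerns in cloud-native operations. Configuring resources sensibly, tuning scheduling policies, and managing usage at a fine granularity can significantly reduce infrastructure costs.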
```
┌────────────────────────────────────────────────────────────────────────┐
│                     Cost Optimization Architecture                     │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐    ┌────────────┐  │
│  │  Analysis  │───▶│   Advice   │───▶│ Execution  │───▶│ Monitoring │  │
│  │ (Metrics)  │    │ (Advisor)  │    │ (Actions)  │    │ (Billing)  │  │
│  └────────────┘    └────────────┘    └────────────┘    └────────────┘  │
│        │                                   │                           │
│        ▼                                   ▼                           │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                        Kubernetes Cluster                        │  │
│  │                (Nodes / Pods / Storage / Network)                │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
```

| Dimension | Optimization focus | Tools / methods |
|---|---|---|
| Compute | CPU and memory right-sizing | HPA/VPA, resource requests/limits |
| Storage | Storage optimization | Local PV, StorageClass |
| Network | Traffic optimization | NetworkPolicy, CDN |
| Nodes | Node scheduling optimization | Node Affinity, Taints |
| Idle resources | Idle resource cleanup | Automated cleanup scripts |
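To make the HPA entry in the compute row more concrete, the sketch below reproduces the replica calculation documented for the Horizontal Pod Autoscaler: `desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)`, clamped to the configured minimum and maximum. The function name `desired_replicas` and its default bounds (1 and 5) are assumptions made for illustration, and the sketch ignores the controller's tolerance band and stabilization window.

```bash
#!/bin/bash
# Hypothetical helper: approximates the HPA replica calculation.
# Usage: desired_replicas CURRENT_REPLICAS CURRENT_UTIL_PCT TARGET_UTIL_PCT [MIN] [MAX]
desired_replicas() {
  local current=$1 util=$2 target=$3
  local min=${4:-1} max=${5:-5}   # illustrative default bounds

  # Integer ceiling of (current * util / target)
  local desired=$(( (current * util + target - 1) / target ))

  # Clamp to the [min, max] range, as minReplicas/maxReplicas would
  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi

  echo "$desired"
}
```

For example, `desired_replicas 3 90 60` prints `5`: three replicas running at 90% CPU against a 60% target need ceil(4.5) = 5 replicas to bring average utilization back under the target.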
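Start with explicit resource requests and limits on the workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: optimized-app
spec:
  selector:
    matchLabels:
      app: optimized-app      # apps/v1 requires a selector matching the template labels
  template:
    metadata:
      labels:
        app: optimized-app
    spec:
      containers:
        - name: app
          image: my-app:1.0.0
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "200m"
```

A HorizontalPodAutoscaler scales the replica count on CPU and memory utilization:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: optimized-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```

A VerticalPodAutoscaler adjusts per-container requests within fixed bounds. The VPA project advises against running it in `Auto` mode on the same workload as an HPA that scales on CPU or memory, so treat this manifest and the previous one as alternatives rather than a pair:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: optimized-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 128Mi
        maxAllowed:
          cpu: 1
          memory: 1Gi
```

Node affinity and tolerations steer a Pod toward cheaper spot nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: database-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: node-type
                operator: In
                values:
                  - spot
  tolerations:
    - key: "spot"
      operator: "Exists"
      effect: "NoSchedule"
```

With Cluster API, the spot node pool itself is declared as a MachineDeployment:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: spot-nodes
spec:
  clusterName: my-cluster          # placeholder; required by the API, not given in the source
  replicas: 3
  selector:
    matchLabels:
      node-type: spot
  template:
    metadata:
      labels:
        node-type: spot
    spec:
      clusterName: my-cluster      # placeholder, must match spec.clusterName
      # providerID is assigned per Machine by the infrastructure provider,
      # so it is not set in the template. A bootstrap config reference is
      # also required and is omitted here.
      infrastructureRef:           # was "nodeRef", which is a status field
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: spot-template
```

Node-group scaling is sketched below as an HPA. As written it targets a Service, which has no scale subresource, so this manifest would not scale anything; in practice node counts are managed by a node autoscaler such as Cluster Autoscaler.

```yaml
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: node-group-hpa
spec:
  scaleTargetRef:
    apiVersion: v1
    kind: Service
    name: node-group
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
```

For storage, define a StorageClass that allows volume expansion:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp2
  fsType: ext4
allowVolumeExpansion: true
mountOptions:
  - noatime
reclaimPolicy: Delete
```

A local PersistentVolume pins data to a specific node's SSD:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
```

A cleanup script removes PVCs that no Pod mounts and PVs that are not bound:

```bash
#!/bin/bash
# Clean up PVCs that no Pod references
echo "Cleaning up unused PVCs..."

# Collect every namespace/claimName pair that some Pod mounts
used=$(kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | .metadata.namespace as $ns
         | .spec.volumes[]? | .persistentVolumeClaim.claimName // empty
         | "\($ns)/\(.)"' | sort -u)

kubectl get pvc --all-namespaces -o json | \
  jq -r '.items[] | select(.status.phase == "Bound")
         | "\(.metadata.namespace) \(.metadata.name)"' | \
  while read -r ns pvc; do
    if ! grep -qxF "$ns/$pvc" <<< "$used"; then
      echo "Deleting unused PVC: $ns/$pvc"
      # A PVC is namespaced, so delete it in its own namespace
      kubectl delete pvc "$pvc" -n "$ns"
    fi
  done

# Clean up PVs that are not bound to any claim
echo "Cleaning up unbound PVs..."
kubectl get pv -o json | \
  jq -r '.items[] | select(.status.phase == "Available") | .metadata.name' | \
  while read -r pv; do
    echo "Deleting unbound PV: $pv"
    kubectl delete pv "$pv"
  done
```

For cost monitoring, a ServiceMonitor scrapes the cost exporter:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cost-monitor
spec:
  selector:
    matchLabels:
      app: cost-exporter
  endpoints:
    - port: metrics
      interval: 30s
```

The following PromQL queries estimate cost. The metric `node_hourly_cost` is assumed to come from the cost exporter and to carry a `node` label, and the multipliers 0.05 and 0.0001 are illustrative unit prices:

```promql
# Node cost (both sides are aggregated by node so the on(node) join can match)
sum by (node) (kube_node_labels) * on(node) sum by (node) (node_hourly_cost)

# Pod cost (cumulative CPU seconds times unit price; wrap in rate() for a per-interval figure)
sum(container_cpu_usage_seconds_total) by (pod, namespace) * 0.05

# Storage cost
sum(kube_persistentvolumeclaim_resource_requests_storage_bytes) by (namespace) * 0.0001
```

A Grafana dashboard presents the same figures:

```json
{
  "title": "Kubernetes Cost Dashboard",
  "panels": [
    {
      "type": "graph",
      "targets": [
        { "expr": "sum(node_hourly_cost)", "legendFormat": "Total Node Cost" }
      ]
    },
    {
      "type": "stat",
      "targets": [
        {
          "expr": "sum(kube_persistentvolumeclaim_resource_requests_storage_bytes) * 0.0001",
          "legendFormat": "Storage Cost"
        }
      ]
    },
    {
      "type": "table",
      "targets": [
        {
          "expr": "sum(container_cpu_usage_seconds_total) by (namespace) * 0.05",
          "legendFormat": "{{namespace}}"
        }
      ]
    }
  ]
}
```

As a best practice, parameterize resource settings through Helm values so they can be tuned per environment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cost-optimized-app
spec:
  selector:
    matchLabels:
      app: cost-optimized-app   # apps/v1 requires a selector matching the template labels
  template:
    metadata:
      labels:
        app: cost-optimized-app
    spec:
      containers:
        - name: app
          image: my-app:1.0.0
          resources:
            requests:
              memory: "{{ .Values.resources.requests.memory }}"
              cpu: "{{ .Values.resources.requests.cpu }}"
            limits:
              memory: "{{ .Values.resources.limits.memory }}"
              cpu: "{{ .Values.resources.limits.cpu }}"
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 5"]
```

A periodic script clears finished Pods, completed Jobs, and ownerless ConfigMaps:

```bash
#!/bin/bash
# Remove Pods in a terminal phase
kubectl delete pods --all-namespaces --field-selector status.phase=Succeeded
kubectl delete pods --all-namespaces --field-selector status.phase=Failed

# Remove completed Jobs (the supported Job field selector is status.successful)
kubectl delete jobs --all-namespaces --field-selector status.successful=1

# Remove ConfigMaps that have no owner.
# Caution: a missing ownerReference does not prove a ConfigMap is unused,
# so review the candidate list before running this against a real cluster.
kubectl get configmaps --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.ownerReferences == null)
         | "\(.metadata.namespace) \(.metadata.name)"' | \
  while read -r ns cm; do
    kubectl delete configmap "$cm" -n "$ns"
  done
```

A budget with alert thresholds can be expressed as a custom resource. The `budgets.example.com` API group is a placeholder for a CRD you would define yourself, not a built-in Kubernetes API:

```yaml
apiVersion: budgets.example.com/v1
kind: Budget
metadata:
  name: monthly-budget
spec:
  limit: 10000
  period: monthly
  alertThresholds:
    - threshold: 80
      action: notify
    - threshold: 95
      action: restrict
```

Kubernetes cost optimization is a continuously iterative process. Setting resource requests and limits sensibly, optimizing node scheduling, applying storage optimizations, and building a cost monitoring system can significantly reduce the operating cost of cloud-native infrastructure.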
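As a closing illustration of the budget thresholds (notify at 80%, restrict at 95% of a 10000 monthly limit), the sketch below maps a spend figure to the action a budget controller might take. The function name `budget_action`, the `ok` result, and the use of whole-number spend values are assumptions; the `Budget` resource itself is a placeholder CRD, so no real controller is implied.

```bash
#!/bin/bash
# Hypothetical helper: maps current spend to a budget action.
# Usage: budget_action SPEND [LIMIT]   (integers, same currency unit)
budget_action() {
  local spend=$1
  local limit=${2:-10000}          # monthly limit from the Budget example

  # Percentage of the budget consumed, rounded down
  local pct=$(( spend * 100 / limit ))

  if (( pct >= 95 )); then
    echo "restrict"                # second threshold: restrict new spend
  elif (( pct >= 80 )); then
    echo "notify"                  # first threshold: alert only
  else
    echo "ok"
  fi
}
```

For example, `budget_action 8000` prints `notify` and `budget_action 9500` prints `restrict`.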