2026/6/1 16:35:55
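When operating a Clawdbot service, we often run into problems like these: the service suddenly slows down and nobody knows why, the disk fills up before anyone notices the logs have ballooned, or an API is failing and we only find out when users complain. Catching these problems before they escalate greatly improves service stability.
The Prometheus + Grafana combination works like a "health monitor" and a "smart alarm" for Clawdbot. It can:
This setup is especially well suited to AI services that need to run reliably 24/7. Let's build it from scratch.
Make sure your server meets the following requirements:
Use Docker Compose to deploy all the components quickly: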
mkdir -p ~/monitoring && cd ~/monitoring

cat > docker-compose.yml <<EOF
version: '3'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana
    depends_on:
      - prometheus
  node-exporter:
    image: prom/node-exporter
    ports:
      - "9100:9100"
volumes:
  grafana-storage:
EOF

Create the scrape target configuration file:
cat > prometheus.yml <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'clawdbot'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['your-clawdbot-ip:your-metrics-port']
EOF

Once the services are started (for example, with docker compose up -d), Node Exporter automatically collects:
The raw metric data can be viewed at http://SERVER_IP:9100/metrics.
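To get a feel for what that endpoint returns, here is a minimal sketch of a parser for the Prometheus text exposition format, using only the Python standard library. The function name parse_metrics and the sample text are illustrative assumptions, not output captured from a real server, and the parser deliberately ignores edge cases such as timestamps and escaped label values.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces: key="value".
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical sample in the same shape Node Exporter serves.
SAMPLE_TEXT = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_memory_MemAvailable_bytes 2.147483648e+09
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

In practice Prometheus does this parsing itself on every scrape; the sketch is only meant to make the raw format easier to read when debugging a target by hand.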
Clawdbot itself needs to expose monitoring metrics (using Python Flask as an example):
import time

from flask import Flask
from prometheus_client import start_http_server, Counter, Gauge

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter('clawdbot_requests_total', 'Total API requests')
ERROR_COUNT = Counter('clawdbot_errors_total', 'Total API errors')
PROCESSING_TIME = Gauge('clawdbot_processing_seconds', 'Request processing time')

@app.route('/api')
def handle_request():
    start_time = time.time()
    REQUEST_COUNT.inc()
    try:
        # Business logic goes here
        time.sleep(0.1)
    except Exception:
        ERROR_COUNT.inc()
        raise
    # Record how long the last successful request took
    PROCESSING_TIME.set(time.time() - start_time)
    return "OK"

# Expose the metrics on a separate port
start_http_server(8000)

For the Clawdbot service, we recommend focusing on:
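Open Grafana at http://SERVER_IP:3000 and add Prometheus as a data source with the URL http://prometheus:9090.
Use community templates to get set up quickly: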
Create a new dashboard and add the following panels:

Request traffic panel

rate(clawdbot_requests_total[1m])

Error rate panel

rate(clawdbot_errors_total[5m]) / rate(clawdbot_requests_total[5m])

Processing time panel

clawdbot_processing_seconds

Edit prometheus.yml and add:
rule_files:
  - 'alert.rules'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Create the alert rules file:
cat > alert.rules <<'EOF'
groups:
  - name: clawdbot-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(clawdbot_errors_total[5m]) / rate(clawdbot_requests_total[5m]) > 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value }}"
      - alert: ServiceDown
        expr: up{job="clawdbot"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
EOF

# docker-compose-ha.yml
services:
  prometheus:
    deploy:
      replicas: 2
    configs:
      - source: prometheus_config
        target: /etc/prometheus/prometheus.yml
  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

Q1: What if metrics are not being collected?
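Check whether the /metrics endpoint is reachable.
Q2: Grafana shows "No Data"
Q3: Alerts do not fire
Check whether the condition has held for the full "for" duration.

After this monitoring stack went live, the SLA of our Clawdbot service improved from 99.5% to 99.95%, and the mean time to detect a failure dropped from 15 minutes to under 30 seconds. Best of all, the ops team no longer gets woken up in the middle of the night to deal with sudden incidents.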
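To put those two availability figures in perspective, the following sketch converts an availability percentage into a downtime budget. The 30-day period and the function name allowed_downtime_minutes are assumptions chosen for illustration; the article does not say over what window the SLA is measured.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Return the downtime budget in minutes for an availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


# Compare the two availability levels mentioned above.
for target in (99.5, 99.95):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% -> {budget:.1f} minutes of downtime per 30 days")
```

Under these assumptions, 99.5% allows about 216 minutes of downtime in a 30-day period, while 99.95% allows about 21.6 minutes, a tenfold reduction.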