Service Announcements

Prometheus: Prometheus Troubleshooting

Published: 2026-04-25 12:01

Tags: Prometheus, troubleshooting

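I. Introduction

Anyone who has run Prometheus has hit this: the service is clearly running, yet no metrics get scraped and every query shows N/A. A full disk, exhausted memory, a port conflict, or a mistyped setting can cost a whole night of debugging. This article distills a troubleshooting approach from real production incidents and walks through the pitfalls that seasoned Prometheus operators have already fallen into.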

II. Procedure

Step 1: Check the Prometheus process status and listening port

ps aux | grep prometheus | grep -v grep

Expected output (abbreviated):

prometheus 1234 0.5 2.1 /opt/prometheus/prometheus --config.file=/etc/prometheus/prometheus.yml

If the process does not exist, it failed to start. Check the logs:

journalctl -u prometheus -n 50 --no-pager

Example output when startup fails:

level=error ts=2024-01-15T10:30:00.123Z caller=main.go:XXX msg="Failed to parse config" err="yaml: line 15: did not find expected key"

Here a syntax error in the configuration file caused the startup failure.
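When there are many log lines to sift through, it can help to split each one into fields. The following is a minimal sketch, assuming the logs use the key=value (logfmt) layout shown above; `parse_logfmt` is a hypothetical helper name, and `shlex` is used only as a convenient quote-aware splitter, so values with unusual quoting may need a dedicated logfmt parser.

```python
import shlex

def parse_logfmt(line: str) -> dict:
    """Split a logfmt line (key=value pairs, values optionally quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields

# The sample line is the one shown in Step 1.
sample = ('level=error ts=2024-01-15T10:30:00.123Z caller=main.go:XXX '
          'msg="Failed to parse config" '
          'err="yaml: line 15: did not find expected key"')

entry = parse_logfmt(sample)
if entry.get("level") == "error":
    print(f'{entry["msg"]}: {entry["err"]}')
```

Filtering on the level field first keeps the output focused on the lines that explain why startup failed.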

Step 2: Verify that the Prometheus web interface is reachable

curl -s http://localhost:9090/-/healthy

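Expected output (the exact wording varies by version; recent releases print "Prometheus Server is Healthy."):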

Prometheus is Healthy.

If the response is empty or the request times out, check whether the port is listening:

ss -tlnp | grep 9090

Expected output (CentOS/RHEL):

LISTEN 0 128 *:9090 *:* users:(("prometheus",pid=1234,fd=3))

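The output on Ubuntu is similar. Check whether the bind address is limited to 127.0.0.1; if it is, Prometheus is reachable only from the local host.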
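The bind-address check can be automated. This is a small sketch, assuming the row comes from `ss -tln` or `ss -tlnp`, where the fourth column is Local Address:Port; `loopback_only` is a hypothetical helper name, not part of any tool.

```python
def loopback_only(ss_row: str) -> bool:
    """Return True when the Local Address column of an `ss -tln` row is loopback."""
    columns = ss_row.split()
    local = columns[3]  # State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, ...
    host = local.rsplit(":", 1)[0].strip("[]")  # drop the port and IPv6 brackets
    return host.startswith("127.") or host in ("::1", "localhost")

# Row copied from the expected output in Step 2.
row = 'LISTEN 0 128 *:9090 *:* users:(("prometheus",pid=1234,fd=3))'
print(loopback_only(row))
```

A result of True means remote scrapers and browsers cannot connect, even though the local health check succeeds.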

Step 3: Check target status and scrape endpoints

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, endpoint: .scrapeUrl, health: .health, lastError: .lastError}'

Example output:

{
  "job": "node_exporter",
  "endpoint": "http://localhost:9100/metrics",
  "health": "down",
  "lastError": "server returned HTTP status 404"
}
{
  "job": "myapp",
  "endpoint": "http://localhost:8080/metrics",
  "health": "up",
  "lastError": ""
}

Query the metrics endpoint directly to verify:

curl -s http://localhost:9100/metrics | head -20

Expected output:

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67

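If the endpoint returns 404, the metrics path is misconfigured; compare the path the exporter actually serves with the metrics_path in the scrape configuration.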
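With many targets, listing only the unhealthy ones is easier to read than the full dump. The sketch below assumes the response of /api/v1/targets has already been decoded from JSON into a dict; `failing_targets` is a hypothetical helper, and the sample payload simply mirrors the two targets shown above.

```python
def failing_targets(payload: dict) -> list:
    """Return (job, scrape URL, last error) for every active target that is not up."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job", ""), t["scrapeUrl"], t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]

# Illustrative payload trimmed to the fields used in Step 3.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node_exporter"},
             "scrapeUrl": "http://localhost:9100/metrics",
             "health": "down",
             "lastError": "server returned HTTP status 404"},
            {"labels": {"job": "myapp"},
             "scrapeUrl": "http://localhost:8080/metrics",
             "health": "up",
             "lastError": ""},
        ]
    },
}

for job, url, err in failing_targets(sample):
    print(f"{job}: {url} -> {err}")
```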

Step 4: Check Prometheus's own scrape configuration

grep -A 20 "scrape_configs:" /etc/prometheus/prometheus.yml

Expected output (the path is the same on CentOS/RHEL and Ubuntu):

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

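Pay attention to scrape_interval and scrape_timeout: the timeout must not exceed the interval. Prometheus rejects a configuration that violates this when it loads the file.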
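The interval/timeout rule is easy to check before deploying a change. This is a rough sketch, assuming the scrape_configs list has already been loaded into Python dicts (for example with a YAML library) and that the global defaults are 1m and 10s; `to_millis` and `timeout_problems` are hypothetical helpers, the job "slow_app" is invented for illustration, and the fallback for jobs without their own timeout is a simplified reading of Prometheus's behavior.

```python
import re

_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000,
            "d": 86_400_000, "w": 604_800_000, "y": 31_536_000_000}
_TOKEN = re.compile(r"(\d+)(ms|[smhdwy])")

def to_millis(duration: str) -> int:
    """Convert a Prometheus-style duration such as '15s' or '1m30s' to milliseconds."""
    total, pos = 0, 0
    for match in _TOKEN.finditer(duration):
        if match.start() != pos:  # unparsed characters between tokens
            break
        total += int(match.group(1)) * _UNIT_MS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(duration):
        raise ValueError(f"invalid duration: {duration!r}")
    return total

def timeout_problems(jobs, global_interval="1m", global_timeout="10s"):
    """Return the names of jobs whose scrape_timeout exceeds their scrape_interval."""
    problems = []
    for job in jobs:
        interval = to_millis(job.get("scrape_interval", global_interval))
        if "scrape_timeout" in job:
            timeout = to_millis(job["scrape_timeout"])
        else:
            # Assumption: an unset job timeout falls back to the global one, capped at the interval.
            timeout = min(to_millis(global_timeout), interval)
        if timeout > interval:
            problems.append(job["job_name"])
    return problems

jobs = [
    {"job_name": "prometheus", "scrape_interval": "15s"},
    {"job_name": "slow_app", "scrape_interval": "15s", "scrape_timeout": "30s"},
]
print(timeout_problems(jobs))
```

For an authoritative verdict, promtool check config on the real file remains the better tool; the sketch only shows what the rule means.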

Step 5: Check disk I/O and storage capacity

df -h /data/prometheus # check the disk that holds the data directory

Expected output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   45G   55G  45% /data/prometheus

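Writes fail once the disk is full, so treat usage above 80% as a warning and free space or shorten retention before it gets there.

iostat -x 1 5 # check whether disk I/O is saturated

Prometheus writes are mostly sequential I/O, but if the request queue stays backed up and %util stays near 100%, the disk is not fast enough.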
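The capacity check can also be scripted for a cron job or a pre-flight test. This is a minimal sketch using only the standard library; the 80% warning level follows the text above, the 95% critical level is an arbitrary assumption, and the figure may differ slightly from df's Use% column because df excludes blocks reserved for root.

```python
import shutil

def used_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def classify(percent: float, warn: float = 80.0, critical: float = 95.0) -> str:
    """Map a usage percentage to a coarse status label."""
    if percent >= critical:
        return "critical"
    if percent >= warn:
        return "warning"
    return "ok"

# "." is used so the example runs anywhere; point it at the data directory in practice.
percent = used_percent(".")
print(f"{percent:.1f}% used -> {classify(percent)}")
```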

Step 6: Check memory usage and GC

ps -o pid,vsz,rss,pmem,comm -p $(pgrep prometheus)

Expected output:

  PID   VSZ  RSS %MEM COMMAND
 1234 512000 420000  2.1 prometheus

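If RSS approaches the system's physical memory, the OOM killer will terminate the process. The kernel message puts "Killed process" before the process name, so match in that order:

dmesg | grep -iE "kill(ed)? process.*prometheus" | tail -5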
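To pull the victims out of a saved dmesg dump, a regular expression is usually enough. The sketch below assumes the kernel wording "Killed process <pid> (<name>)", which varies somewhat between kernel versions; the sample lines are illustrative rather than copied from a real host, and `oom_victims` is a hypothetical helper.

```python
import re

_OOM = re.compile(r"Killed process (\d+) \(([^)]+)\)")

def oom_victims(dmesg_text: str) -> list:
    """Return (pid, process name) pairs for OOM kills found in dmesg output."""
    return [(int(pid), name) for pid, name in _OOM.findall(dmesg_text)]

# Illustrative input; verify the wording against your own kernel log.
sample = (
    "[12345.678] Out of memory: Killed process 1234 (prometheus) total-vm:512000kB\n"
    "[12350.001] eth0: link up\n"
)

for pid, name in oom_victims(sample):
    print(f"OOM killer terminated {name} (pid {pid})")
```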

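Check the startup flags, for example --storage.tsdb.retention.time. These are command-line flags rather than prometheus.yml settings, so grep the service definition instead of the configuration file (Prometheus has no storage.memory.bytes flag):

systemctl cat prometheus | grep -E "retention|wal"

Expected output:

  --storage.tsdb.retention.time=15d
  --storage.tsdb.wal-compression  # enable WAL compression to shrink the WAL on disk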

Step 7: Validate rule file syntax and alerting configuration

/opt/prometheus/promtool check rules /etc/prometheus/rules/*.yml

Expected output:

Checking rules/alerts.yml
  SUCCESS: 1 rules found

If a rule file contains a syntax error:

Checking rules/bad.yml
  FAILED:
  - group "BadGroup": 1: parse error: unexpected end of expression

Check that the PromQL expression used in an alerting rule is valid:

curl -G --data-urlencode 'query=rate(http_requests_total[5m])' http://localhost:9090/api/v1/query

The expected output contains "status":"success".
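The same check can be done from a script. The sketch below only builds the request URL and inspects an already-decoded response body, so it runs without a server; `query_url` and `query_ok` are hypothetical helper names, and the base URL is the one assumed throughout this article.

```python
from urllib.parse import urlencode

def query_url(base: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression safely URL-encoded."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def query_ok(body: dict) -> bool:
    """True when a decoded /api/v1/query response reports success."""
    return body.get("status") == "success"

url = query_url("http://localhost:9090", "rate(http_requests_total[5m])")
print(url)
```

Encoding the expression matters because parentheses, brackets, and braces in PromQL are otherwise mangled by the shell or by curl.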

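III. FAQ

Q: The Prometheus page opens, but queries return nothing and show "No data points". What is going on?

A: Most likely scrape_interval is too long and the query window is shorter than the gap between samples. With scrape_interval at 1m, for example, a range selector such as [30s] will often contain no sample at all. Another common cause is a misspelled metric name (the __name__ label): writing rate(http_request_total[5m]) instead of rate(http_requests_total[5m]), mixing up singular and plural. Run curl http://localhost:9090/api/v1/label/__name__/values to see which metric names actually exist.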
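Once the list of real metric names is in hand, a fuzzy match can point at the intended one. This is a small sketch using the standard library's difflib; `suggest_metric` is a hypothetical helper, and the `known` list stands in for the names returned by the label values endpoint.

```python
import difflib

def suggest_metric(name: str, known: list, limit: int = 3) -> list:
    """Suggest existing metric names that closely resemble a possibly misspelled one."""
    return difflib.get_close_matches(name, known, n=limit, cutoff=0.8)

# Stand-in for the names returned by /api/v1/label/__name__/values.
known = ["http_requests_total", "node_cpu_seconds_total", "up"]
print(suggest_metric("http_request_total", known))
```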

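Q: The service starts normally and targets show up, but metric data disappears after only a few hours, well before the retention period ends. Why?

A: First run curl 'http://localhost:9090/api/v1/query?query=up' and look at the sample timestamps; if every sample is stale, suspect a remote-write problem. If the timestamps are current but older data has been cut off, check the --storage.tsdb.retention.time and --storage.tsdb.retention.size flags. The default time-based retention is 15 days on both Ubuntu and CentOS, size-based retention is off unless set, and whichever limit is reached first triggers deletion. Also confirm that nobody deleted block files by hand; they usually live under /data/prometheus/wal and the /data/prometheus/01XXXX directories.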

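Q: The Prometheus process keeps getting OOM-killed with nothing obviously wrong in the logs. How do I investigate?

A: First check dmesg for a record of Prometheus being killed. Memory blowups usually have one of two causes: a cardinality explosion, where one metric has so many label combinations that memory balloons; or remote-write backpressure, where a failing remote_write endpoint lets the send buffers pile up. Run this command to see the series count per metric:

curl -sG --data-urlencode 'query=count by (__name__) ({__name__=~".+"})' http://localhost:9090/api/v1/query | jq '.data.result | sort_by(.value[1] | tonumber) | reverse | .[:10]'

Find the metrics with the highest cardinality and consider dropping useless series or labels with metric_relabel_configs (relabel_configs only rewrites target labels before the scrape), or lower max_samples_per_send under the remote_write queue_config to ease backpressure.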
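Where jq is not available, the same ranking can be done in a few lines of Python. The sketch assumes the `data.result` array of the query response has been decoded from JSON; `top_series` is a hypothetical helper, and the metric names and counts in the sample are invented for illustration.

```python
def top_series(result: list, limit: int = 10) -> list:
    """Rank metrics by series count, highest first, from an instant-query result."""
    rows = [
        (item["metric"].get("__name__", ""), int(float(item["value"][1])))
        for item in result  # each value is [timestamp, "count as string"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]

# Invented sample shaped like the data.result array of /api/v1/query.
sample = [
    {"metric": {"__name__": "up"}, "value": [1700000000, "12"]},
    {"metric": {"__name__": "http_requests_total"}, "value": [1700000000, "48210"]},
    {"metric": {"__name__": "node_cpu_seconds_total"}, "value": [1700000000, "256"]},
]

for name, count in top_series(sample, limit=2):
    print(f"{name}: {count} series")
```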

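Q: After editing prometheus.yml, the reload does not take effect. Is a restart required?

A: No restart is needed. Send SIGHUP: kill -HUP $(pgrep prometheus). Note that only the contents of prometheus.yml are hot-reloaded; command-line flags such as the storage.tsdb.* options take effect only after a restart. The HTTP alternative, POST /-/reload, works only when Prometheus was started with --web.enable-lifecycle. Changes to targets also do not show up immediately; wait for the next scrape cycle (about 15 seconds).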

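IV. Summary

Prometheus troubleshooting comes down to three steps: first confirm that the process is alive, the port is open, and the endpoints are reachable; then check the scrape path for errors and look for targets that are down; finally look for storage and memory bottlenecks. About 80% of failures fall into one of three traps: a configuration mistake, a dead endpoint, or insufficient resources.

Further reading: the official HTTP API documentation https://prometheus.io/docs/prometheus/latest/querying/api/, TSDB storage internals https://gank.me/post/2023/prometheus-tsdb-internals, and hands-on handling of high-cardinality metrics.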
