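Varnish Automation - 星耀云
Published: 2026-04-27 14:01
I. Introduction
Anyone who has run Varnish knows the worst parts of cache management: getting paged in the middle of the night, being caught flat-footed by a cache stampede, and not knowing how to roll out a reload across hosts after editing the config. Typing commands by hand one at a time is not realistic. This runs in production, and every day without automation adds risk. Below are a few battle-tested scripts I have accumulated over the years, each shaped by mistakes made the hard way.
II. Steps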
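Step 1: Write a cache purge script (targeted purging)
#!/bin/bash
# varnish_cache_purge.sh - purge the cache for one exact URL
# Usage: ./varnish_cache_purge.sh /api/v1/users
VARNISH_HOST="127.0.0.1"
VARNISH_PORT="6081"
if [ -z "$1" ]; then
    echo "Usage: $0 <url_path>"
    exit 1
fi
TARGET_URL="$1"
# curl exits 0 even when Varnish rejects the request (for example with 405),
# so check the HTTP status. This assumes the VCL answers a successful purge with 200.
HTTP_CODE=$(curl -s -o /dev/null -w '%{http_code}' -X PURGE -H "Host: example.com" "http://${VARNISH_HOST}:${VARNISH_PORT}${TARGET_URL}")
if [ "$HTTP_CODE" = "200" ]; then
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Purge success: ${TARGET_URL}"
else
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] Purge failed (HTTP ${HTTP_CODE}): ${TARGET_URL}"
    exit 1
fi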
Expected output:
[2026-03-12 03:15:22] Purge success: /api/v1/users
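Since a purge aimed at the wrong path quietly evicts good objects, it can help to validate the argument before sending anything. The following is a minimal sketch, assuming that valid paths start with `/` and contain no whitespace or `*`; `is_safe_purge_path` is a hypothetical helper name, not part of Varnish.

```shell
#!/bin/bash
# is_safe_purge_path is a hypothetical helper (not a Varnish tool).
# It accepts only absolute paths without whitespace or '*' wildcards.
is_safe_purge_path() {
    local path="$1"
    case "$path" in
        /*) ;;                 # must start with a slash
        *)  return 1 ;;
    esac
    case "$path" in
        *[[:space:]]*|*'*'*) return 1 ;;   # reject whitespace and wildcards
    esac
    return 0
}

# Example: filter candidate paths before handing them to the purge script
for p in "/api/v1/users" "api/v1/users" "/api/v2/*"; do
    if is_safe_purge_path "$p"; then
        echo "ok: $p"
    else
        echo "rejected: $p"
    fi
done
```

Only the first path passes; the other two are rejected because one lacks the leading slash and the other contains a wildcard that PURGE would not expand anyway.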
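Step 2: Batch-purge a family of similar URLs
#!/bin/bash
# varnish_batch_purge.sh - purge a batch of similar URLs
# Goal: clear everything under /api/v2/*
# PURGE targets one exact URL and does not expand wildcards,
# so this loops over resource IDs 1..100 instead of applying a real regex
VARNISH_HOST="127.0.0.1"
VARNISH_PORT="6081"
PATTERN="/api/v2/*"
for i in {1..100}; do
    curl -s -o /dev/null -X PURGE -H "Host: example.com" \
        "http://${VARNISH_HOST}:${VARNISH_PORT}/api/v2/resource/${i}" &
done
wait
echo "Batch purge completed for pattern: ${PATTERN}"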
Expected output:
Batch purge completed for pattern: /api/v2/*
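When the goal really is "everything matching a pattern", Varnish's `ban` CLI command accepts a regular expression on `req.url`, which avoids enumerating URLs. A minimal sketch, assuming CLI access through `varnishadm`; the `DRY_RUN` switch and the helper names `build_ban_expr` and `ban_pattern` are hypothetical, and the host `example.com` is taken from the scripts above.

```shell
#!/bin/bash
# Sketch: invalidate every object under a URL prefix with one ban expression.
# DRY_RUN and the helper names are assumptions for illustration.
DRY_RUN="${DRY_RUN:-1}"

# Build a ban expression limited to one host and one URL regex.
build_ban_expr() {
    local host="$1" url_regex="$2"
    printf 'req.http.host == %s && req.url ~ %s' "$host" "$url_regex"
}

ban_pattern() {
    local expr
    expr=$(build_ban_expr "$1" "$2")
    if [ "$DRY_RUN" = "1" ]; then
        # Default: only show what would be sent to the management CLI
        echo "would run: varnishadm ban ${expr}"
    else
        varnishadm "ban ${expr}"
    fi
}

ban_pattern "example.com" "^/api/v2/"
```

Run it with `DRY_RUN=0` only after checking the printed expression; a regex that is too broad invalidates far more than intended.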
Step 3: Hot-reload script for the configuration (zero-downtime updates)
#!/bin/bash
# varnish_reload.sh - reload the VCL without interrupting traffic
# Works on CentOS/RHEL and Ubuntu
VARNISHD_BIN="/usr/sbin/varnishd"
VARNISH_RELOAD="/usr/share/varnish/varnish-reload-vcl"
CONFIG_FILE="/etc/varnish/default.vcl"
echo "[$(date)] Starting Varnish config reload..."
# Back up the current configuration
cp "${CONFIG_FILE}" "${CONFIG_FILE}.bak.$(date +%Y%m%d%H%M%S)"
# Syntax check: compile the VCL without loading it
if ! "${VARNISHD_BIN}" -C -f "${CONFIG_FILE}" > /dev/null 2>&1; then
    echo "[ERROR] VCL syntax check failed!"
    exit 1
fi
# Ubuntu/Debian: use varnish-reload-vcl when it is installed
if [ -x "${VARNISH_RELOAD}" ]; then
    "${VARNISH_RELOAD}" "${CONFIG_FILE}"
# CentOS/RHEL: use systemctl reload
elif [ -f /etc/redhat-release ]; then
    systemctl reload varnish
else
    echo "[ERROR] No supported reload method found"
    exit 1
fi
sleep 2
# Verify that the CLI responds and reports an active VCL
if varnishadm vcl.list | grep -q "^active"; then
    echo "[SUCCESS] Varnish config reloaded at $(date)"
else
    echo "[ERROR] Reload verification failed"
    exit 1
fi
Expected output:
[Thu Mar 12 04:30:15 CST 2026] Starting Varnish config reload...
[SUCCESS] Varnish config reloaded at Thu Mar 12 04:30:17 CST 2026
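The reload script relies on distribution-specific helpers. On a host where neither is available, the same effect can be sketched directly with the management CLI commands `vcl.load` and `vcl.use`. This is a minimal sketch, not a replacement for the packaged tooling; `new_vcl_name` and `reload_vcl` are hypothetical helper names.

```shell
#!/bin/bash
# Sketch: reload VCL through the management CLI instead of a distro helper.
# new_vcl_name and reload_vcl are hypothetical helper names.

# Produce a unique configuration name such as reload_20260312_043015.
new_vcl_name() {
    date '+reload_%Y%m%d_%H%M%S'
}

reload_vcl() {
    local file="$1" name
    name=$(new_vcl_name)
    # vcl.load compiles the file; if compilation fails the running VCL stays active.
    varnishadm vcl.load "$name" "$file" || return 1
    # vcl.use switches new requests to the freshly loaded configuration.
    varnishadm vcl.use "$name" || return 1
    echo "$name"
}

# Usage (needs a running varnishd and CLI access):
#   reload_vcl /etc/varnish/default.vcl
```

Because the previous configuration stays loaded, rolling back is a matter of running `vcl.use` with the old name.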
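Step 4: Cache hit-rate monitoring script (with alerting)
#!/bin/bash
# varnish_stats_monitor.sh - monitor the cache hit rate
# Alerts when the rate drops below 85%
ALERT_THRESHOLD=85
WEBHOOK_URL="YOUR_WEBHOOK_URL"
STATS=$(varnishstat -1 -j 2>/dev/null)
# The JSON layout differs between Varnish versions: counters may sit under
# "counters" or at the top level, with or without the "MAIN." prefix
HIT=$(echo "$STATS" | jq -r '(.counters // .) | (.["MAIN.cache_hit"] // .cache_hit).value')
MISS=$(echo "$STATS" | jq -r '(.counters // .) | (.["MAIN.cache_miss"] // .cache_miss).value')
TOTAL=$((HIT + MISS))
if [ "$TOTAL" -eq 0 ]; then
    echo "No cache lookups recorded yet"
    exit 0
fi
# Multiply before dividing: with scale=2, HIT/TOTAL is truncated to two decimals
# first, so the result would always end in .00
HITRATE=$(echo "scale=2; ${HIT}*100/${TOTAL}" | bc)
echo "Cache Hit Rate: ${HITRATE}%"
CHECK=$(echo "${HITRATE} < ${ALERT_THRESHOLD}" | bc)
if [ "$CHECK" -eq 1 ]; then
    MESSAGE="[ALERT] Varnish hit rate low: ${HITRATE}% (threshold: ${ALERT_THRESHOLD}%)"
    echo "$MESSAGE"
    curl -s -X POST "${WEBHOOK_URL}" \
        -H "Content-Type: application/json" \
        -d "{\"text\": \"${MESSAGE}\"}"
fi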
Expected output (healthy):
Cache Hit Rate: 92.35%
Expected output (hit rate below the threshold):
Cache Hit Rate: 72.15%
[ALERT] Varnish hit rate low: 72.15% (threshold: 85%)
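One caveat with this kind of check: the hit and miss counters are cumulative since varnishd started, so a ratio of the raw values reflects the whole uptime and reacts slowly to a sudden drop. A minimal sketch of computing the rate over a single sampling interval from two snapshots follows; `interval_hit_rate` is a hypothetical helper, and how the previous snapshot is stored between runs is left open.

```shell
#!/bin/bash
# Sketch: hit rate over one sampling interval, from two counter snapshots.
# interval_hit_rate is a hypothetical helper; arguments are
# previous hits, previous misses, current hits, current misses.
interval_hit_rate() {
    awk -v ph="$1" -v pm="$2" -v ch="$3" -v cm="$4" 'BEGIN {
        hits = ch - ph
        misses = cm - pm
        total = hits + misses
        # No traffic, or counters went backwards after a restart: no usable rate
        if (hits < 0 || misses < 0 || total == 0) { print "n/a"; exit }
        printf "%.2f\n", hits * 100 / total
    }'
}

# 900 new hits and 100 new misses since the last sample
interval_hit_rate 1000 100 1900 200   # prints 90.00
```

A caller would skip the alert when the result is `n/a` rather than treating it as a low hit rate.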
Step 5: Backend health check script (tied to health probes)
#!/bin/bash
# varnish_backend_health.sh - report sick backends and check critical paths
# ⚠️ Warning: as written this script only reports. If you extend it to change
# backend state (for example with backend.set_health), validate it in a test environment first
VARNISHADM="/usr/bin/varnishadm"
CRITICAL_PATHS="/api/payment /api/auth"
echo "[$(date)] Checking backend health..."
BACKEND_STATUS=$(${VARNISHADM} backend.list -p 2>/dev/null)
echo "$BACKEND_STATUS"
# Count unhealthy backends
UNHEALTHY=$(echo "$BACKEND_STATUS" | grep -c "sick")
if [ "$UNHEALTHY" -gt 0 ]; then
    echo "[WARNING] Found ${UNHEALTHY} unhealthy backend(s)"
    # List the failing backends
    echo "$BACKEND_STATUS" | grep "sick" | while read -r line; do
        echo "  -> $line"
    done
    # No forced recovery here: in production, have a person decide
    # instead of marking a backend healthy automatically
    echo "[INFO] Manual intervention recommended for backend recovery"
fi
# Check that the critical paths are reachable
for path in $CRITICAL_PATHS; do
    RESPONSE=$(curl -s -o /dev/null -w "%{http_code}" "http://127.0.0.1:6081${path}")
    if [ "$RESPONSE" -ne 200 ] && [ "$RESPONSE" -ne 301 ] && [ "$RESPONSE" -ne 302 ]; then
        echo "[CRITICAL] Critical path ${path} returned ${RESPONSE}"
    fi
done
Expected output:
[Thu Mar 12 05:00:10 CST 2026] Checking backend health...
Backend name Probes Healthy Last change
backend1 5/5 yes Thu Mar 12 04:45:00 2026
backend2 0/5 sick Thu Mar 12 04:50:15 2026
[WARNING] Found 1 unhealthy backend(s)
  -> backend2 0/5 sick Thu Mar 12 04:50:15 2026
[INFO] Manual intervention recommended for backend recovery
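For alerting it is often more useful to have just the names of the failing backends than the whole status line. A minimal sketch, assuming the backend name is the first column and the first line is a header, as in the sample output above (the exact `backend.list` layout varies between Varnish versions); `sick_backends` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: print the names of backends whose status line contains the word "sick".
# sick_backends is a hypothetical helper; it reads backend.list output on stdin
# and assumes the backend name is the first column and line 1 is a header.
sick_backends() {
    awk 'NR > 1 && /(^|[[:space:]])sick([[:space:]]|$)/ { print $1 }'
}

# Demo with captured sample text instead of a live varnishadm call
sample='Backend name Probes Healthy Last change
backend1 5/5 yes Thu Mar 12 04:45:00 2026
backend2 0/5 sick Thu Mar 12 04:50:15 2026'
printf '%s\n' "$sample" | sick_backends   # prints backend2
```

Matching `sick` as a whole word keeps a backend that merely has "sick" inside its name from being reported.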
Step 6: Log analysis and automatic ban script
#!/bin/bash
# varnish_log_analyze.sh - analyze the log and flag abusive client IPs
# ⚠️ Warning: add a confirmation delay before acting on any IP automatically
VARNISHNCSA="/usr/bin/varnishncsa"
BAN_THRESHOLD=100    # flag an IP above this many requests in the sampled window
BAN_DURATION=3600    # intended block duration (seconds)
echo "[$(date)] Analyzing Varnish access patterns..."
# Count requests per client IP. -d reads the records currently held in the
# shared-memory log and exits, so the window is approximate, not an exact 5 minutes.
SUSPICIOUS_IPS=$(${VARNISHNCSA} -d -F '%h' 2>/dev/null |
    sort | uniq -c |
    sort -rn |
    head -10 |
    awk -v threshold="$BAN_THRESHOLD" '$1 > threshold {print $2}')
if [ -n "$SUSPICIOUS_IPS" ]; then
    echo "[WARNING] Detected suspicious IPs:"
    echo "$SUSPICIOUS_IPS"
    # Note: "varnishadm ban" invalidates cached objects; it does not block clients.
    # Blocking an IP needs a VCL acl or a firewall rule, ideally after manual review.
    # echo "$SUSPICIOUS_IPS" | while read -r ip; do
    #     echo "[BLOCK CANDIDATE] IP: ${ip} for ${BAN_DURATION}s"
    # done
    echo "[INFO] Dry-run mode: ban commands not executed"
else
    echo "[OK] No suspicious activity detected"
fi
Expected output:
[Thu Mar 12 06:15:33 CST 2026] Analyzing Varnish access patterns...
[WARNING] Detected suspicious IPs:
1.2.3.4
5.6.7.8
[INFO] Dry-run mode: ban commands not executed
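The counting step can be isolated so that it is testable without a live Varnish instance. A minimal sketch, assuming the input is one client IP per line (for instance from `varnishncsa -F '%h'`); `ips_over_threshold` is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: read one client IP per line on stdin and print those seen more than
# "threshold" times, busiest first. ips_over_threshold is a hypothetical helper.
ips_over_threshold() {
    local threshold="$1"
    awk -v t="$threshold" '
        NF { count[$1]++ }                       # skip blank lines
        END { for (ip in count) if (count[ip] > t) print count[ip], ip }
    ' | sort -rn | awk '{ print $2 }'
}

# Demo with synthetic input: 1.2.3.4 appears three times, 5.6.7.8 once
printf '1.2.3.4\n5.6.7.8\n1.2.3.4\n1.2.3.4\n' | ips_over_threshold 2   # prints 1.2.3.4
```

Unlike a pipeline that truncates to the top ten before filtering, this applies the threshold to every address seen.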
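III. FAQ
Q1: varnishadm fails with "Connection refused". How do I fix it?
A refused connection is a TCP-level error, so first confirm that varnishd actually opened a management port; then confirm the secret file is readable, since that is usually the next thing to fail. Check:
# CentOS/RHEL
ls -l /etc/varnish/secret
ps -o args= -C varnishd | grep -- '-T'
# Ubuntu
sudo varnishadm -S /etc/varnish/secret -T 127.0.0.1:6082
If it still won't connect, check whether varnishd was started without the -T option. In production, bind the management port to localhost and never expose it to the public internet.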
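To check the -T setting from a script, the value can be picked out of the varnishd argument list. A minimal sketch, assuming `-T` and its address are separate arguments (the attached form `-Taddr:port` is not handled); `mgmt_addr_from_args` is a hypothetical helper and the command line in the demo is made up.

```shell
#!/bin/bash
# Sketch: pull the -T (management address) argument out of a varnishd command line.
# mgmt_addr_from_args is a hypothetical helper; it returns non-zero if -T is absent.
mgmt_addr_from_args() {
    local prev="" arg
    for arg in "$@"; do
        if [ "$prev" = "-T" ]; then
            echo "$arg"
            return 0
        fi
        prev="$arg"
    done
    return 1
}

# Demo with a made-up command line; on a live Linux host the arguments could come from
#   ps -o args= -C varnishd
mgmt_addr_from_args /usr/sbin/varnishd -a :6081 -T 127.0.0.1:6082 -S /etc/varnish/secret
```

The demo prints `127.0.0.1:6082`; an empty result on a real host suggests varnishd was started without an explicit management address.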
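Q2: The reload script ran but the new configuration didn't take effect. Now what?
Don't panic; most likely the wrong VCL is active. Verify with:
varnishadm vcl.list
varnishadm vcl.show boot
If vcl.list shows several VCL versions, check which one is marked active and switch to the one you want with varnishadm vcl.use <name> (boot is the configuration loaded at startup). Also check whether you edited the wrong file: some people change /etc/varnish/default.vcl while varnishd is actually started with -f pointing at a different path.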
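Finding the active configuration can be scripted. A minimal sketch, assuming that each `vcl.list` line starts with its status and ends with the configuration name, which holds for plain (unlabeled) VCLs but should be checked against your Varnish version; `active_vcl` is a hypothetical helper and the sample text is illustrative.

```shell
#!/bin/bash
# Sketch: print the name of the VCL that vcl.list marks as active.
# active_vcl is a hypothetical helper; it reads vcl.list output on stdin and
# assumes the status is the first field and the name is the last field.
active_vcl() {
    awk '$1 == "active" { print $NF }'
}

# Demo with illustrative sample text instead of a live varnishadm call
sample='available   auto/warm          0 boot
active      auto/warm          0 reload_20260312_043015'
printf '%s\n' "$sample" | active_vcl   # prints reload_20260312_043015
```

On a live host the input would come from `varnishadm vcl.list`.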
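Q3: Why does the purge script have no effect on CDN nodes?
You are mixing up edge nodes and the origin. A PURGE request only clears the cache of the varnishd instance that receives it; nodes do not share invalidations. Options:
- Purge on the origin Varnish, then refresh each edge node in turn
- Or use the batch-refresh API your commercial CDN provides
- As a crude fallback, temporarily change the Cache-Control header to force expiry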
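For self-managed Varnish nodes, the first option amounts to sending the same PURGE to every instance. A minimal sketch, assuming a space-separated list of `host:port` entries; the node addresses, the `NODES` and `DRY_RUN` variables, and `purge_on_nodes` are all hypothetical, and `example.com` is the host used in the scripts above.

```shell
#!/bin/bash
# Sketch: send the same PURGE to every Varnish node in a list.
# NODES, DRY_RUN and purge_on_nodes are assumptions for illustration.
NODES="${NODES:-10.0.0.11:6081 10.0.0.12:6081}"
DRY_RUN="${DRY_RUN:-1}"

purge_on_nodes() {
    local path="$1" node code
    for node in $NODES; do
        if [ "$DRY_RUN" = "1" ]; then
            # Default: only show which requests would be sent
            echo "would purge http://${node}${path}"
            continue
        fi
        code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
            -X PURGE -H "Host: example.com" "http://${node}${path}")
        echo "${node} -> HTTP ${code}"
    done
}

purge_on_nodes "/api/v1/users"
```

Reporting the status per node makes it obvious when one instance was unreachable and still holds the stale object.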
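Q4: What if the bc command used by the monitoring script is not installed?
Install it: on CentOS/RHEL run yum install bc -y, on Ubuntu run apt install bc -y. Alternatively, compute the percentage with awk and skip bc entirely: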
HITRATE=$(awk "BEGIN {printf \"%.2f\", ${HIT}/(${HIT}+${MISS})*100}")
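IV. Summary
There are three core points. First, cache purges must be precise: it is better to purge too little than to purge the wrong thing. Second, always syntax-check the configuration before reloading it. Third, set sensible alert thresholds so that not every small fluctuation pages someone. A finished automation script is not something to write and forget; review its logs once a week for abnormal behavior.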