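Managing 100 Servers in Bulk with Ansible
Published: 2026-04-20 20:56
Anyone who has done operations work knows the worst part: logging in to servers one by one to change configuration, working from 2 a.m. to 6 a.m., and still missing a few hosts. That is how production incidents happen. This article shows how to manage 100 servers in bulk with Ansible, with no gimmicks, just hands-on practice.
1. Introduction
When you have 50 or 100 servers that need the same configuration change, patch, or service deployment, SSHing into each one by hand is asking for trouble. It is slow and error-prone. This article walks through building a bulk-management setup with Ansible, from installation and configuration to running real Playbooks, covering the two mainstream distribution families, CentOS and Ubuntu.
2. Steps
Step 1: Check the Python environment and install Ansible
Check the environment on the control node first. Ansible is written in Python: the ansible-core 2.14 release shown below needs Python 3.9 or later on the control node, while managed nodes can run Python 2.7 or 3.5 and later.
# Check the Python version (CentOS/RHEL)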
$ python3 --version
Python 3.9.16
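# On Ubuntu, install Python and pip with apt
$ sudo apt update && sudo apt install -y python3 python3-pip
# Install Ansible with pip (recommended, since the version is under your control); the ansible package pulls in ansible-core as a dependency
$ pip3 install ansible
# Verify the installation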
$ ansible --version
ansible [core 2.14.1]
config file = /etc/ansible/ansible.cfg
configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python3.9/site-packages/ansible
python version = 3.9.1
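Step 2: Set up passwordless SSH login (the key step)
Bulk management requires that the Ansible control node can log in to every managed host without a password. Generate a key pair, then distribute the public key to all servers.
# Generate a key pair; press Enter at each prompt to accept the defaults
$ ssh-keygen -t ed25519 -C "ansible@control-node" -f ~/.ssh/ansible_key
# Copy the public key to one host first as a test (the first login asks for the password)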
$ ssh-copy-id -i ~/.ssh/ansible_key.pub root@192.168.1.101
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: /root/.ssh/ansible_key.pub
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s)
root@192.168.1.101's password:
Number of key(s) added: 1
# Verify that passwordless login works
$ ssh -i ~/.ssh/ansible_key root@192.168.1.101 "hostname -I"
192.168.1.101
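The step above copies the key to a single host. A minimal sketch of distributing it to the whole fleet could look like the following. The hosts.txt file and the DRY_RUN switch are assumptions made for this example, not part of the original setup. With DRY_RUN=1 (the default here) the script only prints the commands it would run.

```shell
# Sketch: push the public key to every host listed in a plain-text file.
# hosts.txt (one address per line) and DRY_RUN are hypothetical names for this example.
set -e

KEY_PUB="${KEY_PUB:-$HOME/.ssh/ansible_key.pub}"
HOSTS_FILE="${HOSTS_FILE:-hosts.txt}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# Sample host list using the addresses from this article
cat > "$HOSTS_FILE" << 'EOF'
# web servers
192.168.1.101
192.168.1.102
192.168.1.103
# databases
192.168.1.111
192.168.1.112
EOF

distribute_key() {
  # Read the list on file descriptor 3 so ssh cannot swallow the rest of it via stdin
  while IFS= read -r host <&3; do
    # Skip blank lines and comments
    case "$host" in ''|'#'*) continue ;; esac
    if [ "$DRY_RUN" = "1" ]; then
      echo "ssh-copy-id -i $KEY_PUB root@$host"
    else
      ssh-copy-id -i "$KEY_PUB" "root@$host"
    fi
  done 3< "$HOSTS_FILE"
}

distribute_key
```

Run it once with DRY_RUN=1 to review the commands, then with DRY_RUN=0 to copy the key for real. Each host still asks for its password the first time.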
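Step 3: Configure the Ansible inventory
List every server you want to manage in an inventory file, and group the hosts so you can target them in bulk.
# Create the project directory
$ mkdir -p /opt/ansible-projects/web-cluster && cd /opt/ansible-projects/web-cluster
# Edit the inventory file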
$ cat > inventory.ini << 'EOF'
[webservers]
192.168.1.101 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
192.168.1.102 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
192.168.1.103 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
[databases]
192.168.1.111 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
192.168.1.112 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
[all:vars]
ansible_port=22
ansible_python_interpreter=/usr/bin/python3
EOF
# Test that Ansible can reach every host (the ping module checks SSH login and Python, not ICMP)
$ ansible all -i inventory.ini -m ping
192.168.1.101 | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python3"
},
"changed": false,
"ping": "pong"
}
192.168.1.102 | SUCCESS => {
"changed": false,
"ping": "pong"
}
192.168.1.111 | SUCCESS => {
"changed": false,
"ping": "pong"
}
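With 100 hosts, repeating the same connection variables on every line gets tedious. One possible way to shorten the inventory is to use host ranges and move the shared variables into [all:vars]. The file name inventory-compact.ini is an assumption for this sketch, and the hosts are the same ones listed above.

```shell
# Sketch: the same hosts as above, written with ranges and shared variables.
# inventory-compact.ini is a hypothetical file name for this example.
cat > inventory-compact.ini << 'EOF'
[webservers]
192.168.1.[101:103]

[databases]
192.168.1.[111:112]

[all:vars]
ansible_user=root
ansible_ssh_private_key_file=~/.ssh/ansible_key
ansible_port=22
ansible_python_interpreter=/usr/bin/python3
EOF
```

To check how the ranges expand before using the file, run ansible-inventory -i inventory-compact.ini --graph on the control node.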
Step 4: Configure project-wide settings in ansible.cfg
# Create ansible.cfg with tuned settings
$ cat > ansible.cfg << 'EOF'
[defaults]
inventory = inventory.ini
host_key_checking = False
timeout = 30
gather_facts = True
[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False
[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
EOF
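# Verify that the setting is picked up (run from the project directory; the parentheses show where the value comes from)
$ ansible-config dump | grep HOST_KEY_CHECKING
HOST_KEY_CHECKING(/opt/ansible-projects/web-cluster/ansible.cfg) = False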
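Ansible reads only one configuration file, which is why the working directory matters here. The function below is a simplified mirror of the documented lookup order (the ANSIBLE_CONFIG environment variable, then ./ansible.cfg, then ~/.ansible.cfg, then /etc/ansible/ansible.cfg). The name which_ansible_cfg is made up for this sketch. Ansible itself applies extra rules, such as ignoring an ansible.cfg in a world-writable directory, so treat the output as a hint and confirm with ansible --version.

```shell
# Sketch: mimic Ansible's config file lookup order (simplified).
# which_ansible_cfg is a hypothetical helper, not an Ansible command.
which_ansible_cfg() {
  if [ -n "${ANSIBLE_CONFIG:-}" ]; then
    echo "$ANSIBLE_CONFIG"            # 1. environment variable wins
  elif [ -f ./ansible.cfg ]; then
    echo "$PWD/ansible.cfg"           # 2. file in the current directory
  elif [ -f "$HOME/.ansible.cfg" ]; then
    echo "$HOME/.ansible.cfg"         # 3. per-user file
  elif [ -f /etc/ansible/ansible.cfg ]; then
    echo /etc/ansible/ansible.cfg     # 4. system-wide file
  else
    echo "(no config file found)"
  fi
}

which_ansible_cfg
```

If this prints /etc/ansible/ansible.cfg while you expected the project file, you are probably not inside /opt/ansible-projects/web-cluster.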
Step 5: Write your first Playbook: install Nginx in bulk
Write a Playbook that does something useful: install Nginx on every machine in the webservers group and start it.
$ cat > install-nginx.yml << 'EOF'
---
- name: Install and configure Nginx on web servers
  hosts: webservers
  become: true
  tasks:
    - name: Install Nginx (CentOS/RHEL)
      ansible.builtin.yum:
        name: nginx
        state: present
      when: ansible_os_family == "RedHat"
    - name: Install Nginx (Ubuntu/Debian)
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: yes
      when: ansible_os_family == "Debian"
    - name: Start and enable Nginx service
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes
    - name: Check Nginx status
      ansible.builtin.shell: systemctl status nginx | head -5
      register: nginx_status
      changed_when: false
    - name: Display Nginx status
      ansible.builtin.debug:
        var: nginx_status.stdout_lines
EOF
# Run the Playbook
$ ansible-playbook install-nginx.yml -i inventory.ini
PLAY [Install and configure Nginx on web servers] *************************************
TASK [Gathering Facts] *****************************************************************
ok: [192.168.1.101]
ok: [192.168.1.102]
ok: [192.168.1.103]
TASK [Install Nginx (CentOS/RHEL)] *****************************************************
changed: [192.168.1.101]
changed: [192.168.1.102]
changed: [192.168.1.103]
TASK [Start and enable Nginx service] *************************************************
ok: [192.168.1.101]
ok: [192.168.1.102]
ok: [192.168.1.103]
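TASK [Check Nginx status] *************************************************************
ok: [192.168.1.101]
ok: [192.168.1.102]
ok: [192.168.1.103]
TASK [Display Nginx status] ***********************************************************
ok: [192.168.1.101] => {
    "nginx_status.stdout_lines": [
        "● nginx.service - The nginx HTTP and reverse proxy server",
        "Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; vendor preset: disabled)",
        "Active: active (running)"
    ]
}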
PLAY RECAP *****************************************************************************
192.168.1.101 : ok=5 changed=1 unreachable=0 failed=0
192.168.1.102 : ok=5 changed=1 unreachable=0 failed=0
192.168.1.103 : ok=5 changed=1 unreachable=0 failed=0
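The two when-guarded install tasks can also be collapsed into one. The sketch below uses the generic ansible.builtin.package module, which hands the request to whatever package manager the target uses. The file name install-nginx-generic.yml is an assumption for this example. Two caveats: the generic module does not refresh the apt cache the way update_cache does, and on some CentOS releases nginx only comes from an extra repository such as EPEL, so this is a starting point and not a drop-in replacement.

```shell
# Sketch: one install task for both distribution families.
# install-nginx-generic.yml is a hypothetical file name for this example.
cat > install-nginx-generic.yml << 'EOF'
---
- name: Install Nginx with the generic package module
  hosts: webservers
  become: true
  tasks:
    # package delegates to yum/dnf or apt depending on the target host
    - name: Install Nginx on any supported distribution
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Start and enable Nginx service
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
EOF
```

Before running it against real hosts, ansible-playbook install-nginx-generic.yml -i inventory.ini --syntax-check catches YAML mistakes.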
Step 6: Write a more advanced Playbook: push configuration files in bulk
In practice you need to do more than install software; you also need consistent configuration files. The template and copy modules push them in bulk.
# Create the templates and files directories
$ mkdir -p templates files
# Prepare the Nginx configuration template (variables use Jinja2 syntax).
# It is deployed as the main nginx.conf, so the server block must sit inside the events/http contexts.
$ cat > templates/nginx.conf.j2 << 'EOF'
events {}
http {
    include /etc/nginx/mime.types;
    server {
        listen {{ nginx_port | default(80) }};
        server_name {{ server_name | default('_') }};
        location / {
            root {{ web_root | default('/usr/share/nginx/html') }};
            index index.html index.htm;
        }
        access_log /var/log/nginx/{{ inventory_hostname }}_access.log;
        error_log /var/log/nginx/{{ inventory_hostname }}_error.log;
    }
}
EOF
# Create the Playbook that pushes the configuration
$ cat > deploy-config.yml << 'EOF'
---
- name: Deploy Nginx configuration to web servers
  hosts: webservers
  become: true
  vars:
    nginx_port: 8080
    web_root: /var/www/html
  tasks:
    - name: Backup existing nginx.conf
      ansible.builtin.copy:
        src: /etc/nginx/nginx.conf
        dest: /etc/nginx/nginx.conf.backup-{{ ansible_date_time.epoch }}
        remote_src: yes
      changed_when: true
    - name: Deploy new nginx.conf from template
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        # validate must contain %s, which Ansible replaces with the path of the rendered temporary file
        validate: '/usr/sbin/nginx -t -c %s'
        mode: '0644'
    - name: Test nginx configuration
      ansible.builtin.command: nginx -t
      register: nginx_test
      changed_when: false
    - name: Reload Nginx to apply new config
      ansible.builtin.service:
        name: nginx
        state: reloaded
EOF
# Push the configuration
$ ansible-playbook deploy-config.yml
PLAY [Deploy Nginx configuration to web servers] *************************************
TASK [Gathering Facts] *****************************************************************
ok: [192.168.1.101]
ok: [192.168.1.102]
ok: [192.168.1.103]
TASK [Backup existing nginx.conf] *****************************************************
changed: [192.168.1.101]
changed: [192.168.1.102]
changed: [192.168.1.103]
TASK [Deploy new nginx.conf from template] *******************************************
changed: [192.168.1.101]
changed: [192.168.1.102]
changed: [192.168.1.103]
TASK [Test nginx configuration] *******************************************************
ok: [192.168.1.101]
ok: [192.168.1.102]
ok: [192.168.1.103]
TASK [Reload Nginx to apply new config] **********************************************
changed: [192.168.1.101]
changed: [192.168.1.102]
changed: [192.168.1.103]
PLAY RECAP *****************************************************************************
192.168.1.101 : ok=5 changed=3 unreachable=0 failed=0
192.168.1.102 : ok=5 changed=3 unreachable=0 failed=0
192.168.1.103 : ok=5 changed=3 unreachable=0 failed=0
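The Playbook above reloads Nginx on every run, even when the configuration did not change. A common alternative is a handler, which runs only when a task reports a change. The sketch below is one way to write that. The file name deploy-config-handler.yml is an assumption, and it assumes templates/nginx.conf.j2 renders a complete nginx.conf so that the validate command can check it on its own. The backup option of the template module stands in for the separate backup task.

```shell
# Sketch: reload Nginx only when the rendered configuration actually changed.
# deploy-config-handler.yml is a hypothetical file name for this example.
cat > deploy-config-handler.yml << 'EOF'
---
- name: Deploy Nginx configuration and reload only on change
  hosts: webservers
  become: true
  vars:
    nginx_port: 8080
    web_root: /var/www/html
  tasks:
    - name: Deploy nginx.conf from template
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: '0644'
        backup: true                  # keep a timestamped copy of the old file
        validate: 'nginx -t -c %s'    # %s is the rendered temporary file
      notify: Reload Nginx
  handlers:
    # Runs once at the end of the play, and only if a task notified it
    - name: Reload Nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
EOF
```

On a second run with no template changes, the deploy task should report ok and the handler should not run, so the recap shows changed=0.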
Step 7: Run bulk commands in ad-hoc mode
When a full Playbook is overkill, run a one-off operation across hosts with an ad-hoc command.
# Check the load on all servers
$ ansible all -i inventory.ini -m shell -a "uptime && free -h"
192.168.1.101 | CHANGED | rc=0 >>
10:32:15 up 45 days, 3:22, 2 users, load average: 0.52, 0.48, 0.51
total used free shared buff/cache available
Mem: 7.6Gi 2.1Gi 3.2Gi 125Mi 2.3Gi 5.2Gi
Swap: 2.0Gi 0B 2.0Gi
# Check disk usage on all servers (pipes need the shell module; the command module does not go through a shell)
$ ansible all -i inventory.ini -m shell -a "df -h | grep -E '/$|Avail'"
192.168.1.102 | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        50G   18G   30G  38% /
# Restart a service on one group of hosts
$ ansible webservers -i inventory.ini -m systemd -a "name=nginx state=restarted"
192.168.1.101 | CHANGED => {
"changed": true,
"name": "nginx",
"state": "started"
}
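3. FAQ
Q1: Runs are slow when managing 100 servers. How do I speed them up?
By default Ansible works on only 5 hosts at a time (forks = 5), so 100 hosts are processed in many rounds. Raise the parallelism and reuse SSH connections, and 50 to 100 production hosts are no problem.
# Raise parallelism in ansible.cfg. Add the key to the existing [defaults] section from Step 4;
# appending a second [defaults] header makes Ansible fail with a duplicate-section error.
$ sed -i '/^\[defaults\]/a forks = 50' ansible.cfg
# pipelining = True is already set under [ssh_connection]; optionally pin the control socket path there too
$ sed -i '/^\[ssh_connection\]/a control_path = /tmp/ansible-ssh-%%h-%%p-%%r' ansible.cfg
# Or set the parallelism on the command line
$ ansible-playbook install-nginx.yml -i inventory.ini --forks 50
# The first run opens persistent SSH connections (ControlPersist), so later runs are much faster
# Author's measurement: installing software on 50 servers dropped from 5 minutes to 20 seconds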
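To get a feel for what forks changes, here is a rough back-of-the-envelope model: each task is handed to at most forks hosts at a time, so the number of rounds per task is roughly the host count divided by forks, rounded up. The function name batches is made up for this sketch. Real runs also depend on network latency and how long each task takes, so this is only an estimate.

```shell
# Sketch: estimate how many rounds one task needs for a given fork count.
# batches is a hypothetical helper; the model ignores per-host timing differences.
batches() {
  hosts=$1
  forks=$2
  # Integer ceiling of hosts / forks
  echo $(( (hosts + forks - 1) / forks ))
}

echo "100 hosts, forks=5:  $(batches 100 5) rounds per task"
echo "100 hosts, forks=50: $(batches 100 50) rounds per task"
```

With the default of 5 forks, 100 hosts need about 20 rounds per task. With 50 forks they need about 2, which is where most of the speedup comes from.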
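Q2: Some hosts failed because of network problems. How do I rerun only the failed hosts?
Pass the hosts to --limit, or let Ansible write a retry file and feed it back in.
# Rerun only the three failed machines
$ ansible-playbook install-nginx.yml -i inventory.ini --limit 192.168.1.101,192.168.1.102,192.168.1.103
# Smarter: use a retry file. Retry files are disabled by default, so enable them for the run
$ ANSIBLE_RETRY_FILES_ENABLED=True ansible-playbook install-nginx.yml -i inventory.ini
# On failure, Ansible writes <playbook>.retry next to the playbook, one failed host per line
$ cat install-nginx.retry
192.168.1.105
192.168.1.109
# Rerun on just those hosts
$ ansible-playbook install-nginx.yml -i inventory.ini --limit @install-nginx.retry
# Optional wrapper Playbook that reruns only when a retry file exists
$ cat > retry-failed.yml << 'EOF'
---
- name: Retry failed hosts
  hosts: localhost
  connection: local
  gather_facts: false
  become: false
  vars:
    project_dir: /opt/ansible-projects/web-cluster
  tasks:
    - name: Check for a retry file
      ansible.builtin.stat:
        path: "{{ project_dir }}/install-nginx.retry"
      register: retry_file
    - name: Retry playbook on failed hosts
      ansible.builtin.command:
        cmd: ansible-playbook install-nginx.yml -i inventory.ini --limit @install-nginx.retry
        chdir: "{{ project_dir }}"
      when: retry_file.stat.exists
EOF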
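If you would rather see the failed hosts spelled out on the command line, a retry file can be turned into a comma-separated --limit value. The sketch below assumes a retry file with one host per line, which is the format Ansible writes when retry files are enabled. The file sample.retry and its contents are made up for the example.

```shell
# Sketch: turn a retry file (one host per line) into a --limit list.
# sample.retry is a hypothetical file created only for this example.
cat > sample.retry << 'EOF'
192.168.1.105
192.168.1.109
EOF

# paste -s joins all lines into one; -d, uses a comma as the separator
limit_list=$(paste -sd, sample.retry)

# Print the command instead of running it, so it can be reviewed first
echo "ansible-playbook install-nginx.yml -i inventory.ini --limit $limit_list"
```

This is handy for pasting the exact host list into a change ticket before rerunning.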
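Q3: How do I get detailed logs or do a dry run (check mode)?
Verbose output and check mode are both essential; rehearse every production change before running it.
# Check mode: simulate the run without changing anything (add --diff to see file changes)
$ ansible-playbook install-nginx.yml -i inventory.ini --check
# Increase the verbosity level
$ ansible-playbook install-nginx.yml -i inventory.ini -v # basic details
$ ansible-playbook install-nginx.yml -i inventory.ini -vv # more detail
$ ansible-playbook install-nginx.yml -i inventory.ini -vvv # connection details
$ ansible-playbook install-nginx.yml -i inventory.ini -vvvv # includes SSH debugging
# Save the full run log to a file
$ ansible-playbook install-nginx.yml -i inventory.ini 2>&1 | tee ansible-run-$(date +%Y%m%d-%H%M%S).log
# Resume from a specific task (--start-at-task skips everything before it; it does not measure timing)
$ ansible-playbook install-nginx.yml -i inventory.ini --start-at-task="Install Nginx (CentOS/RHEL)"
# Show how long each task takes with the profile_tasks callback (part of the ansible.posix collection)
$ ANSIBLE_CALLBACKS_ENABLED=ansible.posix.profile_tasks ansible-playbook install-nginx.yml -i inventory.ini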
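Q4: How do I manage inventories for different environments (test, production)?
Keep the test and production inventories separate, each with its own group variables.
# Directory layout
$ mkdir -p inventories/{test,prod}
# Test inventory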
$ cat > inventories/test/inventory.ini << 'EOF'
[webservers]
test-web-01 ansible_host=10.0.1.101
test-web-02 ansible_host=10.0.1.102
[databases]
test-db-01 ansible_host=10.0.1.111
EOF
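# Production inventory. Host ranges use [start:end] and are expanded only in host names,
# not inside variable values such as ansible_host, so list the address ranges directly
$ cat > inventories/prod/inventory.ini << 'EOF'
[webservers]
192.168.1.[1:50]
192.168.2.[1:50]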
[databases]
prod-db-01 ansible_host=192.168.10.101
prod-db-02 ansible_host=192.168.10.102
[webservers:vars]
max_connections=1000
EOF
# Choose the inventory at run time
$ ansible-playbook -i inventories/test/inventory.ini deploy-config.yml
$ ansible-playbook -i inventories/prod/inventory.ini deploy-config.yml
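Per-environment variables can also live in group_vars directories next to each inventory file, which Ansible loads automatically for the matching group. The sketch below shows one possible layout. The production max_connections value comes from the inventory above, while the test value and the nginx_port values are assumptions for illustration.

```shell
# Sketch: per-environment group variables stored beside each inventory.
# The test values and nginx_port settings are hypothetical examples.
mkdir -p inventories/test/group_vars inventories/prod/group_vars

cat > inventories/test/group_vars/webservers.yml << 'EOF'
---
nginx_port: 8080
max_connections: 100
EOF

cat > inventories/prod/group_vars/webservers.yml << 'EOF'
---
nginx_port: 80
max_connections: 1000
EOF
```

Variables set in a Playbook's vars block take precedence over inventory group_vars, so deploy-config.yml would need its own nginx_port entry removed before these per-environment values take effect.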
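4. Summary
Bulk management with Ansible comes down to three things: working passwordless SSH, a well-grouped inventory, and clearly written Playbooks. Get comfortable with these three and managing 100 servers is no different from managing one.
Key points:
- SSH keys plus ansible_ssh_private_key_file are the prerequisite for bulk management
- Group hosts in the inventory, and keep production and test separate
- In Playbooks, use the yum module on CentOS and the apt module on Ubuntu, selected with when conditions
- Set forks=50 in ansible.cfg for parallel execution
- Before a production change, simulate it with --check
- The template module uses Jinja2 syntax and supports variable substitution, which makes it more flexible than copy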
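Go set up an environment and practice. Managing 100 servers with Ansible is a basic operations skill, and the days of editing hosts by hand, one SSH session at a time, should be over.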