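Managing 100 Servers in Bulk with Ansible

Published: 2026-04-20 20:56
Anyone who has done this knows the worst part: logging in to servers one by one to change configuration by hand, working from 2 a.m. to 6 a.m., and still missing a few. That is how production incidents happen. This article explains how to manage 100 servers in bulk with Ansible: no gimmicks, straight to hands-on practice.

I. Introduction

When you have 50 or 100 servers that need the same configuration change, patch, or service deployment, SSHing into each one and working by hand is asking for trouble: it is slow and error-prone. This article walks through setting up bulk management with Ansible, from installation and configuration to running real Playbooks, covering the two mainstream distribution families, CentOS and Ubuntu.

II. Steps

Step 1: Check the Python environment and install Ansible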

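Start by checking the environment on the control node. Ansible is written in Python: the ansible-core 2.14 release shown below needs Python 3.9 or newer on the control node, while managed nodes can run Python 2.7 or Python 3.5 and later.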

    # Check the Python version (CentOS/RHEL)
    $ python3 --version
    Python 3.9.16

    # On Ubuntu, install Python and pip with apt
    $ sudo apt update && sudo apt install -y python3 python3-pip

    # Install Ansible with pip (recommended, because you control the version)
    $ pip3 install ansible ansible-core

    # Verify the installation
    $ ansible --version
    ansible [core 2.14.1]
      config file = /etc/ansible/ansible.cfg
      configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
      ansible python module location = /usr/lib/python3.9/site-packages/ansible
      python version = 3.9.1
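Before installing, it can help to script the version check rather than eyeball it. The sketch below assumes a control-node minimum of Python 3.9 (my understanding of the requirement for ansible-core 2.14; adjust `required` for other releases) and relies on GNU `sort -V`. The helper name `version_ge` is hypothetical.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when dotted version A is >= B.
# Relies on GNU sort -V (version sort): the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="3.9"   # assumption: control-node minimum for ansible-core 2.14

if command -v python3 >/dev/null 2>&1; then
    current="$(python3 -c 'import platform; print(platform.python_version())')"
    if version_ge "$current" "$required"; then
        echo "python3 $current is new enough (>= $required)"
    else
        echo "python3 $current is too old, need >= $required"
    fi
else
    echo "python3 not found on this node"
fi
```

The comparison is done on version strings instead of parsing `python3 --version`, so the same helper can be reused for other tools.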

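Step 2: Set up passwordless SSH login (the key step)

Bulk management depends on the Ansible control node being able to log in to every managed host without a password. Generate a key pair, then distribute the public key to all servers.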

    # Generate a key pair; press Enter at each prompt to accept the defaults
    $ ssh-keygen -t ed25519 -C "ansible@control-node" -f ~/.ssh/ansible_key

    # Copy the key to one host by hand first (the password is needed this one time)
    $ ssh-copy-id -i ~/.ssh/ansible_key.pub root@192.168.1.101
    /usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: /root/.ssh/ansible_key.pub
    /usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s)
    root@192.168.1.101's password:
    Number of key(s) added: 1

    # Verify that passwordless login works
    $ ssh -i ~/.ssh/ansible_key root@192.168.1.101 "hostname -I"
    192.168.1.101
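The commands above cover a single host. For the remaining machines, one possible approach is a small loop over a host list, sketched below. The function name `push_key`, the `DRY_RUN` switch, and the host-list file are all hypothetical; by default the script only prints the `ssh-copy-id` commands so you can review them first.

```shell
#!/usr/bin/env bash
# Assumptions: the key path matches the one generated above, and every
# host accepts root login with a password for this one-time copy.
KEY_PUB="${KEY_PUB:-$HOME/.ssh/ansible_key.pub}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# push_key HOSTS_FILE: one host per line; blank lines and # comments are skipped.
push_key() {
    local hosts_file="$1" host
    # Read the list on fd 3 so ssh-copy-id cannot swallow it from stdin.
    while IFS= read -r host <&3; do
        case "$host" in ''|'#'*) continue ;; esac
        if [ "$DRY_RUN" = "1" ]; then
            echo "ssh-copy-id -i $KEY_PUB root@$host"
        else
            ssh-copy-id -i "$KEY_PUB" "root@$host" || echo "FAILED: $host" >&2
        fi
    done 3< "$hosts_file"
}

# Example: preview the commands for three hosts
hosts_file="$(mktemp)"
printf '%s\n' 192.168.1.101 192.168.1.102 192.168.1.103 > "$hosts_file"
push_key "$hosts_file"
rm -f "$hosts_file"
```

Set `DRY_RUN=0` once the printed commands look right; each host will still prompt for its password once.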

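Step 3: Configure the Ansible inventory

List every server you want to manage in an inventory file, organized into groups so that operations can target a whole group at once.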

    # Create the project directory
    $ mkdir -p /opt/ansible-projects/web-cluster && cd /opt/ansible-projects/web-cluster

    # Write the inventory file
    $ cat > inventory.ini << 'EOF'
    [webservers]
    192.168.1.101 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
    192.168.1.102 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
    192.168.1.103 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key

    [databases]
    192.168.1.111 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key
    192.168.1.112 ansible_user=root ansible_ssh_private_key_file=~/.ssh/ansible_key

    [all:vars]
    ansible_port=22
    ansible_python_interpreter=/usr/bin/python3
    EOF

    # Test connectivity to every host (the ping module checks SSH login and Python, not ICMP)
    $ ansible all -i inventory.ini -m ping
    192.168.1.101 | SUCCESS => {
        "ansible_facts": {
            "discovered_interpreter_python": "/usr/bin/python3"
        },
        "changed": false,
        "ping": "pong"
    }
    192.168.1.102 | SUCCESS => {
        "changed": false,
        "ping": "pong"
    }
    192.168.1.111 | SUCCESS => {
        "changed": false,
        "ping": "pong"
    }
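Typing 100 host lines by hand invites the same mistakes the article is trying to avoid. As a rough sketch, a shell function can print the host lines for a contiguous address range. The function name `gen_hosts` and the `web-NN` alias scheme are illustrative assumptions, not part of the setup above.

```shell
#!/usr/bin/env bash
# gen_hosts PREFIX SUBNET FIRST LAST
# Prints one inventory line per host: "<PREFIX>-NN ansible_host=<SUBNET>.<octet>".
gen_hosts() {
    local prefix="$1" subnet="$2" first="$3" last="$4"
    local octet="$first" n=1
    while [ "$octet" -le "$last" ]; do
        printf '%s-%02d ansible_host=%s.%d\n' "$prefix" "$n" "$subnet" "$octet"
        octet=$((octet + 1))
        n=$((n + 1))
    done
}

# Example: a [webservers] group covering 192.168.1.101 through 192.168.1.103
echo "[webservers]"
gen_hosts web 192.168.1 101 103
# prints:
# [webservers]
# web-01 ansible_host=192.168.1.101
# web-02 ansible_host=192.168.1.102
# web-03 ansible_host=192.168.1.103
```

Redirect the output into an inventory file and keep shared settings such as `ansible_user` in an `[all:vars]` section instead of repeating them on every line.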

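Step 4: Set global parameters in ansible.cfg

    # Create an ansible.cfg with tuned settings
    # (fact gathering is controlled by "gathering"; "gather_facts" is a play keyword, not a config key)
    $ cat > ansible.cfg << 'EOF'
    [defaults]
    inventory = inventory.ini
    host_key_checking = False
    timeout = 30
    gathering = implicit

    [privilege_escalation]
    become = True
    become_method = sudo
    become_user = root
    become_ask_pass = False

    [ssh_connection]
    pipelining = True
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no
    EOF

    # Confirm the settings are picked up (the source in parentheses should be this file, not "default")
    $ ansible-config dump | grep HOST_KEY_CHECKING
    HOST_KEY_CHECKING(/opt/ansible-projects/web-cluster/ansible.cfg) = False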

Step 5: Write your first Playbook (bulk Nginx install)

Here is a Playbook that does real work: it installs and starts Nginx on every machine in the webservers group.

    $ cat > install-nginx.yml << 'EOF'
    ---
    - name: Install and configure Nginx on web servers
      hosts: webservers
      become: true
      tasks:
        - name: Install Nginx (CentOS/RHEL)
          ansible.builtin.yum:
            name: nginx
            state: present
          when: ansible_os_family == "RedHat"

        - name: Install Nginx (Ubuntu/Debian)
          ansible.builtin.apt:
            name: nginx
            state: present
            update_cache: yes
          when: ansible_os_family == "Debian"

        - name: Start and enable Nginx service
          ansible.builtin.service:
            name: nginx
            state: started
            enabled: yes

        - name: Check Nginx status
          ansible.builtin.shell: systemctl status nginx | head -5
          register: nginx_status
          changed_when: false

        - name: Display Nginx status
          ansible.builtin.debug:
            var: nginx_status.stdout_lines
    EOF

    # Run the Playbook
    $ ansible-playbook install-nginx.yml -i inventory.ini

    PLAY [Install and configure Nginx on web servers] ******************************

    TASK [Gathering Facts] *********************************************************
    ok: [192.168.1.101]
    ok: [192.168.1.102]
    ok: [192.168.1.103]

    TASK [Install Nginx (CentOS/RHEL)] *********************************************
    changed: [192.168.1.101]
    changed: [192.168.1.102]
    changed: [192.168.1.103]

    TASK [Install Nginx (Ubuntu/Debian)] *******************************************
    skipping: [192.168.1.101]
    skipping: [192.168.1.102]
    skipping: [192.168.1.103]

    TASK [Start and enable Nginx service] ******************************************
    ok: [192.168.1.101]
    ok: [192.168.1.102]
    ok: [192.168.1.103]

    TASK [Check Nginx status] ******************************************************
    ok: [192.168.1.101]
    ok: [192.168.1.102]
    ok: [192.168.1.103]

    TASK [Display Nginx status] ****************************************************
    ok: [192.168.1.101] => {
        "nginx_status.stdout_lines": [
            "● nginx.service - The nginx HTTP and reverse proxy server",
            "   Loaded: loaded (/usr/lib/systemd/system/nginx.service; enabled; vendor preset: disabled)",
            "   Active: active (running)"
        ]
    }
    ...

    PLAY RECAP *********************************************************************
    192.168.1.101 : ok=5 changed=1 unreachable=0 failed=0 skipped=1
    192.168.1.102 : ok=5 changed=1 unreachable=0 failed=0 skipped=1
    192.168.1.103 : ok=5 changed=1 unreachable=0 failed=0 skipped=1

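Step 6: Write a more advanced Playbook (bulk-push configuration files)

In real deployments, installing software is only half the job: the configuration files have to be identical everywhere too. Use the template and copy modules to push them in bulk.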

    # Create the templates and files directories
    $ mkdir -p templates files

    # Prepare the Nginx configuration template (variables use Jinja2 syntax).
    # The rendered file replaces /etc/nginx/nginx.conf, so the server block is
    # wrapped in the events and http contexts that nginx requires at top level.
    $ cat > templates/nginx.conf.j2 << 'EOF'
    events {}

    http {
        include /etc/nginx/mime.types;

        server {
            listen {{ nginx_port | default(80) }};
            server_name {{ server_name | default('_') }};

            location / {
                root {{ web_root | default('/usr/share/nginx/html') }};
                index index.html index.htm;
            }

            access_log /var/log/nginx/{{ inventory_hostname }}_access.log;
            error_log /var/log/nginx/{{ inventory_hostname }}_error.log;
        }
    }
    EOF

    # Create the Playbook that pushes the configuration.
    # validate must contain %s, which Ansible replaces with the path of the temporary file.
    $ cat > deploy-config.yml << 'EOF'
    ---
    - name: Deploy Nginx configuration to web servers
      hosts: webservers
      become: true
      vars:
        nginx_port: 8080
        web_root: /var/www/html
      tasks:
        - name: Backup existing nginx.conf
          ansible.builtin.copy:
            src: /etc/nginx/nginx.conf
            dest: /etc/nginx/nginx.conf.backup-{{ ansible_date_time.epoch }}
            remote_src: yes
          changed_when: true

        - name: Deploy new nginx.conf from template
          ansible.builtin.template:
            src: templates/nginx.conf.j2
            dest: /etc/nginx/nginx.conf
            validate: '/usr/sbin/nginx -t -c %s'
            mode: '0644'

        - name: Test nginx configuration
          ansible.builtin.command: nginx -t
          register: nginx_test
          changed_when: false

        - name: Reload Nginx to apply new config
          ansible.builtin.service:
            name: nginx
            state: reloaded
    EOF

    # Push the configuration
    $ ansible-playbook deploy-config.yml

    PLAY [Deploy Nginx configuration to web servers] *******************************

    TASK [Gathering Facts] *********************************************************
    ok: [192.168.1.101]
    ok: [192.168.1.102]
    ok: [192.168.1.103]

    TASK [Backup existing nginx.conf] **********************************************
    changed: [192.168.1.101]
    changed: [192.168.1.102]
    changed: [192.168.1.103]

    TASK [Deploy new nginx.conf from template] *************************************
    changed: [192.168.1.101]
    changed: [192.168.1.102]
    changed: [192.168.1.103]

    TASK [Test nginx configuration] ************************************************
    ok: [192.168.1.101]
    ok: [192.168.1.102]
    ok: [192.168.1.103]

    TASK [Reload Nginx to apply new config] ****************************************
    changed: [192.168.1.101]
    changed: [192.168.1.102]
    changed: [192.168.1.103]

    PLAY RECAP *********************************************************************
    192.168.1.101 : ok=5 changed=3 unreachable=0 failed=0
    192.168.1.102 : ok=5 changed=3 unreachable=0 failed=0
    192.168.1.103 : ok=5 changed=3 unreachable=0 failed=0
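Because the backup task names each copy `nginx.conf.backup-<epoch>`, rolling back means finding the copy with the largest epoch. The helper below is a minimal sketch of that lookup, meant to run on a managed host; the function name `latest_backup` is hypothetical.

```shell
#!/usr/bin/env bash
# latest_backup DIR: print the nginx.conf.backup-<epoch> file in DIR with the
# largest numeric epoch suffix. Returns 1 when no such backup exists.
latest_backup() {
    local dir="$1" f epoch best="" best_epoch=-1
    for f in "$dir"/nginx.conf.backup-*; do
        [ -e "$f" ] || continue
        epoch="${f##*backup-}"
        # Ignore anything whose suffix is not purely numeric
        case "$epoch" in ''|*[!0-9]*) continue ;; esac
        if [ "$epoch" -gt "$best_epoch" ]; then
            best="$f"
            best_epoch="$epoch"
        fi
    done
    if [ -z "$best" ]; then
        return 1
    fi
    printf '%s\n' "$best"
}

# Example rollback on one host (not executed here):
#   cp "$(latest_backup /etc/nginx)" /etc/nginx/nginx.conf && nginx -t && systemctl reload nginx
```

Comparing the suffix numerically avoids relying on file modification times, which a later copy or restore could change.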

Step 7: Run bulk commands in ad-hoc mode

When a full Playbook is overkill, run a one-off operation across hosts with an ad-hoc command.

    # Check load and memory on every server
    $ ansible all -i inventory.ini -m shell -a "uptime && free -h"
    192.168.1.101 | CHANGED | rc=0 >>
     10:32:15 up 45 days,  3:22,  2 users,  load average: 0.52, 0.48, 0.51
                  total        used        free      shared  buff/cache   available
    Mem:          7.6Gi       2.1Gi       3.2Gi       125Mi       2.3Gi       5.2Gi
    Swap:         2.0Gi          0B       2.0Gi

    # Check disk usage on every server
    # (use the shell module here: the command module does not support pipes)
    $ ansible all -i inventory.ini -m shell -a "df -h | grep -E '/$|Avail'"
    192.168.1.102 | CHANGED | rc=0 >>
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/vda1        50G   18G   30G  38% /

    # Restart a service across one group
    $ ansible webservers -i inventory.ini -m systemd -a "name=nginx state=restarted"
    192.168.1.101 | CHANGED => {
        "changed": true,
        "name": "nginx",
        "state": "started"
    }

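III. FAQ

Q1: Runs are slow when managing 100 servers. How do I speed them up?

By default Ansible works on only 5 hosts at a time (forks = 5), so large batches crawl. Raise the parallelism and enable persistent SSH connections, and 50 to 100 production hosts are no problem.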

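    # Enable parallelism in ansible.cfg by adding these keys to the EXISTING sections.
    # Do not append a second [defaults] or [ssh_connection] header: duplicate
    # sections make the config file fail to parse.
    [defaults]
    forks = 50

    [ssh_connection]
    pipelining = True
    control_path = /tmp/ansible-ssh-%%h-%%p-%%r

    # Or set the parallelism on the command line
    $ ansible-playbook install-nginx.yml -i inventory.ini --forks 50

    # The first run sets up persistent SSH connections; later runs are much faster
    # Measured: installing software on 50 servers dropped from 5 minutes to 20 seconds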
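To get a feel for what the forks value buys you, a rough model is that each task is handed to at most `forks` hosts at a time, so a task needs about ceil(hosts / forks) rounds. This is a simplification (Ansible uses a worker pool, and hosts differ in speed), and the helper name `batches` is hypothetical.

```shell
#!/usr/bin/env bash
# batches HOSTS FORKS: ceiling division, i.e. how many rounds one task
# needs when at most FORKS hosts are worked on at the same time.
batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

for forks in 5 50; do
    echo "forks=$forks -> $(batches 100 "$forks") round(s) per task for 100 hosts"
done
# prints:
# forks=5 -> 20 round(s) per task for 100 hosts
# forks=50 -> 2 round(s) per task for 100 hosts
```

Higher fork counts cost memory and CPU on the control node, so it is worth raising the value in steps rather than jumping straight to the host count.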

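Q2: Some machines failed because of network problems. How do I rerun only the failed hosts?

Name the hosts with --limit, or let Ansible record the failed hosts so the Playbook can be rerun against just those.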

    # Rerun only on the three machines that failed
    $ ansible-playbook install-nginx.yml -i inventory.ini --limit 192.168.1.101,192.168.1.102,192.168.1.103

    # Smarter: let Ansible record the failed hosts in a retry file.
    # Retry files are disabled by default; enable them under [defaults] in ansible.cfg:
    #   retry_files_enabled = True
    $ ansible-playbook install-nginx.yml -i inventory.ini

    # After a failed run, a .retry file named after the playbook appears next to it,
    # listing one failed host per line
    $ ls *.retry
    install-nginx.retry
    $ cat install-nginx.retry
    192.168.1.105
    192.168.1.109
    $ ansible-playbook install-nginx.yml -i inventory.ini --limit @install-nginx.retry

    # Optional wrapper Playbook: rerun automatically when a retry file exists
    $ cat > retry-failed.yml << 'EOF'
    ---
    - name: Retry failed hosts
      hosts: localhost
      gather_facts: false
      tasks:
        - name: Get failed hosts
          ansible.builtin.slurp:
            src: /opt/ansible-projects/web-cluster/install-nginx.retry
          register: retry_file
          failed_when: false

        - name: Retry playbook on failed hosts
          ansible.builtin.shell: |
            ansible-playbook install-nginx.yml -i inventory.ini --limit @install-nginx.retry
          args:
            chdir: /opt/ansible-projects/web-cluster
          when: retry_file.content is defined
    EOF
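If you only have the console output of a run, the failed hosts can also be pulled out of the PLAY RECAP block and turned into a `--limit` list. The sketch below assumes the run output was saved to a plain-text log; the function name `failed_hosts` and the sample log are hypothetical.

```shell
#!/usr/bin/env bash
# failed_hosts LOGFILE: print a comma-separated list of hosts whose PLAY RECAP
# line reports failed>0 or unreachable>0.
failed_hosts() {
    awk '
        /^PLAY RECAP/ { in_recap = 1; next }
        in_recap && NF {
            bad = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^(failed|unreachable)=[1-9][0-9]*$/) bad = 1
            }
            if (bad) print $1
        }
    ' "$1" | paste -sd, -
}

# Example with a made-up recap
log="$(mktemp)"
cat > "$log" << 'EOF'
PLAY RECAP *********************************************************************
192.168.1.101 : ok=5 changed=1 unreachable=0 failed=0
192.168.1.105 : ok=1 changed=0 unreachable=0 failed=1
192.168.1.109 : ok=0 changed=0 unreachable=1 failed=0
EOF
failed_hosts "$log"
# prints: 192.168.1.105,192.168.1.109
rm -f "$log"
```

The result can be passed straight to a rerun, for example `ansible-playbook install-nginx.yml -i inventory.ini --limit "$(failed_hosts run.log)"`, where `run.log` stands for whatever file you saved the output to.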

Q3: How do I get detailed logs or do a dry run (check mode)?

Debug output and check mode are both essential. Always rehearse before touching production.

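    # Check mode: simulate the run without changing anything
    $ ansible-playbook install-nginx.yml -i inventory.ini --check

    # Raise the verbosity level
    $ ansible-playbook install-nginx.yml -i inventory.ini -v      # basic details
    $ ansible-playbook install-nginx.yml -i inventory.ini -vv     # more detail
    $ ansible-playbook install-nginx.yml -i inventory.ini -vvv    # connection details
    $ ansible-playbook install-nginx.yml -i inventory.ini -vvvv   # includes SSH debugging

    # Save the complete run log to a file
    $ ansible-playbook install-nginx.yml -i inventory.ini 2>&1 | tee ansible-run-$(date +%Y%m%d-%H%M%S).log

    # Start from a specific task and skip everything before it
    # (this does not report timings)
    $ ansible-playbook install-nginx.yml -i inventory.ini --start-at-task="Install Nginx (CentOS/RHEL)"

    # Show how long each task takes by enabling the profile_tasks callback
    $ ANSIBLE_CALLBACKS_ENABLED=ansible.posix.profile_tasks ansible-playbook install-nginx.yml -i inventory.ini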
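One way to make the rehearsal a habit is to wrap it in a function that runs a syntax check and then a check-mode pass, stopping at the first failure. This is a sketch: the function name `preflight` and the `ANSIBLE_PLAYBOOK` override variable are hypothetical, while `--syntax-check`, `--check`, and `--diff` are standard `ansible-playbook` options.

```shell
#!/usr/bin/env bash
# Command to invoke; overridable so the wrapper can be exercised without Ansible.
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# preflight PLAYBOOK INVENTORY: syntax check, then a check-mode run with diffs.
# Returns non-zero as soon as either stage fails; never applies changes.
preflight() {
    local playbook="$1" inventory="$2"
    "$ANSIBLE_PLAYBOOK" "$playbook" -i "$inventory" --syntax-check || return 1
    "$ANSIBLE_PLAYBOOK" "$playbook" -i "$inventory" --check --diff || return 1
    echo "Dry run passed for $playbook; rerun without --check to apply."
}

# Example (not executed here):
#   preflight install-nginx.yml inventory.ini && \
#       ansible-playbook install-nginx.yml -i inventory.ini
```

Keeping the real run outside the function means a passing dry run never changes a host by itself; applying the Playbook stays an explicit second step.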

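Q4: How should inventories for different environments (test, production) be managed?

Keep test and production inventories strictly separate, each with its own variable groups.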

    # Directory layout
    $ mkdir -p inventories/{test,prod}

    # Test inventory
    $ cat > inventories/test/inventory.ini << 'EOF'
    [webservers]
    test-web-01 ansible_host=10.0.1.101
    test-web-02 ansible_host=10.0.1.102

    [databases]
    test-db-01 ansible_host=10.0.1.111
    EOF

    # Production inventory.
    # Host ranges are written [start:end] and are expanded only in the host name,
    # not in variables such as ansible_host, so the addresses are listed directly.
    $ cat > inventories/prod/inventory.ini << 'EOF'
    [webservers]
    192.168.[1:2].[1:50]

    [databases]
    prod-db-01 ansible_host=192.168.10.101
    prod-db-02 ansible_host=192.168.10.102

    [webservers:vars]
    max_connections=1000
    EOF

    # Choose the inventory at run time
    $ ansible-playbook -i inventories/test/inventory.ini deploy-config.yml
    $ ansible-playbook -i inventories/prod/inventory.ini deploy-config.yml
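With two inventories side by side, a typo in the `-i` path is an easy way to hit the wrong environment. A small lookup function, sketched below, maps an environment name to its inventory path and rejects anything else. The function name `inventory_for` is hypothetical; the paths follow the layout above.

```shell
#!/usr/bin/env bash
# inventory_for ENV: print the inventory path for "test" or "prod",
# and fail for any other name.
inventory_for() {
    case "$1" in
        test|prod)
            printf 'inventories/%s/inventory.ini\n' "$1"
            ;;
        *)
            echo "unknown environment: $1 (expected test or prod)" >&2
            return 1
            ;;
    esac
}

# Example (not executed here):
#   inv="$(inventory_for test)" && ansible-playbook -i "$inv" deploy-config.yml
```

Because the function fails on unknown names, the `&&` in the example stops the Playbook from running when the environment is misspelled.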

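IV. Summary

Bulk management with Ansible comes down to three things: passwordless SSH, an inventory organized into groups, and clearly written Playbooks. Once those are second nature, managing 100 servers is no different from managing one.

Key points:

  • SSH keys plus ansible_ssh_private_key_file are the prerequisite for bulk management
  • Organize the inventory into groups, and keep production and test separate
  • In Playbooks, use the yum module for CentOS and the apt module for Ubuntu, selected with when conditions
  • Set forks=50 in ansible.cfg for parallel execution
  • Rehearse with --check mode before any production change
  • The template module uses Jinja2 syntax and supports variable substitution, which makes it more flexible than copy

Further reading:

  • Ansible official documentation: https://docs.ansible.com/
  • Full ansible.cfg parameter list: the configuration settings reference
  • AWX/Ansible Tower: a web UI and access-control solution for large-scale management
  • ansible-vault: encrypted storage for sensitive data such as database passwords and API keys

Go set up an environment and practice. Managing 100 servers with Ansible is a basic operations skill, and the era of changing machines one by one over manual SSH should be over.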