
Azure Kubernetes Node Addition Runbook

Overview

This runbook provides step-by-step instructions for adding new Azure Virtual Machines to an existing Kubernetes cluster installed via Kubespray.

Prerequisites

  • Access to Azure CLI with appropriate permissions
  • SSH access to the new VM
  • Access to the existing Kubernetes cluster
  • Kubespray installation directory

Pre-Installation Checklist

1. Verify New VM Details

# Get VM details from Azure (-d/--show-details is needed to populate the IP fields)
az vm show -d --resource-group <RESOURCE_GROUP> --name <VM_NAME> --query "{name:name,ip:publicIps,privateIp:privateIps}" -o table

2. Verify SSH Access

# Test SSH connection to the new VM
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
# You will be prompted for a password

3. Verify Network Connectivity

# From the new VM, test connectivity to existing cluster
ping <EXISTING_MASTER_IP>
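ICMP alone does not prove the cluster ports are open. A minimal sketch that probes the usual control-plane ports from the new VM (assumes default Kubernetes/etcd ports; `<EXISTING_MASTER_IP>` is a placeholder):

```shell
# check_cluster_ports: report whether required control-plane ports are reachable.
# Ports assume Kubernetes defaults: 6443 (API server), 2379/2380 (etcd), 10250 (kubelet).
check_cluster_ports() {
  local master_ip="$1" port
  for port in 6443 2379 2380 10250; do
    # /dev/tcp is a bash feature; timeout bounds slow (filtered) connections
    if timeout 3 bash -c "cat < /dev/null > /dev/tcp/${master_ip}/${port}" 2>/dev/null; then
      echo "port ${port}: reachable"
    else
      echo "port ${port}: BLOCKED"
    fi
  done
}
```

Usage: `check_cluster_ports <EXISTING_MASTER_IP>` — any `BLOCKED` line points at an NSG or firewall rule to fix before continuing.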

Step-by-Step Process

Step 1: Update Ansible Inventory

  1. Navigate to Kubespray directory
cd freeleaps-ops/3rd/kubespray
  2. Edit the inventory file
vim ../cluster/ansible/manifests/inventory.ini
  3. Add the new node to the appropriate group

For a worker node:

[kube_node]
# Existing nodes...
prod-usw2-k8s-freeleaps-worker-nodes-06 ansible_host=<NEW_VM_PRIVATE_IP> ansible_user=wwwadmin@mathmast.com host_name=prod-usw2-k8s-freeleaps-worker-nodes-06

For a master node:

[kube_control_plane]
# Existing nodes...
prod-usw2-k8s-freeleaps-master-03 ansible_host=<NEW_VM_PRIVATE_IP> ansible_user=wwwadmin@mathmast.com etcd_member_name=freeleaps-etcd-03 host_name=prod-usw2-k8s-freeleaps-master-03

Step 2: Verify Inventory Configuration

  1. Check inventory syntax
ansible-inventory -i ../cluster/ansible/manifests/inventory.ini --list
  2. Test connectivity to new node
ansible -i ../cluster/ansible/manifests/inventory.ini kube_node -m ping -kK

Step 3: Run Kubespray Scale Playbook

  1. Execute the scale playbook
cd ../cluster/ansible/manifests
ansible-playbook -i inventory.ini ../../3rd/kubespray/scale.yml -kK -b

Note:

  • -k prompts for SSH password
  • -K prompts for sudo password
  • -b enables privilege escalation
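To avoid re-running tasks against existing nodes, the scale playbook can be limited to the new host. A sketch wrapping the command above (`--limit` is a standard ansible-playbook flag; the node name is an example):

```shell
# run_scale: run the Kubespray scale playbook against a single new node.
# Set DRY_RUN=1 to print the command instead of executing it.
run_scale() {
  local node="$1"
  local cmd=(ansible-playbook -i inventory.ini ../../3rd/kubespray/scale.yml -kK -b --limit "$node")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Usage: `run_scale prod-usw2-k8s-freeleaps-worker-nodes-06`, run from the manifests directory so the relative paths resolve.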

Step 4: Verify Node Addition

  1. Check node status
kubectl get nodes
  2. Verify node is ready
kubectl describe node <NEW_NODE_NAME>
  3. Check node labels
kubectl get nodes --show-labels
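When scripting this verification, a small helper can gate on the node reporting Ready by parsing `kubectl get nodes` output (a sketch; assumes the default table layout where column 2 is STATUS):

```shell
# node_ready: exit 0 if the named node shows STATUS "Ready" in `kubectl get nodes` output.
# Usage: kubectl get nodes | node_ready <NEW_NODE_NAME>
node_ready() {
  awk -v n="$1" '$1 == n && $2 == "Ready" { found = 1 } END { exit !found }'
}
```

Alternatively, `kubectl wait --for=condition=Ready node/<NEW_NODE_NAME> --timeout=300s` blocks until the node is Ready or the timeout expires.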

Step 5: Post-Installation Verification

  1. Test pod scheduling
# Create a test pod to verify scheduling, then clean it up
kubectl run test-pod --image=nginx --restart=Never
kubectl get pod test-pod -o wide
kubectl delete pod test-pod
  2. Check node resources (requires metrics-server)
kubectl top nodes
  3. Verify node components
kubectl get pods -n kube-system -o wide | grep <NEW_NODE_NAME>
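The component check can be made mechanical by counting kube-system pods scheduled on the node. A sketch that matches the node name against any column of `kubectl get pods -o wide` output (robust to column shifts in the RESTARTS field):

```shell
# pods_on_node: count pods whose NODE column matches the given node name
# in `kubectl get pods -o wide` output (matches any whitespace-separated field).
pods_on_node() {
  awk -v n="$1" 'NR > 1 { for (i = 1; i <= NF; i++) if ($i == n) { c++; break } } END { print c + 0 }'
}
```

Usage: `kubectl get pods -n kube-system -o wide | pods_on_node <NEW_NODE_NAME>` — a result of 0 means no system components landed on the node and warrants investigation.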

Troubleshooting

Common Issues

1. SSH Connection Failed

# Verify VM is running (-d/--show-details is needed to populate powerState)
az vm show -d --resource-group <RESOURCE_GROUP> --name <VM_NAME> --query "powerState"

# Check network security groups
az network nsg rule list --resource-group <RESOURCE_GROUP> --nsg-name <NSG_NAME>

2. Ansible Connection Failed

# Test with verbose output
ansible -i ../cluster/ansible/manifests/inventory.ini kube_node -m ping -kK -vvv

3. Node Not Ready

# Check node conditions
kubectl describe node <NEW_NODE_NAME>

# Check kubelet logs (kubelet runs as a systemd service on the node, not as a pod)
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP> "sudo journalctl -u kubelet --no-pager -n 100"

4. Pod Scheduling Issues

# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

# Check node capacity
kubectl describe node <NEW_NODE_NAME> | grep -A 10 "Capacity"
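To confirm the new node carries no unexpected taints, the custom-columns output above can be checked mechanically (a sketch; `<none>` is what kubectl prints for an empty taint list):

```shell
# node_untainted: exit 0 if the named node's TAINTS column is "<none>".
# Usage: kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints \
#          | node_untainted <NEW_NODE_NAME>
node_untainted() {
  awk -v n="$1" '$1 == n { if ($2 == "<none>") ok = 1; seen = 1 } END { exit !(seen && ok) }'
}
```

A non-zero exit means the node is either missing from the output or carries a taint that may block scheduling.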

Recovery Procedures

If Scale Playbook Fails

  1. Clean up the failed node
kubectl delete node <NEW_NODE_NAME>
  2. Restart the VM
# Restart the VM (note: a restart clears transient state but does not re-image the OS;
# re-deploy the VM if a truly clean state is required)
az vm restart --resource-group <RESOURCE_GROUP> --name <VM_NAME>
  3. Retry the scale playbook
ansible-playbook -i inventory.ini ../../3rd/kubespray/scale.yml -kK -b

If Node is Stuck in NotReady State

  1. Check kubelet service
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
sudo systemctl status kubelet
  2. Restart kubelet
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
sudo systemctl restart kubelet

Security Considerations

1. Network Security

  • Ensure the new VM is in the correct subnet
  • Verify network security group rules allow cluster communication
  • Check firewall rules if applicable

2. Access Control

  • Use SSH key-based authentication when possible
  • Limit sudo access to necessary commands
  • Monitor node access logs

3. Compliance

  • Ensure the new node meets security requirements
  • Verify all required security patches are applied
  • Check compliance with organizational policies

Monitoring and Maintenance

1. Node Health Monitoring

# Set up monitoring for the new node
kubectl get nodes -o wide
kubectl top nodes

2. Resource Monitoring

# Monitor resource usage
kubectl describe node <NEW_NODE_NAME> | grep -A 5 "Allocated resources"
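For trend tracking, the CPU requests percentage can be pulled out of the "Allocated resources" block. A sketch that assumes the usual `kubectl describe node` line format (e.g. `cpu  1500m (37%)`):

```shell
# cpu_requests_pct: extract the CPU requests percentage from the
# "Allocated resources" section of `kubectl describe node` output.
cpu_requests_pct() {
  awk '$1 == "cpu" { gsub(/[()%]/, "", $3); print $3; exit }'
}
```

Usage: `kubectl describe node <NEW_NODE_NAME> | cpu_requests_pct` prints a bare number suitable for alerting thresholds.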

3. Log Monitoring

# Monitor kubelet logs
kubectl logs -n kube-system kubelet-<NEW_NODE_NAME> --tail=100 -f

Rollback Procedures

If Node Addition Causes Issues

  1. Cordon the node
kubectl cordon <NEW_NODE_NAME>
  2. Drain the node
kubectl drain <NEW_NODE_NAME> --ignore-daemonsets --delete-emptydir-data
  3. Remove the node
kubectl delete node <NEW_NODE_NAME>
  4. Update inventory
# Remove the node from inventory.ini
vim ../cluster/ansible/manifests/inventory.ini
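The cordon/drain/delete sequence above can be wrapped in one guarded helper that stops at the first failure (a sketch; set DRY_RUN=1 to preview the commands; the inventory edit remains a manual step):

```shell
# rollback_node: cordon, drain, and delete a node in order, stopping on the first failure.
# Set DRY_RUN=1 to print the commands instead of executing them.
rollback_node() {
  local node="$1" step
  for step in \
    "kubectl cordon ${node}" \
    "kubectl drain ${node} --ignore-daemonsets --delete-emptydir-data" \
    "kubectl delete node ${node}"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$step"
    else
      $step || return 1  # intentional word-splitting: each step is a simple command
    fi
  done
}
```

Usage: `DRY_RUN=1 rollback_node <NEW_NODE_NAME>` to review, then re-run without DRY_RUN.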

Documentation

Required Information

  • VM name and IP address
  • Resource group and subscription
  • Node role (worker/master)
  • Date and time of addition
  • Person performing the addition

Post-Addition Checklist

  • Node appears in kubectl get nodes
  • Node status is Ready
  • Pods can be scheduled on the node
  • All node components are running
  • Monitoring is configured
  • Documentation is updated

Emergency Contacts

  • Infrastructure Team: [Contact Information]
  • Kubernetes Administrators: [Contact Information]
  • Azure Support: [Contact Information]

Last Updated: [Date]
Version: 1.0
Author: [Name]