DNS Failover
Health-check-driven DNS failover with automatic primary/secondary switching and recovery detection. Cloudflare integration for zero-downtime routing.
DNS failover ensures your sites stay online even when a shield server goes down. When Lumos Gate detects that a primary server is unreachable, it automatically updates DNS records to route traffic to a secondary server. When the primary recovers, traffic is routed back -- all without manual intervention.
This feature requires at least two servers assigned to a domain. See Multiple Servers for setup instructions.
How It Works
The WebSocket server runs a maintenance loop every 5 minutes. During each cycle, it triggers a health check which performs two checks: DNS failover and SSL certificate expiry. The failover check follows this lifecycle:
Detection
- The health check loads all domains that have failover configured (i.e., assigned to multiple servers with a DNS provider integration).
- For each domain, the function evaluates the primary server's health using two criteria:
  - Status: Is the server status `offline` or `error`?
  - Last seen: Has the server been unseen for more than 120 seconds (the health timeout constant)?
- If either condition is true, the primary is considered down.
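The two detection criteria can be sketched as a small predicate. This is an illustrative sketch only, not the Lumos Gate implementation: the names `HEALTH_TIMEOUT`, `DOWN_STATUSES`, and `is_primary_down` are hypothetical, and only the status values and the 120-second threshold come from the text above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names; the 120-second value is the health timeout described above.
HEALTH_TIMEOUT = timedelta(seconds=120)
DOWN_STATUSES = {"offline", "error"}


def is_primary_down(status: str, last_seen: datetime, now: datetime) -> bool:
    """Return True if either detection criterion marks the primary as down."""
    status_bad = status in DOWN_STATUSES
    stale = (now - last_seen) > HEALTH_TIMEOUT  # "more than 120 seconds"
    return status_bad or stale


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_primary_down("online", now - timedelta(seconds=30), now))   # False
print(is_primary_down("online", now - timedelta(seconds=121), now))  # True
print(is_primary_down("error", now, now))                            # True
```

Because the conditions are combined with `or`, a server that still reports a healthy status but has gone silent for more than two minutes is treated as down.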
Failover
- When the primary is down and the domain is not already in `failed_over` state, the system looks for a healthy secondary server.
- The DNS record is updated via the Cloudflare API to point to the secondary server's IP address.
- The `failoverStatus` for the domain is set to `failed_over` in the database.
- A failover event is recorded in the failover events log.
- A `failover_triggered` notification is dispatched to the account owner.
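For readers curious what the DNS update step might look like on the wire, here is a minimal sketch that builds (but does not send) a request against Cloudflare's public v4 DNS records endpoint. How Lumos Gate actually calls the API is not documented here, so the function name `build_dns_update` and its parameters are assumptions, and the zone ID, record ID, and token are placeholders.

```python
import json
import urllib.request

CLOUDFLARE_API = "https://api.cloudflare.com/client/v4"


def build_dns_update(zone_id: str, record_id: str, record_name: str,
                     new_ip: str, token: str) -> urllib.request.Request:
    """Build a PATCH request that repoints an existing A record to new_ip.

    Hypothetical helper: the request is only constructed here, not sent.
    """
    url = f"{CLOUDFLARE_API}/zones/{zone_id}/dns_records/{record_id}"
    body = json.dumps({"type": "A", "name": record_name, "content": new_ip})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",  # token with Zone:DNS:Edit
            "Content-Type": "application/json",
        },
    )


# Placeholder identifiers and a documentation-range IP address.
req = build_dns_update("ZONE_ID", "RECORD_ID", "example.com",
                       "203.0.113.20", "API_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would perform the update; that call is left out so the sketch has no side effects.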
Recovery
- On subsequent health check cycles, if the primary server is back online (agent reconnected, status healthy, last seen within 120 seconds) and the domain is in `failed_over` state, the system triggers recovery.
- The DNS record is updated back to the primary server's IP address.
- The `failoverStatus` is set back to `active`.
- A recovery event is recorded in the failover events log.
- A `server_recovered` notification is dispatched.
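Taken together, the failover and recovery steps behave like a two-state machine keyed on `failoverStatus`. The sketch below is one way to express that decision, assuming the two status values named above; the function `next_action` and the action labels `"failover"`, `"recover"`, and `"none"` are hypothetical.

```python
def next_action(failover_status: str, primary_down: bool,
                secondary_healthy: bool) -> str:
    """Decide what a single health check cycle should do for one domain."""
    if failover_status == "active" and primary_down:
        # Only switch DNS when there is a healthy secondary to switch to.
        return "failover" if secondary_healthy else "none"
    if failover_status == "failed_over" and not primary_down:
        return "recover"
    return "none"


print(next_action("active", primary_down=True, secondary_healthy=True))        # failover
print(next_action("active", primary_down=True, secondary_healthy=False))       # none
print(next_action("failed_over", primary_down=False, secondary_healthy=True))  # recover
print(next_action("failed_over", primary_down=True, secondary_healthy=True))   # none
```

Checking the current status first is what keeps a domain that is already `failed_over` from being failed over again on every cycle.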
Normal operation:
User -> DNS (A: primary shield IP) -> Primary Shield VPS -> Origin
After failover:
User -> DNS (A: secondary shield IP) -> Secondary Shield VPS -> Origin
After recovery:
User -> DNS (A: primary shield IP) -> Primary Shield VPS -> Origin
Prerequisites
Before setting up failover, you need:
- Two or more shield servers -- Each in a different region or data center for true redundancy. See Multiple Servers for setup instructions.
- A Pro or Enterprise plan -- The Free plan only supports 1 server. See Plans for the full comparison.
- DNS provider integration -- Currently Cloudflare is supported. You need a Cloudflare API token with `Zone:DNS:Edit` permissions.
- Domain assigned to multiple servers -- The domain must be assigned to at least two servers. The first server is treated as the primary; the rest are secondaries.
Setting Up Failover
Step 1: Add Multiple Servers
Add at least two shield servers in the dashboard. For best redundancy, choose servers in different geographic regions and from different VPS providers.
Example setup:
Primary: VPS in Frankfurt (EU)
Secondary: VPS in Ashburn (US)
Install the Lumos Agent on each server. Verify both servers show as Online in Dashboard -> Servers before proceeding.
Step 2: Configure DNS Provider
Navigate to Dashboard -> Settings and configure your Cloudflare API credentials:
- Create a Cloudflare API token at dash.cloudflare.com/profile/api-tokens
- Grant the token `Zone:DNS:Edit` permission for the zone(s) you want failover on
- Enter the token and select the zone in the Lumos Gate settings
This allows Lumos Gate to update DNS records automatically during failover events.
Step 3: Assign Domain to Multiple Servers
When creating or editing a domain, assign it to both your primary and secondary servers. The first server in the list is treated as the primary.
Each assigned server receives the full domain configuration, including origin IPs, SSL settings, WAF rules, and bot protection. The secondary server is not idle -- it has HAProxy running with the correct config, SSL certificates provisioned, and WAF rules active. When failover triggers, traffic switches without any cold-start delay.
Step 4: Point DNS Through Cloudflare
Ensure your domain's DNS is managed through Cloudflare and the A record points to your primary shield server's IP address. See DNS Setup for detailed instructions.
Tip: Set a low TTL (e.g., 300 seconds) on your A record. This ensures DNS resolvers pick up failover changes quickly. A TTL of 300 seconds means most resolvers refresh within 5 minutes.
Failover Behavior
| Event | Action | DNS TTL |
|---|---|---|
| Primary goes offline | DNS updated to secondary IP | Low TTL for fast propagation |
| Primary recovers | DNS updated back to primary IP | Restored to normal TTL |
| Both servers offline | No DNS change (last known good) | Unchanged |
| Secondary goes offline (primary healthy) | No action needed | Unchanged |
Timing
- WebSocket disconnect detection -- When the agent's WebSocket connection drops, the WebSocket server detects it immediately and sends a `server_down` notification. This does not wait for the 5-minute cycle.
- Health check cycle -- Every 5 minutes, the maintenance loop runs the failover check. This is the safety net that catches cases where the connection appears alive but the server is not functioning correctly, or where the server has become stale.
- Health timeout -- A server is considered down if it has not been seen for 120 seconds (2 minutes). This is the health timeout constant in the failover logic.
- DNS propagation -- Depends on your TTL setting. With a 300-second TTL, most resolvers pick up the change within 5 minutes.
- Total failover time -- Typically 5-10 minutes from the moment a server goes down to traffic flowing through the secondary. The worst case is: up to 5 minutes for the maintenance cycle to run, plus DNS propagation time.
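The worst-case figure is simple arithmetic: the wait for the next maintenance cycle plus the record's TTL. A small sketch, using the 5-minute interval stated above (the constant and function names are illustrative):

```python
MAINTENANCE_INTERVAL_S = 300  # 5-minute maintenance cycle, per the text above


def worst_case_failover_seconds(ttl_s: int,
                                interval_s: int = MAINTENANCE_INTERVAL_S) -> int:
    """Upper bound: a full cycle wait plus one full TTL of resolver caching."""
    return interval_s + ttl_s


print(worst_case_failover_seconds(300) / 60)  # 10.0 minutes with a 300 s TTL
print(worst_case_failover_seconds(60) / 60)   # 6.0 minutes with a 60 s TTL
```

This bound assumes resolvers honor the TTL, so it does not cover resolvers that cache longer than the record allows.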
Warning: Some DNS resolvers (especially ISP resolvers) ignore low TTLs and cache aggressively. Even with a 300-second TTL, a small percentage of end users may experience longer switchover times. There is no way to control this from the Lumos Gate side.
Monitoring Failover Events
Failover events are logged and visible in the dashboard. Each event records:
- The domain affected
- The source IP (failed server) and target IP (failover destination)
- The reason (`"Primary server down"` or `"Primary server recovered"`)
- Timestamp of the event
You can also monitor failover activity through notifications -- the failover_triggered and server_recovered alerts give you real-time awareness of all DNS switches.
Notifications
Failover events trigger notifications through your configured delivery channels (email and/or webhook):
| Alert | When it fires | Information included |
|---|---|---|
| `server_down` | Immediately when agent WebSocket disconnects | Server name, server ID |
| `failover_triggered` | When DNS is switched from primary to secondary | Domain name, from IP, to IP |
| `server_recovered` | When primary comes back online and DNS is restored | Domain name, primary IP |
Configure notification delivery in Dashboard -> Settings -> Notifications. At minimum, enable server_down and failover_triggered for full visibility into failover events.
Testing Failover
It is a good practice to test failover before relying on it in production:
- Verify both servers are Online in the dashboard
- Stop the Lumos Agent on your primary server: `systemctl stop lumos-agent`
- Within 5-10 minutes, you should receive a `server_down` notification followed by a `failover_triggered` notification
- Verify DNS now points to the secondary IP: `dig +short yourdomain.com`
- Verify your site is accessible through the secondary server
- Start the agent again: `systemctl start lumos-agent`
- Within 5-10 minutes, you should receive a `server_recovered` notification
- Verify DNS is back to the primary IP
Tip: During testing, keep `dig +short yourdomain.com` running in a loop (`watch -n 5 dig +short yourdomain.com`) to observe the DNS switch in real time.
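If you would rather script the check than watch `dig` output, a polling loop along these lines could work. It is a sketch, not part of Lumos Gate: `resolve_a` and `wait_for_ip` are hypothetical helpers, and because `socket.gethostbyname` uses the system resolver, local caching may delay what it reports compared with `dig`.

```python
import socket
import time
from typing import Callable, Optional


def resolve_a(hostname: str) -> str:
    """Resolve a hostname to one IPv4 address using the system resolver."""
    return socket.gethostbyname(hostname)


def wait_for_ip(hostname: str, expected_ip: str,
                resolve: Callable[[str], str] = resolve_a,
                attempts: int = 120, delay_s: float = 5.0,
                sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """Poll until hostname resolves to expected_ip.

    Returns the attempt number on success, or None if attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            current = resolve(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            current = None
        if current == expected_ip:
            return attempt
        sleep(delay_s)
    return None
```

For example, `wait_for_ip("yourdomain.com", "203.0.113.20")` would poll every 5 seconds for up to 10 minutes, where `203.0.113.20` stands in for your secondary server's IP. The `resolve` and `sleep` parameters exist so the loop can be exercised without real DNS lookups or delays.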
Limitations
- DNS propagation delay -- Even with low TTLs, some DNS resolvers cache aggressively. End users with cached DNS entries may experience downtime until their resolver refreshes.
- Single DNS provider -- Currently only Cloudflare is supported as a DNS provider for automated failover. Additional providers (GoDaddy, Namecheap, Route53) are planned for future releases.
- Health check interval -- The 5-minute maintenance cycle means there can be up to 5 minutes between a server failure and the failover being triggered. Immediate WebSocket disconnection detection helps reduce the notification delay, but the DNS switch itself only happens during the maintenance cycle.
- No weighted routing -- Failover is all-or-nothing. There is no gradual traffic shifting or weighted DNS. Traffic goes entirely to the primary or entirely to the secondary.
- Single secondary selection -- When multiple secondaries are available, the system selects the first healthy one it finds. There is no preference ordering among secondaries.
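The secondary selection rule amounts to "first healthy server after the primary". A minimal sketch, assuming the assigned servers are kept in order with the primary first; the `Server` type and `pick_secondary` function are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Server:
    name: str
    ip: str
    healthy: bool


def pick_secondary(servers: Sequence[Server]) -> Optional[Server]:
    """servers[0] is the primary; return the first healthy server after it."""
    for server in servers[1:]:
        if server.healthy:
            return server
    return None


servers = [
    Server("primary", "198.51.100.10", healthy=False),
    Server("secondary-a", "203.0.113.20", healthy=False),
    Server("secondary-b", "203.0.113.30", healthy=True),
]
print(pick_secondary(servers).name)  # secondary-b
```

When no secondary is healthy the function returns `None`, which corresponds to the "no DNS change" row in the Failover Behavior table.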
Troubleshooting
| Problem | Possible cause | Solution |
|---|---|---|
| Failover never triggers | DNS provider not configured | Check Settings for Cloudflare credentials |
| Failover never triggers | Domain assigned to only 1 server | Assign the domain to at least 2 servers |
| DNS not updating | Cloudflare API token lacks permissions | Ensure token has Zone:DNS:Edit on the correct zone |
| Slow DNS propagation | High TTL value | Lower your A record TTL to 300 seconds |
| Recovery not happening | Primary agent not reconnecting | Check agent logs on the primary: journalctl -u lumos-agent |
| Repeated failover/recovery | Server flapping (unstable network) | Investigate network stability on the primary VPS |
For more debugging steps, see Troubleshooting.
Next Steps
- Multiple Servers -- Set up multi-region shield servers for redundancy
- DNS Setup -- Configure DNS records for your domains
- Notifications -- Configure email and webhook alerts for failover events
- SSL/TLS -- Understand how SSL certificates work across failover
- Troubleshooting -- Common failover issues and fixes