Health-check-driven DNS failover with automatic primary/secondary switching and recovery detection. Cloudflare integration for zero-downtime routing.

DNS Failover

DNS failover ensures your sites stay online even when a shield server goes down. When Lumos Gate detects that a primary server is unreachable, it automatically updates DNS records to route traffic to a secondary server. When the primary recovers, traffic is routed back -- all without manual intervention.

This feature requires at least two servers assigned to a domain. See Multiple Servers for setup instructions.

How It Works

The WebSocket server runs a maintenance loop every 5 minutes. Each cycle runs a health check that covers two things: DNS failover and SSL certificate expiry. The failover check follows this lifecycle:

Detection

The health check loads all domains that have failover configured (i.e., assigned to multiple servers with a DNS provider integration).
For each domain, the function evaluates the primary server's health using two criteria:
- Status: Is the server status offline or error?
- Last seen: Has it been more than 120 seconds (the health timeout constant) since the server was last seen?
If either condition is true, the primary is considered down.
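
The two criteria above can be sketched as a single predicate. This is an illustrative sketch, not Lumos Gate's actual code: the function name, the `HEALTH_TIMEOUT` and `DOWN_STATUSES` names, and the `"online"` status string are assumptions made for the example. Only the 120-second timeout and the offline/error statuses come from the description above.

```python
from datetime import datetime, timedelta, timezone

HEALTH_TIMEOUT = timedelta(seconds=120)  # health timeout from the text
DOWN_STATUSES = {"offline", "error"}     # statuses that count as unhealthy


def is_primary_down(status: str, last_seen: datetime, now: datetime) -> bool:
    """Return True if either detection criterion marks the primary as down."""
    stale = now - last_seen > HEALTH_TIMEOUT
    return status in DOWN_STATUSES or stale


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_primary_down("online", now - timedelta(seconds=30), now))   # False
print(is_primary_down("online", now - timedelta(seconds=121), now))  # True
print(is_primary_down("error", now - timedelta(seconds=5), now))     # True
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.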

Failover

When the primary is down and the domain is not already in failed_over state, the system looks for a healthy secondary server.
The DNS record is updated via the Cloudflare API to point to the secondary server's IP address.
The failoverStatus for the domain is set to failed_over in the database.
A failover event is recorded in the failover events log.
A failover_triggered notification is dispatched to the account owner.

Recovery

On subsequent health check cycles, if the primary server is back online (agent reconnected, status healthy, last seen within 120 seconds) and the domain is in failed_over state, the system triggers recovery.
The DNS record is updated back to the primary server's IP address.
The failoverStatus is set back to active.
A recovery event is recorded in the failover events log.
A server_recovered notification is dispatched.

Normal operation:
  User -> DNS (A: primary shield IP) -> Primary Shield VPS -> Origin

After failover:
  User -> DNS (A: secondary shield IP) -> Secondary Shield VPS -> Origin

After recovery:
  User -> DNS (A: primary shield IP) -> Primary Shield VPS -> Origin
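
The lifecycle above amounts to a small state machine over the domain's failoverStatus. The sketch below shows one way the per-cycle decision could be expressed. The `Server` class and `plan_dns_change` function are hypothetical names for illustration, the IP addresses are documentation-range placeholders, and the real implementation may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Server:
    ip: str
    healthy: bool


def plan_dns_change(failover_status: str, primary: Server,
                    secondaries: List[Server]) -> Tuple[str, Optional[str]]:
    """Return (new failover status, IP to write to DNS or None for no change)."""
    if not primary.healthy and failover_status != "failed_over":
        # Failover: use the first healthy secondary, if there is one.
        target = next((s for s in secondaries if s.healthy), None)
        if target is None:
            return failover_status, None  # all servers down: leave DNS as is
        return "failed_over", target.ip
    if primary.healthy and failover_status == "failed_over":
        # Recovery: route traffic back to the primary.
        return "active", primary.ip
    return failover_status, None  # nothing to do in this cycle


primary = Server("203.0.113.10", healthy=False)
secondaries = [Server("198.51.100.20", healthy=True)]
print(plan_dns_change("active", primary, secondaries))
# ('failed_over', '198.51.100.20')
```

Returning the decision instead of performing the DNS update directly separates the logic from the side effects (API call, database write, notification), which makes each branch straightforward to verify.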

Prerequisites

Before setting up failover, you need:

Two or more shield servers -- Each in a different region or data center for true redundancy. See Multiple Servers for setup instructions.
A Pro or Enterprise plan -- The Free plan supports only one server. See Plans for the full comparison.
DNS provider integration -- Currently Cloudflare is supported. You need a Cloudflare API token with Zone:DNS:Edit permissions.
Domain assigned to multiple servers -- The domain must be assigned to at least two servers. The first server is treated as the primary; the rest are secondaries.

Setting Up Failover

Step 1: Add Multiple Servers

Add at least two shield servers in the dashboard. For best redundancy, choose servers in different geographic regions and from different VPS providers.

Example setup:
  Primary:   VPS in Frankfurt (EU)
  Secondary: VPS in Ashburn (US)

Install the Lumos Agent on each server. Verify both servers show as Online in Dashboard -> Servers before proceeding.

Step 2: Configure DNS Provider

Navigate to Dashboard -> Settings and configure your Cloudflare API credentials:

Create a Cloudflare API token at dash.cloudflare.com/profile/api-tokens
Grant the token Zone:DNS:Edit permission for the zone(s) you want failover on
Enter the token and select the zone in the Lumos Gate settings

This allows Lumos Gate to update DNS records automatically during failover events.
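
For orientation, here is roughly what a DNS record update looks like against Cloudflare's public v4 API, which is the kind of call the token authorizes. This is a sketch only and not Lumos Gate's internal code: the `build_dns_update` helper is hypothetical, and the token, zone ID, and record ID are placeholders you would obtain from your own Cloudflare account. The function builds the request without sending it.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_dns_update(token: str, zone_id: str, record_id: str,
                     name: str, ip: str, ttl: int = 300) -> urllib.request.Request:
    """Build (but do not send) a request that repoints an A record to `ip`."""
    body = json.dumps({"type": "A", "name": name, "content": ip, "ttl": ttl})
    return urllib.request.Request(
        url=f"{API_BASE}/zones/{zone_id}/dns_records/{record_id}",
        data=body.encode("utf-8"),
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_dns_update("YOUR_API_TOKEN", "ZONE_ID", "RECORD_ID",
                       "example.com", "198.51.100.20")
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`. A token without DNS edit permission on the zone is rejected by Cloudflare, which is why the permission in the steps above matters.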

Step 3: Assign Domain to Multiple Servers

When creating or editing a domain, assign it to both your primary and secondary servers. The first server in the list is treated as the primary.

Each assigned server receives the full domain configuration, including origin IPs, SSL settings, WAF rules, and bot protection. The secondary server is not idle -- it has HAProxy running with the correct config, SSL certificates provisioned, and WAF rules active. When failover triggers, traffic switches without any cold-start delay.

Step 4: Point DNS Through Cloudflare

Ensure your domain's DNS is managed through Cloudflare and the A record points to your primary shield server's IP address. See DNS Setup for detailed instructions.

Tip: Set a low TTL (e.g., 300 seconds) on your A record. This helps DNS resolvers pick up failover changes quickly: a resolver that honors a 300-second TTL refreshes its cached answer within 5 minutes.

Failover Behavior

Event	Action	DNS TTL
Primary goes offline	DNS updated to secondary IP	Low TTL for fast propagation
Primary recovers	DNS updated back to primary IP	Restored to normal TTL
Both servers offline	No DNS change (last known good)	Unchanged
Secondary goes offline (primary healthy)	No action needed	Unchanged

Timing

WebSocket disconnect detection -- When the agent's WebSocket connection drops, the WebSocket server detects it immediately and sends a server_down notification. This does not wait for the 5-minute cycle.
Health check cycle -- Every 5 minutes, the maintenance loop runs the failover check. This is the safety net that catches cases where the connection appears alive but the server is not functioning correctly, or where the server has become stale.
Health timeout -- A server is considered down if it has not been seen for 120 seconds (2 minutes). This is the health timeout constant in the failover logic.
DNS propagation -- Depends on your TTL setting. With a 300-second TTL, most resolvers pick up the change within 5 minutes.
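Total failover time -- Typically 5-10 minutes from the moment a server goes down to traffic flowing through the secondary. For resolvers that honor the TTL, the worst case is up to 5 minutes waiting for the next maintenance cycle plus up to one TTL of DNS caching (300 seconds in the example above), or about 10 minutes in total.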
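
The worst-case figure is simple arithmetic, sketched below so you can plug in your own TTL. The function and constant names are illustrative. The formula follows the description above (one full maintenance cycle plus one TTL) and applies only to resolvers that honor the TTL.

```python
CYCLE_SECONDS = 5 * 60  # maintenance loop interval
TTL_SECONDS = 300       # A record TTL used as the example on this page


def worst_case_failover_seconds(cycle: int = CYCLE_SECONDS,
                                ttl: int = TTL_SECONDS) -> int:
    """Upper bound: wait for the next cycle, then for cached answers to expire."""
    return cycle + ttl


print(worst_case_failover_seconds() / 60)        # 10.0 (minutes)
print(worst_case_failover_seconds(ttl=60) / 60)  # 6.0 (minutes)
```

Lowering the TTL shortens only the second term. The maintenance cycle sets a floor that DNS settings cannot reduce.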

Warning: Some DNS resolvers (especially ISP resolvers) ignore low TTLs and cache aggressively. Even with a 300-second TTL, a small percentage of end users may experience longer switchover times. There is no way to control this from the Lumos Gate side.

Monitoring Failover Events

Failover events are logged and visible in the dashboard. Each event records:

The domain affected
The source IP (failed server) and target IP (failover destination)
The reason ("Primary server down" or "Primary server recovered")
Timestamp of the event

You can also monitor failover activity through notifications -- the failover_triggered and server_recovered alerts give you real-time awareness of all DNS switches.
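
If you export or mirror these events into your own tooling, a record with the four fields listed above is enough to spot patterns such as repeated switching. The sketch below is hypothetical: the `FailoverEvent` class, its field names, and the `count_recent_switches` helper are assumptions for illustration and do not describe the dashboard's actual storage format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable


@dataclass(frozen=True)
class FailoverEvent:
    domain: str
    source_ip: str  # server that traffic moved away from
    target_ip: str  # server that traffic moved to
    reason: str     # "Primary server down" or "Primary server recovered"
    timestamp: datetime


def count_recent_switches(events: Iterable[FailoverEvent], domain: str,
                          since: datetime) -> int:
    """Count DNS switches for one domain at or after `since`."""
    return sum(1 for e in events if e.domain == domain and e.timestamp >= since)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    FailoverEvent("example.com", "203.0.113.10", "198.51.100.20",
                  "Primary server down", now - timedelta(minutes=50)),
    FailoverEvent("example.com", "198.51.100.20", "203.0.113.10",
                  "Primary server recovered", now - timedelta(minutes=40)),
    FailoverEvent("example.com", "203.0.113.10", "198.51.100.20",
                  "Primary server down", now - timedelta(minutes=10)),
]
print(count_recent_switches(events, "example.com", now - timedelta(hours=1)))  # 3
```

Several switches for the same domain within a short window usually point to a flapping primary rather than a clean outage.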

Notifications

Failover events trigger notifications through your configured delivery channels (email and/or webhook):

Alert	When it fires	Information included
`server_down`	Immediately when agent WebSocket disconnects	Server name, server ID
`failover_triggered`	When DNS is switched from primary to secondary	Domain name, from IP, to IP
`server_recovered`	When primary comes back online and DNS is restored	Domain name, primary IP

Configure notification delivery in Dashboard -> Settings -> Notifications. At minimum, enable server_down and failover_triggered for full visibility into failover events.

Testing Failover

Test failover before relying on it in production:

Verify both servers are Online in the dashboard
Stop the Lumos Agent on your primary server: systemctl stop lumos-agent
You should receive a server_down notification almost immediately, followed by a failover_triggered notification within about 5 minutes (at the next maintenance cycle)
Verify DNS now points to the secondary IP: dig +short yourdomain.com
Verify your site is accessible through the secondary server
Start the agent again: systemctl start lumos-agent
Within about 5 minutes (at the next maintenance cycle), you should receive a server_recovered notification
Verify DNS is back to the primary IP, allowing up to one TTL for cached answers to expire

Tip: During testing, keep dig +short yourdomain.com running in a loop (watch -n 5 dig +short yourdomain.com) to observe the DNS switch in real time.
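
If you prefer a script that logs only the moments when the answer changes, the following sketch polls the system resolver and reports each change. The `watch_dns` function is a hypothetical helper, not part of Lumos Gate. It uses the operating system's resolver, so local caching may make it lag slightly behind what dig shows.

```python
import socket
import time
from typing import Callable, Iterator, Tuple


def watch_dns(hostname: str, checks: int, interval: float = 5.0,
              resolve: Callable[[str], str] = socket.gethostbyname,
              sleep: Callable[[float], None] = time.sleep
              ) -> Iterator[Tuple[int, str]]:
    """Yield (check number, IP) each time the resolved address changes."""
    previous = None
    for i in range(checks):
        ip = resolve(hostname)
        if ip != previous:
            yield i, ip
            previous = ip
        if i < checks - 1:
            sleep(interval)  # no need to wait after the final check


# Example (performs real DNS lookups for about 10 minutes):
# for check, ip in watch_dns("yourdomain.com", checks=120):
#     print(f"check {check}: now resolving to {ip}")
```

The `resolve` and `sleep` parameters exist so the loop can be exercised without network access or real delays. Note that `socket.gethostbyname` raises `socket.gaierror` if the name cannot be resolved.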

Limitations

DNS propagation delay -- Even with low TTLs, some DNS resolvers cache aggressively. End users with cached DNS entries may experience downtime until their resolver refreshes.
Single DNS provider -- Currently only Cloudflare is supported as a DNS provider for automated failover. Additional providers (GoDaddy, Namecheap, Route 53) are planned for future releases.
Health check interval -- The 5-minute maintenance cycle means there can be up to 5 minutes between a server failure and the failover being triggered. Immediate WebSocket disconnection detection helps reduce the notification delay, but the DNS switch itself only happens during the maintenance cycle.
No weighted routing -- Failover is all-or-nothing. There is no gradual traffic shifting or weighted DNS. Traffic goes entirely to the primary or entirely to the secondary.
Single secondary selection -- When multiple secondaries are available, the system selects the first healthy one it finds. There is no preference ordering among secondaries.

Troubleshooting

Problem	Possible cause	Solution
Failover never triggers	DNS provider not configured	Check Settings for Cloudflare credentials
Failover never triggers	Domain assigned to only 1 server	Assign the domain to at least 2 servers
DNS not updating	Cloudflare API token lacks permissions	Ensure token has `Zone:DNS:Edit` on the correct zone
Slow DNS propagation	High TTL value	Lower your A record TTL to 300 seconds
Recovery not happening	Primary agent not reconnecting	Check agent logs on the primary: `journalctl -u lumos-agent`
Repeated failover/recovery	Server flapping (unstable network)	Investigate network stability on the primary VPS

For more debugging steps, see Troubleshooting.

Next Steps

Multiple Servers -- Set up multi-region shield servers for redundancy
DNS Setup -- Configure DNS records for your domains
Notifications -- Configure email and webhook alerts for failover events
SSL/TLS -- Understand how SSL certificates work across failover
Troubleshooting -- Common failover issues and fixes
