Lumos Gate Docs

DNS Failover

Health-check-driven DNS failover with automatic primary/secondary switching and recovery detection. Cloudflare integration for zero-downtime routing.

DNS failover ensures your sites stay online even when a shield server goes down. When Lumos Gate detects that a primary server is unreachable, it automatically updates DNS records to route traffic to a secondary server. When the primary recovers, traffic is routed back -- all without manual intervention.

This feature requires at least two servers assigned to a domain. See Multiple Servers for setup instructions.

How It Works

The WebSocket server runs a maintenance loop every 5 minutes. Each cycle triggers a health check that does two things: it evaluates DNS failover and it checks SSL certificate expiry. The failover check follows this lifecycle:

Detection

  1. The health check loads all domains that have failover configured (i.e., assigned to multiple servers with a DNS provider integration).
  2. For each domain, the function evaluates the primary server's health using two criteria:
    • Status: Is the server status offline or error?
    • Last seen: Has the server been unseen for more than 120 seconds (the health timeout constant)?
  3. If either condition is true, the primary is considered down.
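
The detection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the Lumos Gate implementation: the `Server` class, the `is_primary_down` function, and the constant names are hypothetical, and only the two criteria (status and the 120-second timeout) come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Values taken from the documentation; the names are assumptions.
HEALTH_TIMEOUT = timedelta(seconds=120)
DOWN_STATUSES = {"offline", "error"}


@dataclass
class Server:
    status: str          # e.g. "online", "offline", "error"
    last_seen: datetime  # last time the agent was seen


def is_primary_down(server: Server, now: datetime) -> bool:
    """Return True if either detection criterion is met."""
    bad_status = server.status in DOWN_STATUSES
    stale = (now - server.last_seen) > HEALTH_TIMEOUT
    return bad_status or stale


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
healthy = Server("online", now - timedelta(seconds=30))
stale = Server("online", now - timedelta(seconds=121))
```

Here `is_primary_down(healthy, now)` is false and `is_primary_down(stale, now)` is true, even though both servers report an online status.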

Failover

  1. When the primary is down and the domain is not already in failed_over state, the system looks for a healthy secondary server.
  2. The DNS record is updated via the Cloudflare API to point to the secondary server's IP address.
  3. The failoverStatus for the domain is set to failed_over in the database.
  4. A failover event is recorded in the failover events log.
  5. A failover_triggered notification is dispatched to the account owner.
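
For orientation, step 2 corresponds to a call to Cloudflare's DNS record update endpoint. The sketch below only builds the HTTP request so it can be inspected without touching a real zone. It is an assumption about how such a call could look, not Lumos Gate's actual code: the function name is hypothetical, and the zone ID, record ID, and token are placeholders you would take from your own Cloudflare account.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_a_record_update(zone_id, record_id, name, new_ip, token, ttl=300):
    """Build (but do not send) a PATCH request that repoints an A record."""
    body = json.dumps(
        {"type": "A", "name": name, "content": new_ip, "ttl": ttl}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/zones/{zone_id}/dns_records/{record_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder identifiers and documentation-range IP addresses.
request = build_a_record_update(
    "ZONE_ID", "RECORD_ID", "example.com", "203.0.113.20", "API_TOKEN"
)
```

Passing `request` to `urllib.request.urlopen` would send it. Recovery is the same call with the primary server's IP address as `new_ip`.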

Recovery

  1. On subsequent health check cycles, if the primary server is back online (agent reconnected, status healthy, last seen within 120 seconds) and the domain is in failed_over state, the system triggers recovery.
  2. The DNS record is updated back to the primary server's IP address.
  3. The failoverStatus is set back to active.
  4. A recovery event is recorded in the failover events log.
  5. A server_recovered notification is dispatched.

Normal operation:
  User -> DNS (A: primary shield IP) -> Primary Shield VPS -> Origin

After failover:
  User -> DNS (A: secondary shield IP) -> Secondary Shield VPS -> Origin

After recovery:
  User -> DNS (A: primary shield IP) -> Primary Shield VPS -> Origin

Prerequisites

Before setting up failover, you need:

  • Two or more shield servers -- Each in a different region or data center for true redundancy. See Multiple Servers for setup instructions.
  • A Pro or Enterprise plan -- The Free plan only supports 1 server. See Plans for the full comparison.
  • DNS provider integration -- Currently Cloudflare is supported. You need a Cloudflare API token with Zone:DNS:Edit permissions.
  • Domain assigned to multiple servers -- The domain must be assigned to at least two servers. The first server is treated as the primary; the rest are secondaries.

Setting Up Failover

Step 1: Add Multiple Servers

Add at least two shield servers in the dashboard. For best redundancy, choose servers in different geographic regions and from different VPS providers.

Example setup:
  Primary:   VPS in Frankfurt (EU)
  Secondary: VPS in Ashburn (US)

Install the Lumos Agent on each server. Verify both servers show as Online in Dashboard -> Servers before proceeding.

Step 2: Configure DNS Provider

Navigate to Dashboard -> Settings and configure your Cloudflare API credentials:

  1. Create a Cloudflare API token at dash.cloudflare.com/profile/api-tokens
  2. Grant the token Zone:DNS:Edit permission for the zone(s) you want failover on
  3. Enter the token and select the zone in the Lumos Gate settings

This allows Lumos Gate to update DNS records automatically during failover events.
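
Before entering the token, it can help to confirm that Cloudflare accepts it. The following sketch uses Cloudflare's token verification endpoint. The helper names are hypothetical, and the response shape checked here (`success` plus `result.status` equal to `active`) is an assumption based on Cloudflare's published API. Verification only shows that the token is valid; it does not prove the token has Zone:DNS:Edit on the right zone.

```python
import json
import urllib.request

VERIFY_URL = "https://api.cloudflare.com/client/v4/user/tokens/verify"


def build_verify_request(token):
    """Build (but do not send) a GET request to the token verify endpoint."""
    return urllib.request.Request(
        VERIFY_URL, headers={"Authorization": f"Bearer {token}"}
    )


def token_is_active(response_body: bytes) -> bool:
    """Interpret the JSON body returned by the verify endpoint."""
    payload = json.loads(response_body)
    result = payload.get("result") or {}
    return bool(payload.get("success")) and result.get("status") == "active"


# To run against the real API (requires network access and a real token):
# with urllib.request.urlopen(build_verify_request("API_TOKEN")) as resp:
#     print(token_is_active(resp.read()))
```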

Step 3: Assign Domain to Multiple Servers

When creating or editing a domain, assign it to both your primary and secondary servers. The first server in the list is treated as the primary.

Each assigned server receives the full domain configuration, including origin IPs, SSL settings, WAF rules, and bot protection. The secondary server is not idle -- it has HAProxy running with the correct config, SSL certificates provisioned, and WAF rules active. When failover triggers, traffic switches without any cold-start delay.

Step 4: Point DNS Through Cloudflare

Ensure your domain's DNS is managed through Cloudflare and the A record points to your primary shield server's IP address. See DNS Setup for detailed instructions.

Tip: Set a low TTL (e.g., 300 seconds) on your A record so that DNS resolvers pick up failover changes quickly. With a TTL of 300 seconds, resolvers that honor the TTL refresh within 5 minutes.

Failover Behavior

| Event | Action | DNS TTL |
|---|---|---|
| Primary goes offline | DNS updated to secondary IP | Low TTL for fast propagation |
| Primary recovers | DNS updated back to primary IP | Restored to normal TTL |
| Both servers offline | No DNS change (last known good) | Unchanged |
| Secondary goes offline (primary healthy) | No action needed | Unchanged |
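
The four cases above reduce to a small decision function. The sketch below is illustrative only: the `Action` enum and `decide` function are hypothetical names, while the `failed_over` and `active` status values come from the lifecycle described earlier.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    FAIL_OVER = "fail_over"
    RECOVER = "recover"


def decide(primary_healthy: bool, any_secondary_healthy: bool,
           failover_status: str) -> Action:
    """Map server health and the stored failover status to a DNS action."""
    if not primary_healthy and failover_status != "failed_over":
        # Switch only if there is a healthy secondary to switch to;
        # otherwise keep the last known good record.
        return Action.FAIL_OVER if any_secondary_healthy else Action.NONE
    if primary_healthy and failover_status == "failed_over":
        return Action.RECOVER
    return Action.NONE
```

For example, `decide(False, True, "active")` yields `Action.FAIL_OVER`, while `decide(False, False, "active")` yields `Action.NONE` because no healthy secondary exists.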

Timing

  • WebSocket disconnect detection -- When the agent's WebSocket connection drops, the WebSocket server detects it immediately and sends a server_down notification. This does not wait for the 5-minute cycle.
  • Health check cycle -- Every 5 minutes, the maintenance loop runs the failover check. This is the safety net that catches cases where the connection appears alive but the server is not functioning correctly, or where the server has become stale.
  • Health timeout -- A server is considered down if it has not been seen for 120 seconds (2 minutes). This is the health timeout constant in the failover logic.
  • DNS propagation -- Depends on your TTL setting. With a 300-second TTL, most resolvers pick up the change within 5 minutes.
  • Total failover time -- Typically 5-10 minutes from the moment a server goes down to traffic flowing through the secondary. In the worst case, the maintenance cycle takes up to 5 minutes to run, and DNS propagation time is added on top of that.
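
The timing figures above can be combined into a rough worst-case estimate. This is back-of-the-envelope arithmetic under stated assumptions, not a guarantee: it assumes resolvers honor the TTL, and the `silent_failure` case (connection looks alive, so the 120-second health timeout must elapse before a cycle can see the server as down) is one reading of how the numbers interact. All names are hypothetical.

```python
CYCLE_SECONDS = 5 * 60        # maintenance loop interval
HEALTH_TIMEOUT_SECONDS = 120  # "last seen" threshold


def worst_case_failover_seconds(ttl_seconds: int,
                                silent_failure: bool = False) -> int:
    """Upper bound: detection delay + one full cycle + one full TTL."""
    detection = HEALTH_TIMEOUT_SECONDS if silent_failure else 0
    return detection + CYCLE_SECONDS + ttl_seconds


clean_disconnect = worst_case_failover_seconds(300)      # 600 s = 10 min
stale_server = worst_case_failover_seconds(300, True)    # 720 s = 12 min
```

Lowering the TTL is the only term you control directly: at a 60-second TTL the clean-disconnect bound drops to 360 seconds.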

Warning: Some DNS resolvers (especially ISP resolvers) ignore low TTLs and cache aggressively. Even with a 300-second TTL, a small percentage of end users may experience longer switchover times. There is no way to control this from the Lumos Gate side.

Monitoring Failover Events

Failover events are logged and visible in the dashboard. Each event records:

  • The domain affected
  • The source IP (failed server) and target IP (failover destination)
  • The reason ("Primary server down" or "Primary server recovered")
  • Timestamp of the event

You can also monitor failover activity through notifications -- the failover_triggered and server_recovered alerts give you real-time awareness of all DNS switches.

Notifications

Failover events trigger notifications through your configured delivery channels (email and/or webhook):

| Alert | When it fires | Information included |
|---|---|---|
| server_down | Immediately when agent WebSocket disconnects | Server name, server ID |
| failover_triggered | When DNS is switched from primary to secondary | Domain name, from IP, to IP |
| server_recovered | When primary comes back online and DNS is restored | Domain name, primary IP |

Configure notification delivery in Dashboard -> Settings -> Notifications. At minimum, enable server_down and failover_triggered for full visibility into failover events.

Testing Failover

Test failover before relying on it in production:

  1. Verify both servers are Online in the dashboard
  2. Stop the Lumos Agent on your primary server: systemctl stop lumos-agent
  3. You should receive a server_down notification almost immediately, followed by a failover_triggered notification once the next maintenance cycle runs (within about 5 minutes)
  4. Verify DNS now points to the secondary IP: dig +short yourdomain.com
  5. Verify your site is accessible through the secondary server
  6. Start the agent again: systemctl start lumos-agent
  7. Within 5-10 minutes, you should receive a server_recovered notification
  8. Verify DNS is back to the primary IP

Tip: During testing, keep dig +short yourdomain.com running in a loop (watch -n 5 dig +short yourdomain.com) to observe the DNS switch in real time.
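
If you prefer a scripted check over watching dig, the polling loop can be sketched in Python. This is a minimal example under assumptions: `wait_for_ip` is a hypothetical helper, and `socket.gethostbyname` returns a single IPv4 address from the system resolver, which may itself cache answers. The resolver, sleep, and clock are injectable so the loop can be exercised without real DNS.

```python
import socket
import time


def wait_for_ip(hostname, expected_ip, timeout=900.0, interval=5.0,
                resolve=socket.gethostbyname, sleep=time.sleep,
                clock=time.monotonic):
    """Poll DNS until hostname resolves to expected_ip or timeout expires."""
    deadline = clock() + timeout
    while True:
        try:
            current = resolve(hostname)
        except OSError:
            # Resolution failures (socket.gaierror) count as "not yet".
            current = None
        if current == expected_ip:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Example with placeholder values:
# wait_for_ip("yourdomain.com", "203.0.113.20")
```

The 900-second default timeout is an arbitrary choice that leaves room for one maintenance cycle plus a 300-second TTL.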

Limitations

  • DNS propagation delay -- Even with low TTLs, some DNS resolvers cache aggressively. End users with cached DNS entries may experience downtime until their resolver refreshes.
  • Single DNS provider -- Currently only Cloudflare is supported as a DNS provider for automated failover. Additional providers (GoDaddy, Namecheap, Route53) are planned for future releases.
  • Health check interval -- The 5-minute maintenance cycle means there can be up to 5 minutes between a server failure and the failover being triggered. Immediate WebSocket disconnection detection helps reduce the notification delay, but the DNS switch itself only happens during the maintenance cycle.
  • No weighted routing -- Failover is all-or-nothing. There is no gradual traffic shifting or weighted DNS. Traffic goes entirely to the primary or entirely to the secondary.
  • Single secondary selection -- When multiple secondaries are available, the system selects the first healthy one it finds. There is no preference ordering among secondaries.

Troubleshooting

| Problem | Possible cause | Solution |
|---|---|---|
| Failover never triggers | DNS provider not configured | Check Settings for Cloudflare credentials |
| Failover never triggers | Domain assigned to only 1 server | Assign the domain to at least 2 servers |
| DNS not updating | Cloudflare API token lacks permissions | Ensure token has Zone:DNS:Edit on the correct zone |
| Slow DNS propagation | High TTL value | Lower your A record TTL to 300 seconds |
| Recovery not happening | Primary agent not reconnecting | Check agent logs on the primary: journalctl -u lumos-agent |
| Repeated failover/recovery | Server flapping (unstable network) | Investigate network stability on the primary VPS |

For more debugging steps, see Troubleshooting.

Next Steps

  • Multiple Servers -- Set up multi-region shield servers for redundancy
  • DNS Setup -- Configure DNS records for your domains
  • Notifications -- Configure email and webhook alerts for failover events
  • SSL/TLS -- Understand how SSL certificates work across failover
  • Troubleshooting -- Common failover issues and fixes