App Availability and Resiliency

A resilient app is resistant to events like hardware failures or outages. Redundancy, supported by monitoring and scaling, is one of the ways to make your app more resilient, and we provide some built-in features to make this easier.

First, it’s important to understand what your app needs to take care of.

What your app needs to do

Since Fly Proxy does load balancing by default, you need to consider how your app handles:

  • session storage, so that users stay logged in when their session moves to a different server
  • persistent file storage and replication, as discussed in the following paragraphs

A note about Fly Proxy: Fly Proxy does a lot of work in the background, applying handlers, managing traffic and load balancing, and managing connections to services. Fly Proxy also automatically starts and stops Machines based on traffic to your app. The proxy works for process groups with services defined.

For databases, you’ll need to follow the specific recommendations for the database you’re using, including how to set up clusters for replication and failover.

If you’re deploying Fly Postgres, our un-managed database app, then you need to manage backups, recovery, replication, and scaling yourself. For information about high availability with Fly Postgres, refer to High Availability & Global Replication.

When you use Fly Volumes with your app, your app needs to implement replication and backups. Volumes are block devices that live on a single disk array and get mounted directly by your Machine. There’s no magic. And there’s no built-in replication between volumes. If your app provides a clustering mechanism with data replication, like most databases do, we recommend you take advantage of that and run multiple instances with attached volumes. When possible, we place your app’s volumes in different hardware zones within a region to mitigate hardware failures.

This brings us to disaster recovery and being prepared. You should have backups of your data and understand how to recover that data if needed. Fly.io takes daily snapshots of volumes and stores them for 5 days. You can restore a volume’s data from one of these daily snapshots. If your use case requires more frequent backups, then you’ll need to set up tooling and processes to backup your volumes on the required schedule.

Fly.io features for app resiliency

The Fly Platform has features to help make your app more resilient in case of outages or hardware failures.

The features for redundancy and availability are:

These features require or depend on your app configuration in the fly.toml file, including process groups. Every Fly App has at least one process group. An app with no extra process groups defined in fly.toml still uses the app process group by default. Learn more about app configuration.

Redundancy by default on first deploy

Technically, the title of this section should be “Redundancy by default on first deploy or after scaling down to zero Machines”. When you deploy your app for the first time (or after scaling down to zero Machines), you get some default redundancy configurations. These settings help you to deploy a recommended setup without having to think about it. If the defaults aren’t what you’re looking for, then you can still configure your app and processes the way you want. Learn more about scaling Machine CPU and RAM and scaling the number of Machines.

Here’s what Fly Launch (the fly launch or fly deploy command) does, based on your app configuration:

  • creates and starts two Machines for process groups with services, sets automatic start and stop to true and minimum machines running to zero
  • creates and starts one Machine and creates one stopped standby Machine for process groups without services
  • creates and starts only one Machine if the process group has volumes mounted, because there’s no built-in replication between volumes

Two Machines for process groups with services

The most basic way to improve resiliency is to create more than one Machine per process group, and the fly launch and fly deploy commands do this by default for process groups with services. A process group with services has mappings to the external internet or to a private (Flycast) network, so that Fly Proxy can manage its connections and operation.

If a process group has services defined, two Machines are automatically created and started when:

  • you deploy an app for the first time using fly launch or fly deploy,
  • you redeploy an app using fly deploy after scaling down to zero, or
  • you add a new process group with services in the fly.toml file and then run fly deploy

Standby Machines for process groups without services

Standby Machines provide low-effort redundancy for process groups with no services defined. A process group with no services has no mappings to the external internet or to a private (Flycast) network, so Fly Proxy can’t manage its connections and operation.

A standby Machine is a stopped Machine that’s essentially paired to and watching a running Machine. The standby starts only if the paired Machine becomes unavailable.

If a process group doesn’t have services defined, then a standby Machine is automatically created, along with a running Machine, when:

  • you deploy an app for the first time using fly launch or fly deploy,
  • you redeploy an app using fly deploy after scaling down to zero, or
  • you add a new process group in fly.toml and then run fly deploy

The standby Machine doesn’t consume resources or start up until it’s needed.

Remove a standby Machine

When an app or process group has one Machine and one standby Machine, the standby Machine is destroyed first if you scale down to one Machine. The standby Machine isn’t subsequently recreated when you scale up or back down using fly scale count.

If you add services to the process group in fly.toml, then the standby designation is removed from the standby Machine on the next deploy.

Create a standby Machine

You can recreate the standby Machine if you scale down to 0 and then run fly deploy, which will create one Machine and one standby Machine again.

At a lower level, you can also use the fly machines update command with the --standby-for option to create a standby configuration using two existing Machines.

Turn off redundancy on deploy

What if you don’t want to keep Machines running all the time for an app with a small or variable workload? The automatic start and stop feature is enabled by default for Machines and makes it possible to run at least two Machines without wasting resources.

If you still don’t want fly launch or fly deploy to create two Machines, then you can use the --ha option to turn off this feature.

Create one Machine for a new app:

fly launch --ha=false

Create one Machine on first deploy or when your app is scaled down to zero Machines:

fly deploy --ha=false

Scale down an app to one Machine:

fly scale count 1

For apps with more than one process, refer to Scale a process group horizontally.

Automatically start and stop Machines

Automatic start and stop works for process groups with services defined. This feature enables you to have multiple Machines that only run when needed.

Fly Proxy can start and stop existing Machines based on incoming requests, so that your app can accommodate bursts in demand without keeping extra Machines running constantly. And if your app needs to have one or more Machines always running in your primary region, then you can set a minimum number of machines to keep running. Fly Proxy is also responsible for load balancing.

Default settings for new apps created using the fly launch command: automatically start and automatically stop Fly Machines, and minimum machines running is zero.

Default settings for some existing apps (or any apps that don’t have these settings in fly.toml): automatically start but don’t automatically stop Fly Machines.

Get all the details about automatically stopping and starting Machines.

Health check-based routing

Fly Proxy routes network connections away from Machines that are failing health checks.

Health check-based routing works for process groups with services defined and is based on the configuration of services.tcp_checks or services.http_checks in your app’s fly.toml file.

If the configured health checks are failing for a Machine, then the proxy doesn’t route network connections to that Machine. The proxy routes connections to a healthy Machine instead. If there aren’t any healthy Machines, then the connection will block, waiting for a Machine to become healthy.

Health check-based routing doesn’t work for the top-level health checks defined in the checks section of fly.toml, because the top-level checks don’t apply to specific services.