I'm Will Jordan, and I work on SRE at Fly.io. We transmogrify Docker containers into lightweight micro-VMs and run them on our own hardware in racks around the world, so your apps can run close to your users. Check it out—your app can be up and running in minutes. This is a post about how services like ours are structured, and, in particular, what the term "serverless" has come to mean to me.
Fly.io isn't a "Gartner Magic Quadrant" kind of company. We use terms like "FaaS" and "PaaS" and "serverless", but mostly to dunk on them. It's just not how we think about things. But the rest of the world absolutely does think this way, and I want to feel at home in that world.
I think I understand what "serverless" means, so much so that I'm going to stop putting quotes around the term. Serverless is a magic spell. Set a cauldron to boil. Throw in some servers, a bit of code, some eye of newt, and a credit card. Now behold, a bright line appears through your application, dividing servers from services…and, look again, now the servers have disappeared. Wonderful! Servers are annoying, and services are just code, the same kind of code that runs when we push a button in VS Code. Who can deny it: "No server is easier to manage than no server."
But, see, I work on servers. I'm a fan of magic, but I always have a guess at what's going on behind the curtain. Skulking beneath our serverless services are servers. The work they're doing isn't smoke and mirrors.
Let's peek behind the curtain. I'd like to perform the exercise of designing the most popular serverless platform on the Internet. We'll see how close I can get. Then I want to talk about what the implications of that design are.
Close your eyes, tap your keyboard three times and think to yourself, "There's no place like
Let's Start Building
Once this is invented, you'll probably want to use it to optimize sandwich photos uploaded by users of your social sandwich side project.
The first tool in our toolbox is the virtual machine. VMs were arguably "serverless" avant la lettre, and Lambda itself literally stood on the shoulders of EC2, so that's where we'll begin.
Take a big, bare-metal x86 server sitting in a datacenter with all the standard hookups. Like every server, it has an OS. But instead of running apps on that OS, install a Type 1 (bare-metal) hypervisor, like the open-source Xen.
The hypervisor is itself like a tiny operating system, but it runs guest OSs the way a normal OS would run apps. Each guest runs in a facsimile of a real machine; when the guest OS tries to interact with hardware, the hypervisor traps the execution and does a close-up magic trick to maintain the illusion. It seems complicated, but in fact the hypervisor code can be made a good deal simpler than the OSs it runs.
Now give that hypervisor an HTTP API. Let it start and stop guests, leasing out small slices of the bare metal to different customers. To the untrained eye, it looks a lot like EC2.
Even back in 2014, EC2 was boring. What we want is Lambda: we want to run functions, not a guest OS. We need a few more components. Let's introduce some additional characters:
- Placement is a service, with an API, that can start and stop VMs across a pool of servers.
- Manager is a service with an API that tracks the VMs (we'll start calling them Workers) running across that pool, and can tell us how to reach them.
- Frontend handles requests for things our Workers will actually do.
- function is the code the customer wants us to run. For your sandwich app, the function resizes and optimizes an image, and sends it to an S3 bucket.
Frontend reads an Invoke request for a function we want to run. (Someone's just uploaded an image to your S3 sandwich bucket through your app.)
Frontend asks a Manager to provide the network address of a Worker VM containing an instance of your function, where it can forward the request. The Manager either quickly returns an existing idle function instance, or if none are currently available, asks a Placement service to create a new one.
This is all easier said than done. For instance, we don't want to send multiple requests racing toward a single idle instance, and so we need to know when it's safe to forward the next request. At the same time, we need Manager to be highly available; our Manager can't just be a Postgres instance. Maybe we'll use Paxos or Raft for strongly-consistent distributed consensus, or maybe gossiping load and health hints will be more resilient.
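Here's a minimal, single-process sketch of that request flow, just to pin down who talks to whom. The class and method names (`reserve`, `release`, `invoke`) and the fake addresses are my own inventions for illustration. Keeping everything in one process also dodges the hard part: a real Manager has to stay consistent and highly available across many machines.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    """One function instance on a Worker VM (all fields are hypothetical)."""
    function: str
    address: str
    busy: bool = False


class Placement:
    """Stands in for the service that boots Worker VMs."""

    def __init__(self):
        self._ids = count(1)

    def create(self, function):
        # A real Placement service would boot a VM; we just invent an address.
        return Instance(function, f"10.0.0.{next(self._ids)}:8080")


class Manager:
    """Tracks instances and hands each idle one to a single request at a time."""

    def __init__(self, placement):
        self.placement = placement
        self.instances = []

    def reserve(self, function):
        # Reuse an idle instance if there is one, marking it busy so that
        # no second request races toward it.
        for inst in self.instances:
            if inst.function == function and not inst.busy:
                inst.busy = True
                return inst
        # Otherwise ask Placement for a fresh one.
        inst = self.placement.create(function)
        inst.busy = True
        self.instances.append(inst)
        return inst

    def release(self, inst):
        inst.busy = False


class Frontend:
    """Accepts Invoke requests and forwards them to a reserved instance."""

    def __init__(self, manager):
        self.manager = manager

    def invoke(self, function, payload):
        inst = self.manager.reserve(function)
        try:
            # Forwarding `payload` over the network is elided in this sketch;
            # we return the address the request would have gone to.
            return inst.address
        finally:
            self.manager.release(inst)
```

Two back-to-back calls to `Frontend.invoke` land on the same instance, while two overlapping reservations force Placement to create a second one.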
We can straightforwardly run a function instance on a Worker VM. But we can't just use any old VM; we can't trust a shared kernel with multitenant workloads. So: give each customer its own collection of dedicated EC2-instance Workers, and place function instances onto them. Boot up new Workers as needed.
Another catch: it takes seconds or even minutes to boot a new Worker. This means some of our requested functions have unacceptably (and unpredictably) high "cold start" time. (Imagine, in 2022, holding on to your excitement for over a minute waiting for your image of the local sandwicherie's scorpion-pepper grilled cheese to insert itself into your chat.) Have Placement manage a "warm pool" of running VMs, shared across all customers. Now functions can scale up quickly. To scale down, Manager periodically vacuums idle VMs, returning them to the warm pool for reuse.
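A warm pool is simple enough to sketch in a few lines. This is an illustration under my own assumptions, not how any real platform does it: `WarmPool`, `boot_vm`, and `target_size` are hypothetical names, and the pool ignores everything that makes this hard in practice (health checks, cleaning a VM between tenants, resizing the pool).

```python
from collections import deque


class WarmPool:
    """Pre-booted VMs shared by all customers (a hypothetical sketch)."""

    def __init__(self, boot_vm, target_size):
        self._boot_vm = boot_vm      # callable that boots and returns a VM
        self._target = target_size   # how many idle VMs we try to keep warm
        self._idle = deque(boot_vm() for _ in range(target_size))
        self.cold_starts = 0

    def acquire(self):
        """Hand out a warm VM if one exists; otherwise pay for a cold boot."""
        if self._idle:
            return self._idle.popleft()
        self.cold_starts += 1
        return self._boot_vm()

    def give_back(self, vm):
        """Called when idle VMs are vacuumed and returned for reuse."""
        if len(self._idle) < self._target:
            self._idle.append(vm)
        # Otherwise the pool is full; a real system would shut the VM down.

    def __len__(self):
        return len(self._idle)
```

The `cold_starts` counter is the number to watch: every increment is a customer who waited for a boot.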
Scale is our friend. We have lots of customers, so the warm pool smooths out unpredictable workloads, reducing the total number of EC2 instances we need. But we're not out of the woods yet. We can get huge spikes of consumption: say, an accidentally-recursive function. One broken customer brings everyone else back to cold-start latency. The easiest fix: soft limits ("contact us if you need more than 100 concurrent executions"). Beyond that, the service could adopt a token bucket rate-limiting mechanism to allow a controlled amount of sustained/burst scaling per customer or function.
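For the curious, a token bucket fits in a couple dozen lines. This is a generic sketch, not Lambda's actual limiter; the `rate` and `burst` parameters and the injectable clock are my own choices for illustration.

```python
import time


class TokenBucket:
    """Allow `rate` new instances per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate = rate              # tokens added per second (sustained)
        self.burst = burst            # bucket capacity (burst allowance)
        self._now = now               # clock, injectable for testing
        self._tokens = float(burst)   # start with a full bucket
        self._last = now()

    def allow(self, cost=1.0):
        """Spend `cost` tokens if available; return whether we could."""
        now = self._now()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, capped at the burst size.
        self._tokens = min(self.burst, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

One bucket per customer (or per function) means a runaway recursive function drains its own tokens rather than the shared warm pool.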
We've sketched most of orchestration, but hand-waved the actual function invocation. It's not all that complicated, though.
Once Placement allocates enough resources on a Worker, it can load up the function instance there. Remember, it's still 2014, and Docker only just became production-ready, so we'll roll our own container the old-fashioned way. A daemon on the Worker:
- handles the function initialization request,
- fetches the application code .zip file from object storage (S3),
- unpacks it on top of a ready-made runtime environment,
- launches the function-handler process in a chroot,
- drops privileges,
- uses namespaces and seccomp profiles to run in Docker-like incarceration,
- enforces configured CPU and memory resource limits with cgroups,
- uses the cgroup freezer to ensure that idle functions consume no resources outside of active requests proxied to the function instance.
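The last two steps in that list boil down to writing small strings into cgroup files. Here's a sketch of what those writes look like. One loud assumption: I'm using today's cgroup v2 file names (`cpu.max`, `memory.max`, `cgroup.freeze`), where a 2014-era daemon would have used the v1 controllers instead. The function names and the `fn-123`-style group names are hypothetical.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")   # usual cgroup v2 mount point


def limit_settings(name, cpu_fraction, memory_bytes, period_us=100_000):
    """Return the cgroup v2 files to write to cap one function instance."""
    group = CGROUP_ROOT / name
    # cpu.max takes "<quota> <period>" in microseconds: the group may run
    # for `quota` out of every `period`.
    quota_us = int(cpu_fraction * period_us)
    return {
        group / "cpu.max": f"{quota_us} {period_us}",
        group / "memory.max": str(memory_bytes),
    }


def freeze_setting(name, frozen):
    """Return the write that freezes (idle) or thaws (request in flight)."""
    return {CGROUP_ROOT / name / "cgroup.freeze": "1" if frozen else "0"}


def apply(settings):
    """Write the settings. Needs root and an existing cgroup directory."""
    for path, value in settings.items():
        path.write_text(value)
```

The daemon would thaw the group just before proxying a request and freeze it again once the response is out, so an idle function can't quietly burn CPU.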
Iterating on the Design
We've come up with a relatively naive design for Lambda. That's OK! We're Amazon and we can paper over the gaps with money and still have enough left over to make hats. More importantly, we're out in front of customers, and we can start learning.
Fast forward to 2018. We made it. "Serverless" is the new "elastic" and it's all the rage. Now let's make it fast.
What's killing us in our naive design is Xen. Xen is a bare-metal hypervisor designed to run arbitrary operating systems in arbitrary hardware configurations. But our customers don't want that. They're perfectly happy running arbitrary Linux applications on a specific, simplified Linux configuration.
Firecracker is a modern hypervisor built on KVM that exploits paravirtualization: the guest and the hypervisor are aware of each other, and cooperate. Unlike Xen, we don't emulate arbitrary devices, but rather virtio devices designed to be efficient to implement in software. With no wacky device support, we lose hundreds of milliseconds of boot-time probing. We can be up and running in under 125ms.
Firecracker can fit thousands of micro-VMs on a single server, paying less than 5MB per instance in memory.
This has profound implications. Before, we were carefully stage-managing how function instances made their way onto EC2 VMs, and the lifecycle of those EC2 VMs. But now, function instances can potentially just be VMs. It's safe for us to mix up tenants on the same hardware.
We can oversubscribe.
Oversubscription is a way of selling the same hardware to many people at once. We just bet they won't all actually ask to use it at the same time. And, at scale, this works surprisingly well. The trick: get really good at spreading around the load across machines to keep resource usage as uncorrelated as possible. We want to maximize server usage, but minimize contention.
Firecracker lets us spread load more evenly, because we can run thousands of different customers on the same server.
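To see why uncorrelated load is the whole game, compare what we sell (every tenant's own peak) with what we actually need (the peak of everyone's load added together). This toy model is mine, with made-up tenants who are idle 90% of the time; it says nothing about real Lambda traffic.

```python
import random


def peak_demand(loads):
    """Peak of the summed load across tenants, one sample per time step."""
    return max(sum(step) for step in zip(*loads))


def oversubscription_ratio(loads):
    """Capacity sold (sum of per-tenant peaks) over capacity actually needed."""
    sold = sum(max(load) for load in loads)
    return sold / peak_demand(loads)


rng = random.Random(0)
# 200 hypothetical tenants over 1000 time steps, each busy about 10% of
# the time and independent of all the others.
tenants = [
    [1.0 if rng.random() < 0.1 else 0.0 for _ in range(1000)]
    for _ in range(200)
]
print(round(oversubscription_ratio(tenants), 1))
```

If every tenant bursts at the same moment, the ratio collapses to 1 and there's nothing to oversell. The more independent the tenants, the further above 1 it climbs.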
Workers are now bare-metal servers, not EC2 VMs. We need a warm pool of them, too. It's a lot of extra micro-management. And it's worth it. The resource-sharing shell game is way more profitable. Reportedly, Lambda runs in production with CPU and memory oversubscription ratios as high as 10x. Not too shabby!
There's a tradeoff to this. We've aggressively decorrelated our server workloads, shuffling customers onto machines like suits in a deck of cards. But now we can't share memory across functions, like the classic pre-forking web server model.
On a single server, a function with n concurrent executions might consume only slightly more memory than a single function. Shuffled onto n machines, those executions cost n times more. Plus, on the single server, instances can fork instantly from a parent, effectively eliminating cold-start latency.
And now we have a network-sized hole in performance. Functions are related; they're intrinsically correlated. Think about serverless databases, or map-reduce functions, or long chains of functions in a microservice ensemble. What we want is network locality, but we also want related loads spread across different hardware to minimize contention. Our goals are in tension.
So some functions might perform best packed tightly to optimize performance, while others are best spread thin for more distributed resource usage across servers. Some kind of hinting along the lines of EC2 placement groups could help thread the needle, but it's still a hard problem.
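A placement hint could be as small as one extra argument. In this sketch, the `pack` and `spread` strategy names and the `group` label are hypothetical, loosely in the spirit of placement groups. It deliberately ignores server capacity, which a real scheduler obviously can't.

```python
from collections import Counter


def choose_server(servers, group, strategy):
    """Pick a server for a new instance belonging to `group`.

    servers:  dict mapping server name -> Counter of group -> instance count
    strategy: "pack" favors network locality, "spread" favors decorrelation
    """
    def same_group(name):
        return servers[name][group]

    names = sorted(servers)  # sorted so that ties break deterministically
    if strategy == "pack":
        # Land next to as many related instances as possible.
        return max(names, key=same_group)
    if strategy == "spread":
        # Land where the fewest related instances already live.
        return min(names, key=same_group)
    raise ValueError(f"unknown strategy: {strategy!r}")
```

The two branches pull in opposite directions on the same input, which is exactly the tension: the scheduler can't know which one you want unless you tell it.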
At any rate, we have a design, and it works. Now let's start thinking about the ramifications of the decisions we've made so far, and the decisions that we have yet to make.
Ramifications for Concurrency
Lambda's one-request-per-instance concurrency model is simple and flexible: each function instance can handle one single request at a time. More load, more instances.
This works like Common Gateway Interface (CGI) of yore, or more precisely, like implementations of its successor FastCGI which reuse instances across requests.
Scaling is simple and straightforward. Each request is handled in its own instance, separate from all other concurrent requests. No locks, thread-safety or any other parallel programming concepts.
But handling concurrent requests in a single instance can be more efficient, especially for high-performance web application servers that can leverage asynchronous I/O event loops and user-space threads to minimize context-switching overheads. Google's Cloud Run product supports configurable per-instance concurrency. Lambda's design makes it harder for us to pull off tricks like that.
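The difference shows up in simple arithmetic. A sketch, with `per_instance_concurrency` as my own name for the knob:

```python
import math


def instances_needed(concurrent_requests, per_instance_concurrency=1):
    """Instances required to serve a given number of in-flight requests.

    With the default of 1 this is the one-request-per-instance model:
    every in-flight request needs its own instance.
    """
    return math.ceil(concurrent_requests / per_instance_concurrency)
```

At 250 requests in flight, one-request-per-instance needs 250 instances. Let each instance juggle 50 requests (an arbitrary figure) and you need 5, provided the app really can handle that much concurrency on its own.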
Ramifications for Pricing
If we're Lambda, we bill per-second duration based on memory use, with a per-request surcharge; like a taxi meter, we have a base fee, and then the meter ticks up as long as we're working.
Two ways of looking at the request fee. First, it's a fudge factor representing the aggregate marginal costs of the various backends involved in handling the request.
But if you're an MBA, it's also a way to shift to "pay-for-value" or value-based pricing, a founding tenet of Lambda. Value pricing says that you pay based on how useful the service is; if we figure out ways to deliver the service more cheaply, that's gravy for us. Without the surcharge, we're doing cost-plus pricing. You'd just pay for the resources we allocated to you.
(Remember, we're up to 10x oversubscribed. Customers are, on average, utilizing only 10% of the resources they pay for.)
We combine CPU and memory pricing to simplify duration-based pricing. Simple is good, but costs our users flexibility if they have lopsided CPU or memory-heavy functions. For that problem, there's Fargate, Lambda's evil twin.
This pricing seems simple! But it's actually a little bit complicated, if you are sensitive to cost.
Your image-cruncher function might be making good use of its resources for most of its running time. But what if a function process is actually really fast? It might actually skew cheap in resources and expensive in requests.
And now, you've added a function to periodically scrape the major socials for new pictures tagged with any sandwich, artisanal sandwich stockist, or vending machine known to your database. Or, better, say you're Max Rozen, doing uptime checks on every endpoint in your database. Now you're paying full whack for CPU and RAM usage the whole time you wait (up to 10s) for a response from each one, to, you know, see if it’s online.
The value-based pricing here hits the sweet spot for functions that a) run long enough per request to amortize the request cost, and b) make enough use of the provisioned resources, while they run, to justify paying for them that long.
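You can put rough numbers on that sweet spot. The sketch below models the bill as a request fee plus a duration fee. The function names are mine, and any prices you plug in are placeholders, not anyone's actual rates.

```python
def invocation_cost(requests, seconds_each, memory_gb,
                    price_per_request, price_per_gb_second):
    """Return (request_fee, duration_fee) for a batch of invocations."""
    request_fee = requests * price_per_request
    # Duration is billed on configured memory, whether or not it is used.
    duration_fee = requests * seconds_each * memory_gb * price_per_gb_second
    return request_fee, duration_fee


def request_fee_share(requests, seconds_each, memory_gb,
                      price_per_request, price_per_gb_second):
    """Fraction of the total bill that comes from the per-request fee."""
    request_fee, duration_fee = invocation_cost(
        requests, seconds_each, memory_gb,
        price_per_request, price_per_gb_second,
    )
    return request_fee / (request_fee + duration_fee)
```

A very fast function pushes `request_fee_share` toward 1: you're mostly paying the surcharge. An uptime check that sleeps on the network for 10 seconds pushes it toward 0: you're mostly paying for memory you aren't using.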
Prioritizing nimble scaling, combined with instance-per-request and per-request billing, does set up a potential footgun for our customers. Don't DDoS yourself.
We're counting on the product as a whole to add enough value to keep less price-sensitive customers coming back, even far from the sweet spot.
Ramifications for APIs
The public runtime API to a Lambda function is the Invoke REST API, which accepts a POST method specifying the function name and request "payload", and requires a signature with appropriate AWS credentials. This conforms to Amazon's monolithic, internally-mandated API structure, but is practically unusable outside the API-wrangling comfort of the AWS SDK.
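To make the shape concrete, here's a sketch that builds (but doesn't send) an Invoke request by hand. The URL layout is what I understand the public endpoint to look like; treat it as an assumption and check the API reference. The function name and payload are made up. The request signature is left out entirely, so AWS would reject this as written, which is rather the point.

```python
import json
import urllib.request
from urllib.parse import quote


def build_invoke_request(region, function_name, payload):
    """Build an unsigned POST for the Invoke API (illustration only)."""
    url = (
        f"https://lambda.{region}.amazonaws.com"
        f"/2015-03-31/functions/{quote(function_name, safe='')}/invocations"
    )
    body = json.dumps(payload).encode("utf-8")
    # A real call also needs AWS Signature Version 4 headers, which is
    # most of the work the SDK does for you.
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


req = build_invoke_request(
    "us-east-1", "resize-sandwich", {"bucket": "sandwiches", "key": "blt.jpg"}
)
```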
A cottage industry has sprung up around frameworks just to help you hook Lambda up to the web. Amazon built one of them into CloudFormation. Problem: too much YAML. Solution: more YAML.
The way out is embarrassingly simple: the runtime API can just pass HTTP requests directly to the function instance. Most of what "API gateways" do can be built into HTTP proxy layers. For the common case of web applications, an HTTP-based API eliminates a layer of indirection and plugs in nicely with the mature ecosystem of web utilities.
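With an HTTP-based runtime API, a "function" is just a tiny web server, and anything that speaks HTTP can call it. A minimal sketch using only the standard library; the handler name, the port, and the upper-casing "work" are placeholders of mine.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class FunctionHandler(BaseHTTPRequestHandler):
    """The whole function: read the request body, reply with a result."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        reply = body.upper()   # stand-in for the real work
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, *args):
        pass   # keep the sketch quiet


def serve(port=8080):
    """Run the function instance; the platform's proxy forwards requests here."""
    HTTPServer(("127.0.0.1", port), FunctionHandler).serve_forever()
```

No SDK, no signature, no event envelope: the proxy layer in front handles routing and auth, and the function sees an ordinary request.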
Ramifications for Resilience
Lambda's execution environment sets strict limits:
- on function initialization (10 seconds)
- on Invoke duration (default 3 seconds; limit originally 60 seconds, later increased to 5 and then 15 minutes), and
- zero guarantees around idle-function lifecycle (a function instance could get shut down any time it's not handling a request, and will shut down once every 14 hours.)
This tightly-scoped lifecycle is great for the platform provider. It helps workloads quickly migrate away from overloaded or unhealthy instances, and makes it easy to shuffle functions around during server maintenance and upgrades without impacting services. And what's good for the platform is probably good for most customers, too!
But it's not ideal for apps
- with expensive or time-consuming initialization steps
- or that depend heavily on dynamic local caches for performance
- or when you're just not sure how long a response might take.
One alternative is for the platform to try to keep servers up and running forever, but sometimes you just have to reboot servers to patch stuff. Another option, instead of recycling VMs, is live migration: sending a snapshot of the running VM over the network to the new server with as little downtime as possible. Google Compute Engine supports live migration for its instances and uses the feature to seamlessly conduct maintenance on its servers every few weeks.
WebAssembly extends the language-sandbox approach with a virtual instruction set architecture, either embedded within v8 isolates or run by a dedicated server-side runtime like wasmtime.
Fastly built its Compute@Edge product around WebAssembly/WASI. However, WASI is still young and evolving quickly. On the server side, WASM's overhead doesn't pay its freight: there's as much as a 50% performance gap between WASM and native code, which makes virtualization look cheap by comparison.
How Did I Do?
I just designed a shameless knockoff of Lambda, the most popular specimen of the most serverless of serverless services: a fleeting scrap of compute you can will into being, that scales freely (not in the monetary sense) and fades into oblivion when it’s no longer needed.
This article contains no small degree of bias! There’s also no small degree of appreciation for the craft that goes on behind the curtain at AWS and other purveyors of "serverless" services.