The Big Little Guide to Running Code in the Cloud(s)

There are now many ways to run code, from traditional servers to the new serverless to cutting-edge edge options. This guide evaluates many of the code execution offerings available across AWS, Google, Azure, Heroku, Fly and Cloudflare.

As software developers, we've come a long way... from fighting over limited time slices of lab computing, we've arrived at being able to instantly deploy and run code on servers in hundreds of edge data-centers in cities all over the world. This guide evaluates some of the cloud computing options that have become available over the last two decades—we'll cover the offerings from the three major cloud providers, along with a few other companies building interesting options.

The common stages of application development are writing code, packaging code, deploying and running code. In this guide we'll talk about advances in packaging, deploying and running code. We'll also cover the basics of the infrastructure required to connect our applications to the rest of the internet, like load-balancers, request protocols and routing. We won't be looking at databases, data storage, message queues, and other advances here—these are all massive topics that deserve their own separate guides. We'll be looking at three kinds of service providers: IaaS or infrastructure-as-a-service, PaaS or platform-as-a-service, and FaaS or functions-as-a-service.

An IaaS, or infrastructure-as-a-service system gives you all the basic building blocks necessary to run your application, and asks you to make all the decisions about how your application should run—and also do all the work required to set up, configure, maintain and monitor it. The building blocks might have some loose integrations with each other, but you'll have to do most of the wiring to make them work together.

A PaaS, or platform-as-a-service system makes a lot of those decisions for you, and does a lot of the work itself. The servers are already configured correctly and high-performance request routing and monitoring is taken care of. Operational maintenance is also handled for you, with the PaaS employing a highly trained team of engineers on a 24/7/365 pager duty schedule to make sure everything is running according to your choices. All you need to do is submit the application, configure it, and make a few strategic choices about how you want things to work.

A FaaS, or functions-as-a-service system, works very differently. Your application is not a stand-alone system here—it's a smaller piece of code that fits tightly into a much larger framework offered by the FaaS system. It cannot run independently—it needs to follow the technology, dependency, packaging, deployment and execution rules of the FaaS, and it can run only in the context of that FaaS. In exchange for conforming to these tight rules the FaaS offers you something that's very hard, if not impossible, to achieve using any other kind of system. We'll cover what “magic” a FaaS provides, along with what the tradeoffs are and how to reduce their impact on your work.

Simple IaaS

The simplest way to deploy an application on the cloud is to rent one server and deploy it there. AWS has EC2, Google has Compute Engine—Azure doesn't bother with a marketing department and just puts everyone in sales, so they just call them Virtual Machines. As Azure very helpfully points out, these servers are all virtual machines. They're real servers with CPU, RAM, disks and networking installed in a data-center building somewhere, and they're either rented out to you as a whole, or split into smaller sections using virtualization hardware and software—hence the term “virtual” machines. This splitting is invisible to you, the customer—you'll see each virtual section that you rent as a complete, isolated, fully functioning machine. There's a lot of incredible engineering that goes into doing this securely and with high performance—if you're interested, the AWS Nitro System is worth reading up on.

Your server needs to be reachable on the internet, so the first thing it'll have is an IP address. These are a pretty scarce resource, though—represented as four numbers from 0 to 255, there are only 4,294,967,296 (256 x 256 x 256 x 256) possible addresses, out of which even fewer are actually usable and available. So the cloud providers will usually either give you any available address, or allow you to reserve one—often for free if you're actually using it, but at extra charges if you're squatting without using it. AWS calls this Elastic IP, and Google calls it “Reserving a static external IP address”, because they're probably in the process of moving everyone from marketing into sales. You probably already guessed what Azure calls it. There's also support rolling out for a newer, larger address format called IPv6 that's four times longer and has more addresses than all the grains of sand on a billion earths—but that will take a little longer before it's fully adopted. We'll revisit IPv6 addressing again when we talk about more interesting deployment options.

Once we have an IP address, we log into the machine using SSH and install our code package, which can be as simple as a zip file for interpreted languages (like Ruby, Python or PHP), or a specially formatted zip file like JAR or WAR (Java & JVM), or a compiled binary (Go, C++). We then complete the process of deployment by installing any dependencies or language runtimes we need, and run the application with a process manager so it stays running after we log out or the server restarts.

We can then point a domain name, like www.example.com, at the IP address of our server using the name-servers provided by the company we registered the domain name with. Or we could use the DNS name-servers offered by the cloud providers, which promise better performance all over the world. AWS has Route53 (named after the default port for a DNS name-server, which is 53), Google has Cloud DNS and Azure just has DNS. Using the cloud providers is helpful because they've got servers deployed all over the world, and DNS speeds make a big difference in how fast your application works for users everywhere. Route53 has a few advantages here—besides just answering the question of which IP address is serving a particular domain name, it also gives you the option to tailor that answer based on where the user is, letting you send them to a deployment that's close to them. This assumes that you have a global application that's deployed in multiple places across the world, of course. Cloudflare is also an excellent DNS provider—they also have a deployment offering that we'll cover later.
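If you're scripting this on AWS, pointing a domain at a server boils down to upserting a record set in your Route53 hosted zone. Here's a minimal sketch using the AWS SDK for JavaScript v3—the hosted zone ID, domain and IP address are all hypothetical placeholders:

    import { Route53Client, ChangeResourceRecordSetsCommand } from "@aws-sdk/client-route-53";

    const route53 = new Route53Client({ region: "us-east-1" });

    // Point www.example.com at our server's (hypothetical) IP address.
    await route53.send(new ChangeResourceRecordSetsCommand({
      HostedZoneId: "Z0000000EXAMPLE", // hypothetical hosted zone
      ChangeBatch: {
        Changes: [{
          Action: "UPSERT",
          ResourceRecordSet: {
            Name: "www.example.com",
            Type: "A",
            TTL: 300,
            ResourceRecords: [{ Value: "203.0.113.10" }],
          },
        }],
      },
    }));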

Along with some extra work that I really hope you do, like OS firewalls and OS hardening, this completes a basic application server setup on the cloud. The cloud providers helpfully provide services to manage basic network security around your server, like AWS VPC, Google VPC, or Azure Virtual Network. All of these will allow you to selectively open and close ports with allow/block rules that can be scoped to IP addresses; or scoped to another group of similarly managed servers (often called a security group). There are also pre-hardened cloud-specific OS distributions like Amazon Linux available for free use.

To see what's going on with your application, or monitor it, you could always log in to the machine and read, or tail, the logs. But there's another building block that cloud companies will give you—a tool to ingest logs, set up alerts and dashboards on them and archive them if necessary. AWS CloudWatch, Google Cloud Logging, and Azure Monitor all provide these features, usually charging based on the quantity of logs you ingest into them and how long you hold them for. Each of them has different features, but as a starting point it'll make sense to just use whatever is available with your chosen platform. Each platform's logging tool will usually also handle native metrics for you—you'll be able to make a dashboard of server CPU usage, disk utilization, or any other metric that the cloud provider tracks natively. If you need more than this, companies like Datadog, Sumo Logic, New Relic, Elastic and many others can also help.

Protocols & Encryption

Before we move on to more complex deployments, it's important to review what kinds of incoming requests our applications might be dealing with in the first place. The common protocols we use are:

  • UDP, where packets of information are sent to an IP address with no expectation of acknowledgement.
  • TCP, which, like UDP, builds on IP to create the concept of a connection, where the sender and receiver work with ordered packets of information that are all acknowledged—this gives the sender and receiver the impression of having a 2-way reliable sequential stream on which to exchange raw bytes.
  • HTTP/1, which builds on TCP to add the concept of a request and response. A request is sent to a human readable address, called a URL, with headers that act as key-value metadata and an optional body of bytes. Each request gets a response, with the same format of headers and body. Each TCP connection carries one request & response exchange at a time—there's a raw example of this just after the list.
  • WebSocket, which builds on TCP to provide a similar 2-way ordered data stream, but instead of raw bytes the protocol allows for distinct messages to be sent. Each message is either received as a whole, or the connection is considered to have failed. WebSocket is a companion protocol to HTTP, and a HTTP connection is often transformed (technically upgraded) into a WebSocket.
  • HTTP/2, which builds on TCP and offers the same features as HTTP/1, but with better performance, connection multiplexing (more than one HTTP request-response exchange can happen simultaneously on the same TCP connection), header compression, and preemptive data pushing from the server.
  • gRPC, which builds on HTTP/2 to offer remote procedure calls—a way for clients to call functions on servers, identified by URI, with an input and output exchanged in predetermined binary formats.
  • HTTP/3 + QUIC, which builds on UDP and offers the same features as HTTP/2, but better—connections set up faster and there's no TCP head-of-line blocking.
  • TLS/SSL, which isn't a standalone protocol, but a way to encrypt and secure TCP and all the protocols that are built on it (which is all of the ones above except UDP—and QUIC, which builds encryption in directly). This is pretty much a basic necessity these days—most browsers will make users deeply ashamed of visiting your site if it's not encrypted, and some mobile OSs will just flat-out refuse to make a connection with a non-encrypted server.
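To make the layering above concrete, here's a tiny sketch (using Node's built-in net module, in TypeScript) that opens a plain TCP connection and speaks HTTP/1.1 by hand—no HTTP library involved, just text written onto the stream:

    import * as net from "node:net";

    // Open a raw TCP connection on port 80 and write an HTTP/1.1 request as plain text.
    const socket = net.connect(80, "example.com", () => {
      socket.write("GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n");
    });

    // Whatever comes back—status line, headers, body—is just bytes on the same connection.
    socket.on("data", (chunk) => process.stdout.write(chunk));
    socket.on("end", () => console.log("\n--- connection closed ---"));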

It's also worth re-visiting how trust-based encryption on the web works: the foundation of the secure web is a small group of (hopefully) very competent companies, called certificate authorities, who publish public keys that every major browser and operating system decides to trust. When we want to use TLS, we generate a private key and public key pair, and then convince one of these companies that this key pair is going to be used on one particular domain name that we control. After we provide enough proof, and sometimes money, the company will issue a certificate that says they trust that we will indeed use that key-pair on that particular domain name. We present this certificate, which includes our public key, to all clients to earn their indirect trust—and since we have the private key accessible to our application servers we can then enjoy fully encrypted connections with the TLS protocol.
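To see this certificate chain in action, here's a small sketch (again with Node built-ins) that makes a TLS connection and prints out who vouched for the certificate we're being asked to trust:

    import * as tls from "node:tls";

    // Connect to a TLS-enabled server; the handshake is where the server presents
    // its certificate (and public key) for us to verify against the authorities we trust.
    const socket = tls.connect({ host: "example.com", port: 443, servername: "example.com" }, () => {
      const cert = socket.getPeerCertificate();
      console.log("subject:", cert.subject);   // the domain the certificate covers
      console.log("issuer:", cert.issuer);     // the certificate authority that issued it
      console.log("valid until:", cert.valid_to);
      socket.end();
    });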

Complex IaaS

No server that we build, own, or rent from a cloud provider is ever a perfect machine—at some point something is going to fail. And even if we don't care about failures too much, we're often in situations where we want to run the same application on multiple servers—if not as backups, at least to handle more load than a single server can manage. At this point our deployment starts to get a little more complex—we can set up multiple servers easily enough, but we now need a new system to load balance (evenly spread) and fail-over (ignore the servers that aren't working) incoming requests between our servers.

The system that we use to balance requests between our servers must support balancing the protocol(s) that we're using in our application. This is actually pretty difficult to do correctly—because the system is going to be acting as a proxy, it needs to be as transparent as possible: both in terms of performance and modifications to the requests and responses. It should be so fast and meddle so little with the connections that our applications and users experience no problems from using it.

We also want this load-balancing system to handle TLS/SSL termination. Our applications may be able to handle encryption/decryption themselves, but we don't actually want the private key accessible to our application servers—that's a security nightmare. We also don't want to worry about making sure our chosen language or framework does a good & fast job encrypting/decrypting connections—especially if we chose the language for its strengths in other areas. With TLS termination, the load-balancer handles the encryption/decryption of all the connections, forwarding just a normal non-encrypted connection to our application. This way, our application servers don't need to be concerned with any of this stuff—the work is contained in the load-balancers that know how to do it fast, securely, and well. Lots of cloud providers and load-balancers will also provide automated certificate management, usually integrating with the free Let's Encrypt authority. AWS is also an authority themselves with their free ACM service.

Open source tools like Nginx and HAProxy are available to do load balancing, along with specialized hardware from companies like F5—each of these systems will support all or some subset of the protocols and features above. But here we're more interested in how the cloud providers can help us handle this, so let's jump to that:

Network/L4 Load Balancers

The most transparent load balancing system on AWS is the Network Load Balancer. You pay a base monthly charge, and AWS will set up, manage, and scale a fleet of servers that load balance at a UDP and TCP level, called Layer 4—which means that all protocols built on TCP & UDP will also work. The NLB is known to be very fast and is so transparent that you literally won't know it's there—your application will see the IP address of your clients as the origin of the connection as if there wasn't a load-balancer in the middle. The NLB does TLS termination as well, integrating with the ACM to automatically issue and manage your certificates. The service will automatically scale up and down, charging you a multiple of about half-a-cent every hour, based on the maximum of three metrics—either new connections made, or existing active connections, or data transferred. Google also has a network load balancer that behaves similarly to the one AWS offers, but as of this writing it does not support automated certificate management. You'll need to get your own certificate, upload it, and keep it renewed periodically. The Azure Load Balancer works similarly as well, except it doesn't seem to support TLS termination at all.

Application/L7 Load Balancers

The next kind of load balancing system works at the HTTP level, or Layer 7. These systems actually read the data in the incoming requests, assuming that they're HTTP (or based on HTTP), hold the connection, and initiate a separate request to the application with the same data mapped on to it. While this results in some transparency loss—the process is slightly slower, and the source IP address you see is now the IP address of the load balancing server—there are also many advantages:

  • the load-balancer can now multiplex multiple domain names to applications, by examining the host headers and deciding which application this request should go to.
  • it can route requests to different applications or modules of an application depending on the path of the request address.
  • it can apply a rules engine to do advanced functions like redirects, checking for the presence and validity of authorization tokens, or anything else you might want across all requests and applications.
  • it can compensate for the loss of source IP address transparency by injecting the original client address into the request headers, usually in X-Forwarded-For—there's a small example of reading this header just after the list.
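Here's a minimal sketch of what reading that header looks like in an application sitting behind an L7 load-balancer (a plain Node HTTP server in TypeScript; the port is arbitrary):

    import * as http from "node:http";

    const server = http.createServer((req, res) => {
      // Behind an L7 load-balancer the TCP peer is the balancer itself, so the
      // original client address (if the balancer injected it) lives in X-Forwarded-For.
      const forwarded = req.headers["x-forwarded-for"];
      const clientIp = typeof forwarded === "string"
        ? forwarded.split(",")[0].trim()   // first hop is the original client
        : req.socket.remoteAddress;        // fall back to the direct peer
      res.end(`hello, ${clientIp}\n`);
    });

    server.listen(8080);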

One caveat with L7 load-balancers, though, is that protocol support doesn't flow upward like it does with TCP & network load-balancers—they need to have separate support and handling for HTTP/1, WebSockets, HTTP/2, gRPC and HTTP/3. These protocols share a common ancestor in TCP, but understanding one does not guarantee that a system will understand any of the others. Each L7 load balancer will support a subset of these protocols.

The AWS service offering this kind of L7 system is called the Application Load Balancer and it's priced similarly to the NLB, but a little more expensive. It supports incoming connections in HTTP/1, HTTP/2, WebSockets and gRPC, with the option to proxy them forward to your application in the same protocols (after optional TLS termination) or downgrade HTTP/2 to HTTP/1. This way if your application framework only supports HTTP/1 you can still enjoy some of the benefits of HTTP/2. It also supports routing rules based on host, path, method, query parameter, source IP and arbitrary headers; and can automatically validate authorization tokens with a bunch of identity providers.

Google's Cloud Load Balancing system supports HTTP/1, HTTP/2, gRPC and also HTTP/3+QUIC—with host and path based routing. It also has the advantage of being global—it publishes a single IP address that Google's servers worldwide respond to, and requests are forwarded to the nearest configured application server. It provides TLS termination and has options for both automated and self managed certificates. There's also deep integration with Google Cloud CDN.

The AWS answer to Google's load balancing being global is the Global Accelerator—a kind of worldwide load-balancer of load-balancers that gives you two IP addresses that will route to many AWS servers all over the world, which will then forward the connection to the nearest configured application server or NLB/ALB. On the CDN side, AWS CloudFront works great by itself, but doesn't have much integration with either the ALB or Global Accelerator.

Azure answers all this with Front Door, which does a little bit of everything—support for HTTP/2 (but not WebSockets), no word on gRPC, and routing based on hosts and paths.

Automatic Scaling

Once you've chosen a load balancing system, the complementary decision is if and how to automatically change the number of servers you're running, based on some inputs. The first step to doing this is to figure out how to automate setting up a new server. The manual way to do it is to SSH into it and run the commands one by one, and one way to automate that would be to write those commands into a shell script that can be run on any new machine, performing all the necessary setup and configuration. The cloud providers will allow you to specify this shell script as a launch script, automatically running it when a new server boots up for the first time.

There's also often the option of taking a fully functioning server and treating it as a prototype of sorts, by cloning its root hard drive as is and then running each new server with a clone of that drive. There are also widely available configuration and installation tools—any method will work fine, but we need to choose one.

Once we've figured out how a new server is going to be set up automatically, AWS, Google and Azure then offer to automatically add new servers when certain rules are met, and automatically attach those servers to your configured load-balancer. This means that when the work your application needs to do increases, you don't need to get involved in any way—the cloud control systems will automatically spin new servers up to handle the extra load. The most common rule is CPU utilization, where you say new servers need to be added if the average utilization over some number of minutes goes over some limit, say 75% over 5 minutes. There are other triggers and rules possible as well, like those based on memory utilization, network bandwidth—or even rules based on external factors, like the time of day, number of pending messages in a queue, or signals from any other system.

You'll always be able to set a limit on the expansion of your servers, so you won't be shocked by a massive bill later. The reverse of the scaling operations is also available and can be automatic—servers can be removed when the CPU utilization drops below 30% for 10 minutes, for example, or any other rule that's usually the inverse of what you set up in the scaling up operations. It's also normal to set a minimum limit as well, because going down to zero will usually mess with the metrics you're using to scale up and down—not to mention make your application completely unavailable.

The rule of thumb I use for web applications is to add servers when CPU utilization is over 70% for 5 minutes, and drop servers when it's lower than 30% for 10 minutes—but there are infinite ways to set this up, and it's very dependent on your application. For servers that process jobs I hook up a rule to add more servers if the number of messages in the queue goes over a few hundred, and drop servers if there are fewer than a hundred messages. The minimum is usually two servers, and I set the max according to what my database can handle.
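On AWS, the closest built-in equivalent to that rule of thumb is a target-tracking scaling policy on an Auto Scaling group. Here's a hedged sketch using the AWS SDK for JavaScript v3—the group name is hypothetical, and target tracking uses a single target value (the service adds servers above it and removes them below it, with built-in cooldowns) rather than two separate thresholds:

    import { AutoScalingClient, PutScalingPolicyCommand } from "@aws-sdk/client-auto-scaling";

    const autoscaling = new AutoScalingClient({ region: "us-east-1" });

    // Keep the group's average CPU utilization hovering around 70%.
    await autoscaling.send(new PutScalingPolicyCommand({
      AutoScalingGroupName: "web-asg", // hypothetical group name
      PolicyName: "keep-cpu-near-70",
      PolicyType: "TargetTrackingScaling",
      TargetTrackingConfiguration: {
        PredefinedMetricSpecification: { PredefinedMetricType: "ASGAverageCPUUtilization" },
        TargetValue: 70,
      },
    }));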

Packing Progress

The options we've covered up to this point are generally referred to as IaaS, or infrastructure as a service. These providers give you the option to rent servers and a few other services, but how they're used and set up is entirely up to you.

One of the earliest options to go one step further was Heroku—initially built to host Rails applications, Heroku took the Rails idea of convention over configuration and applied it to server deployment, promising that if you had a conventional Rails application they would use that knowledge to handle packaging and deployment tasks for you. You just had to give Heroku access to your code folder or repository, and it would package the code into a simple custom format (called a slug) on Heroku servers, and run it for you on fully managed virtual servers called dynos. An environment variable called PORT would be injected into your application, and you just needed to serve your requests on that port. A common load-balancing and routing layer would handle all requests and send them to your app, so there was no separate load-balancer to configure—you could simply choose the number of app servers you wanted to run and it would happen. All the logs your application wrote to STDOUT (standard output, the default Unix destination for log / print statements in your code) would be automatically captured and consolidated for you in a dashboard and CLI.
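The contract with the dyno is about as small as it gets—here's a sketch of what honoring it looks like in a Node/TypeScript app (the fallback port is arbitrary):

    import * as http from "node:http";

    // Heroku injects the port to bind on via the PORT environment variable.
    const port = Number(process.env.PORT ?? 3000);

    http.createServer((req, res) => {
      res.writeHead(200, { "content-type": "text/plain" });
      res.end("hello from a dyno\n");
    }).listen(port, () => {
      // Anything written to STDOUT ends up in the consolidated Heroku logs.
      console.log(`listening on ${port}`);
    });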

This form of simplified and opinionated hosting became known as a platform-as-a-service, or PaaS. Heroku then branched out to other application stacks and frameworks by defining the packaging format, the slug, a little better. A series of shell scripts, called buildpacks, were then written to build many common application types into the standard slug format. Assuming your code ran correctly on the same underlying OS and dependency list, or stack, that Heroku used, Heroku could just build a slug with a suitable buildpack and deploy it.

Other attempts were made to standardize the Heroku slug format, but it never caught on broadly. Luckily, there was a different packaging format being developed that would go on to revolutionize how applications were packaged and deployed. When packaging code ourselves, we write scripts and commands to install the full set of dependencies that we need to run our application, install the application itself, and set up the process manager. Each of these steps is error prone—versions can change, network downloads might time out, and the dependency environment isn't guaranteed to be the same as on our development machines, or the runtime environment might change from one application server to the next depending on whether some OS patches are installed. Heroku solved some of these problems by vendoring—building the slug meant downloading all the dependencies and binaries an application needed into the folder that the code resided in. But it was still a relatively incomplete solution, because it didn't deal with the OS and compatibility problems between different development and deployment environments.

A solution to standardize packaging completely was introduced with Docker, using containers, which offered a way to fully standardize the OS, patches, dependencies, application code and configuration setup—all in a way that a container could be created right on the local development machine. A container created this way would run exactly the same on any server—with clean options available to configure each environment and hook up foundation services like log collection. Since the introduction of Docker there have also been other advances and standards that have built on this idea, including a way to make containers from buildpacks—so now we have a full ecosystem of options for packaging code. When we package code into containers, they can run on any platform with no regard to the underlying OS stack.

This is a big deal, and it opens up a lot of deployment options. If you're deploying a full package like a container, cloud companies can now get creative in how they can then take that package and run it for you. This gave rise to the idea of orchestration, which allows you to use your servers as one giant pool, or cluster, of computing capacity—and run as few or as many containers as you need for each of your applications, in that cluster. An orchestrator can handle your deployments, keep track of which servers have which containers running on them, which servers have capacity available to run more, where the best place is to run the next container, and the list of containers and their IP address / port combinations that are running as part of each application. These orchestrators can then provide their own load balancing systems, or integrate with the existing load balancers on offer. This entire containerization & orchestration ecosystem allows you to abstract away the concept of a server—it doesn't matter what OS or version or software is installed on each server, because any server that has the orchestrator's agent installed is now a source of raw computing power that can be added into your resource pool. Your applications will bundle the OS and dependencies of their choice along with code, so they don't care at all about where or what they're running on.

ECS

The Elastic Container Service was one of the first orchestrators to become available, specific to AWS. It fully integrates with the AWS ecosystem, including pulling your container images from the AWS-native ECR repository, auto-scaling, attaching/detaching to the ALB/NLB load-balancers, logging integrated with the CloudWatch logging system and more. ECS works by asking you to install an ECS agent on your pool of servers, and using that agent to manage the containers running on them. There's a fully optimized OS image available with the agent preinstalled, and there's also the option of installing the agent on other servers anywhere else that you might control, seamlessly adding that capacity into your ECS cluster.

You can then set up services and tasks on ECS. Services are continuously running containers that can serve web requests or do continuous work, while tasks are periodic or one-time runs of a container that are expected to do some work and then shut down. ECS will intelligently spread your services and tasks over all the servers in your cluster using strategies that you can customize. You can apply auto-scaling policies to your services, and have them attached automatically to load-balancers. With plain ECS, you need to make sure that your cluster is large enough to run all the containers you want—the simple way to do this would be to run an auto-scaling fleet of servers based on the ECS optimized OS image that are instructed to join your cluster in their launch script.
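To give a feel for the two concepts, here's a rough sketch of registering a task definition and running it as a service with the AWS SDK for JavaScript v3—the cluster, image and names are all hypothetical:

    import { ECSClient, RegisterTaskDefinitionCommand, CreateServiceCommand } from "@aws-sdk/client-ecs";

    const ecs = new ECSClient({ region: "us-east-1" });

    // A task definition describes what to run: the container image, memory and ports.
    const registered = await ecs.send(new RegisterTaskDefinitionCommand({
      family: "web-app",
      containerDefinitions: [{
        name: "web",
        image: "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-app:latest", // hypothetical
        memory: 512,
        essential: true,
        portMappings: [{ containerPort: 8080 }],
      }],
    }));

    // A service asks ECS to keep two copies of that task running somewhere in the cluster.
    await ecs.send(new CreateServiceCommand({
      cluster: "my-cluster", // hypothetical
      serviceName: "web",
      taskDefinition: registered.taskDefinition?.taskDefinitionArn,
      desiredCount: 2,
    }));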

Kubernetes

Initially developed at Google, Kubernetes is now a very popular container orchestrator that you can run yourself on pretty much any cloud, with fully managed services provided by the big three via EKS, GKE, and AKS. Your underlying servers that provide capacity are called the cluster, and each application unit that you run is called a pod. Containers are generally considered disposable and stateless—any important data you have is not stored on the container file system itself, but in a database or some other common storage system. Kubernetes has separate provisions for pods that do in fact store information locally, which makes them stateful. As with ECS, you'll need to make sure your cluster is adequately supplied with underlying servers, and each engine will have its own methods to do this.

Servers are so PaaSé

Once we have cloud-managed orchestrators running our deployments on our server clusters, the obvious question arises—why do we need to manage these underlying servers in the first place? If these orchestrators are running on the cloud and the cloud companies are managing all these servers anyway, why not just integrate server management? That's exactly what AWS and Google have done with AWS Fargate and Google Cloud Run. These services allow you to set up your applications, specify how much CPU and RAM each container needs, and just let the cloud provider figure it out. They'll treat their entire fleet of servers as one gigantic pool or cluster and give you exactly the capacity you need. They also apply their security experience to make sure your applications are completely isolated, even if they're all technically running in the same cluster.

Fargate

AWS Fargate runs on the ECS system, where you can choose Fargate as the basis of your cluster instead of adding & managing a group of servers with the ECS agent installed. AWS will handle cluster management while still making sure that each service and task works with the same network security rules and configurations as if you were managing the underlying servers yourself. Fargate also has the option of using an interruptible cluster at a much cheaper rate. If you have applications that can tolerate their services being shut down and restarted, you can set up Fargate to use the spot market capacity provider with higher priority, and then the regular Fargate provider. The normal ECS integrations with the load-balancers and logging are available.

Cloud Run

Google goes one step further with Cloud Run, a service that fully integrates a load-balancer and router as well, so your application receives a URL and can start and stop as many containers as required, even doing a full shutdown if no requests have come in in a while. If a request comes into a Cloud Run app while no containers are running, the special load-balancer / router will hold the request open while a container is quickly started, and then pass the request into your application. This allows you to run intermittent web services without having to pay when you're not serving any requests.

Heroku

Heroku has added support for Docker as well, so it has the distinction of being a full service PaaS whether or not your application is packaged using Docker. Containers are supported as is, while any app that can be built with a buildpack and can run on Heroku's stack is supported. Heroku also adds a lot of integrated functionality using add-ons—since they're not themselves a major cloud provider they encourage integrated third-party services over trying to offer everything themselves.

Fly

There's also another interesting upstart startup that does something similar to Google's Cloud Run, called Fly—upload your container and Fly will run it for you with a built-in load-balancer and logging. Fly has the added advantage of being global—you can choose to deploy your containers in any subset of the ~20 regions Fly has all over the world. When your users make a request to an app on Fly it'll land in the closest Fly region, and Fly will service it from the nearest running container—while also examining your traffic patterns and intelligently starting and stopping containers all over the world for you. Besides the obvious HTTP traffic, Fly can also handle TCP, UDP and gRPC connections, so if your application benefits from running close to your users wherever they are, it's a great choice. Fly lets you choose the VM size your container will deploy in, and charges by the second for running VMs. It does not currently support scaling down to zero, but the base VMs are available for a very low price—and since the load-balancer is included it's pretty cheap to run an application there. Unlike other PaaS systems, Fly also offers networking services between all running instances of your container, so applications that need to communicate internally can work especially well. IPv6 & WireGuard are used to great effect to build a private network of your running containers that you can also connect to from outside of Fly.

Google App Engine

App Engine was the original Google PaaS, active well before the rest of the Google Cloud systems. App Engine supports a lot of language runtimes, as well as a Docker-based flexible environment. It comes with heavily integrated load balancing, automatic scaling up and down, datastore services, and integrated logging. It also injects useful information about clients into the incoming HTTP request headers, like approximate location, city, state and country.

FaaSt & Furious

Alongside the advances in packaging and orchestration, cloud companies have also been working, in parallel, on another way of running code—you give them the code, and they run it. That is, of course, assuming you make sure it follows a particular API, uses a particular technology stack, is packaged a particular way, and operates correctly within strict limits. If your code can twist itself into all these requirements, the cloud will do something magical with it in return: it will run your code exactly when necessary, charge you for exactly how much time it runs, and do it instantly at any level of scale you can imagine—all without you ever having to think about the word ‘server’.

The cloud companies do this by first creating a smart event router—when an event comes to the router and needs your code to process it, the router will see if your code is already running in a slot. I'm making up the word slot here—I have no idea what they're called internally. But these companies join the router with custom orchestrators that create a huge pool of secure, isolated execution slots on thousands or millions of their machines—when an event comes to the router and needs to be processed, the router looks to see if any of these slots is running your code. If none are, your code is loaded up from common network storage and initialized in a slot. The event is then passed as an input to your code, and after it finishes executing the output is routed to wherever it needs to go. Your code may be kept running in the slot for a while, in the hope that it might be used again soon, or the slot might be cleared and used with some other code.

Now you can see why the companies place such stringent restrictions on your code—because the service has no idea which slot your code is going to run in, all the slot servers have the same OS, and same short list of standardized dependencies and fixed language runtimes preinstalled. All the slots follow exactly the same input-output API, where the code must receive the input event in a certain format, and emit the output in a fixed format as well. Your code package has to be small enough to quickly load up, over the network, into any slot on any server. It needs to finish execution fast, because it's incredibly hard to fairly schedule slots across all the customers or plan capacity changes if your code is going to occupy slots indefinitely.

But this also means that you can write your code with absolutely no worry about scale or servers or anything of that sort. Your code can run as seldom as necessary, like a report once a year, or millions of times a second, like global order processing on a holiday. You'll still pay for exactly the amount of time (in milliseconds) that your code is actually executing for.

This makes for some very interesting use cases—while the benefit to code that runs intermittently is obvious, there's also a use case here for code that runs in a very spiky way—like a popular play that opens ticket sales every Thursday morning at 10AM, or a popular news outlet unsure of when something big might happen. Most applications in these scenarios need to be provisioned, or scaled up, to handle their peak traffic even when they're not actually doing so—because when they do need to suddenly handle the extra load there might not be time to add more servers.

This has also allowed a new breed of cloud-native companies to maintain so much accounting rigor that they know exactly how much it costs to do each and every one of their pieces of work across every customer—and therefore have complete and up-to-the-minute knowledge of their revenue, costs and margins at all times. The nature of FaaS billing, where you're charged exactly per operation, is particularly conducive to this—especially when paired with databases that work the same way. And they can do all this without a separate on-call ops team.

As far as performance goes, one thing to remember is that each slot can only serve one request at a time in FaaS systems, even if your language and stack are multi-threaded. The CPU that your slot runs on might have multiple cores, and you may be able to utilize all those cores while processing a request, but each slot will only process one request at a time. This isn't usually a problem—the promise of a FaaS is being able to use as many slots as you need automatically. The flip side is that each new slot coming up needs a warm-up time—you'll remember that your code might have to be loaded over the network from storage, unzipped and initialized. Some FaaS systems might offer to always keep a few slots warmed up for you for an extra charge.

Lambda

AWS Lambda is one of the most popular FaaS services. Initially the request router could only handle events from internal sources, like the AWS SQS message queue—but integration with other services like the ALB load balancer or API Gateway now allows these services to convert incoming HTTP requests into events that Lambda can work with. Once your code processes this HTTP request event, it gives out the response as event data as well, which the integrating service then converts into a HTTP response—on the connection that they've been holding all this while in a waiting state. This way your clients can't tell the difference between you serving requests off a Lambda or a regular server / container. As you can imagine, this approach only supports HTTP requests—although AWS has other related services that provide WebSocket support controlled via API calls. Lambda can keep a set number of slots warmed up at all times for an extra charge.
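Here's roughly what that looks like on the Node.js runtime—a hedged sketch assuming the function sits behind an ALB or API Gateway proxy integration, which hands the HTTP request to your code as an event object and turns the returned object back into an HTTP response:

    // The handler receives the HTTP request as an event and returns a plain object
    // that the integration converts back into an HTTP response.
    export const handler = async (event: { path?: string }) => {
      return {
        statusCode: 200,
        headers: { "content-type": "text/plain" },
        body: `you asked for ${event.path ?? "/"}\n`,
      };
    };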

Google Cloud Functions

Google Cloud Functions behave very similarly to the AWS Lambda system, but they have much better first-class integration with HTTP triggering mechanisms—there's no need to set up any other service to use them to respond to HTTP requests, like there is with Lambda.
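A sketch of an HTTP-triggered function on the Node runtime, using Google's Functions Framework—the request and response here are ordinary Express-style objects, with no event format to translate:

    import * as functions from "@google-cloud/functions-framework";

    // Register an HTTP handler named "hello"; Cloud Functions routes requests straight to it.
    functions.http("hello", (req, res) => {
      res.send(`hello, ${req.query.name ?? "world"}\n`);
    });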

Azure has Functions too.

Cloudflare Workers

Besides being an excellent DNS and CDN service, Cloudflare also recently introduced a global compute service, called Workers. Since they already have servers deployed in hundreds of cities worldwide for the CDN and DNS, it makes sense to offer them up for computing work as well. The normal approach is useless here, though—when running a large number of deployments, each individual deployment of servers isn't very big—definitely not a full data-center. This means that giving customers the option to run full-blown applications or containers isn't feasible—there simply wouldn't be enough CPU, RAM and performance for everyone. What Cloudflare did instead was use the V8 Engine that powers the Chrome browser and NodeJS. V8 has a way to run arbitrary code in a secure sandbox called an isolate. Using this feature allows Cloudflare to run compute workers for many customers without paying a process or memory penalty. Isolates are also extremely fast, and since the code is just JavaScript, or other languages compiled to WASM, loading it up is very quick—which is probably helped even more by Cloudflare likely holding all code in the same fast storage as CDN data. Cloudflare also mitigates the event format problem by using the standard Service Worker API. This means that if you write JS (or compile WASM) code implementing the Service Worker API, you can upload it to Cloudflare, and they'll store a copy in all of their edges all over the world. When a request comes in, it's serviced by the edge closest to the user, and they can load up your code and execute it to return responses directly from that edge. This helps you write applications that can return responses within low tens of milliseconds to almost anywhere in the world, which is impossible in a regular single deployment because the speed of light is so slow.
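A minimal Worker using the Service Worker API described above—it listens for fetch events at whichever edge the request landed on and responds directly from there:

    // Respond to every request from the edge, with no origin server involved.
    addEventListener("fetch", (event: any /* FetchEvent, from the Workers runtime types */) => {
      event.respondWith(
        new Response("hello from the nearest edge\n", {
          headers: { "content-type": "text/plain" },
        })
      );
    });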

Downside Mitigations

There are three big downsides to FaaS systems, the first of which is the warm-up time. Lambda has options available to keep warm slots running at all times, and there's constant work going on in every service to cut down the startup and initialization times. Any conversation about performance is probably going to be out-of-date immediately, because the companies are working every day to make things better. The good news is that this problem will get smaller and smaller automatically as the companies figure things out.

The second problem is that of the limited stack and dependencies available. Lambda has a lot more options here, with the Layers feature allowing you to add arbitrary binaries into your deployment, helping support languages and runtimes not on the official list. Lambda is also going one step further with support for containers—subject to choosing one of the predefined base images. It's possible Google might do something similar, but this is probably about as far as the technology can be pushed, given its fundamental conceptual limits.

The third problem is the need for a custom event format. If you're writing new code from scratch and only want to run it on your FaaS of choice, this might be fine. But if you're more used to stand-alone applications, or want to also have the option of a regular deployment off the same codebase, things get a bit more complicated. If you're using a language like Ruby, your application likely runs on the Rack protocol. This allows a tool like Lamby to work as the Lambda event handler, internally converting it into a Rack request, having your application handle it, and converting the outgoing Rack response into the Lambda response event format. If you're stuck with a tech stack that can't do this—where the HTTP server is part of the application itself—a tool like Up can work—it runs your application as a process and places itself as the event handler. It then converts the event into an actual HTTP request and sends it into your application, doing the reverse with the response. It adds a very small latency to every request, but that's so small it's hard to measure. The idea of converting a HTTP request to a FaaS event and then to a HTTP request again, handling it with the application, and then converting the HTTP response into a FaaS event and then back to a HTTP response again might sound weird, but if you want to do it you can. But none of this will work for non-HTTP protocols, of course.
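To make that round trip less abstract, here's a rough sketch of the trick these tools use, assuming an API Gateway-style proxy event and an application already listening on localhost port 3000 (both hypothetical): the event is turned back into a real HTTP request, sent to the app, and the app's response is folded into the response format Lambda expects:

    import * as http from "node:http";

    // Convert the incoming FaaS event into a local HTTP request, and the local
    // HTTP response back into the event format the FaaS expects.
    export const handler = (event: any) =>
      new Promise((resolve) => {
        const req = http.request(
          {
            host: "127.0.0.1",
            port: 3000, // where the real application process is listening (hypothetical)
            path: event.path,
            method: event.httpMethod,
            headers: event.headers,
          },
          (res) => {
            let body = "";
            res.on("data", (chunk) => (body += chunk));
            res.on("end", () => resolve({ statusCode: res.statusCode, headers: res.headers, body }));
          }
        );
        if (event.body) req.write(event.body);
        req.end();
      });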

More Money, Less Problems

There's a lot of discussion, and often arguments, happening over which kind of system is the most cost effective way to run applications. The sage advice, as always, is “it depends”. The problem with trying to find the answer to this question is that saying “Lambda is cheaper than running your own servers” and “running your own servers is cheaper than Lambda” are both completely true—depending on what kind of application and request load you're looking at.

One way to think about this is to understand that cost comparison is not a single-dimensional problem—it's not even two-dimensional, it's always multi-dimensional. Let's say you draw a graph of cost of service vs requests-per-second load, for an IaaS system and a FaaS. Immediately you'll see that the cost line of an IaaS starts above zero and goes up in steps (as you need to add more servers one by one), and the FaaS line starts at zero and goes up perfectly linearly, with a constant slope. Check out this amazingly beautiful and accurate drawing below, that I made myself:

   ▲
 ┌─┴──┐
 │Cost│                                                        *
 └─┬──┘                                                    ***       ┌────┐
   │                                   ┌─────────────────**──────────┤IaaS├─
   │                                   │             ***             └────┘
   │                                   │          ***
   │                                   │      ***
   │                                   │  ***
   │                                   │***
   │                               ****│
   │                           *****   │
   │                        ****       │
   ├────────────────────****───────────┘
   │              *****
   │       ┌────┐**
   │       │FaaS│
   │   ****└────┘
   │ ***                                         ┌──────────────────────┐
   *─────────────────────────────────────────────┤ Requests per second  ├─▶
                                                 └──────────────────────┘

It's immediately obvious that the FaaS or IaaS can both be cheaper, depending on what request-per-second value you're looking at. And keep in mind that this is just two dimensions—requests-per-second is plotted on the X-axis here, but how do you plot the ability to scale to millions of requests when there's an unexpected breaking news event, or a flash sale? How do you plot not having to wake up at two in the morning, when you've just put the baby to sleep, to clear logs because one of the cron jobs didn't run often enough and one server out of fifteen has a full disk?

It's not even possible to start a conversation unless you understand the (multiple) cost graphs for each kind of system, plotted against each of the X-axes that you care about. There are some very rough guidelines you can use, though:

On an IaaS, you'll see costs rise in steps, as you add each server. Each server is a discrete and atomic chunk of cost, and whether you enable auto-scaling or not, you'll move up and down in steps based on your chosen server size. This means that you'll always have some wastage—the capacity under the steps that you're paying for even when you don't need it. Also, auto-scaling systems will almost always scale linearly, even if the request load has suddenly jumped exponentially, so if you have a spiky workload you'll often miss serving some requests as well. You also have a lot of work to do in setup, ops and maintenance, so you'll need to account for that too, in terms of salaries, time, effort, opportunity cost and hiring difficulty.

On a PaaS, you'll see a lot of similar costs—the absolute dollar values will likely be higher, so whether this is worth it for you depends on how well you factored in the other dimensions on the IaaS line. If you factor in dimensions like your time, the ops team salaries and generally having a life, it becomes easier to judge accurately. PaaS costs tend to be very similar in shape to IaaS costs, but higher because they handle more of the setup, maintenance and ops work for you.

FaaS pricing lines will tend to be very different, and will exactly track your X-axis values, whether you've chosen to plot against requests-per-second, bytes-transferred, CPU-used, or anything else. One distinctive feature is that they start at zero—so for very small values on the X-axis a FaaS will almost always be cheaper. FaaS systems can also handle very heavy load almost instantaneously, which depending on your use case may be priceless, and impossible to achieve with an IaaS or PaaS. They also have the advantage of being completely maintenance-free, with the providing cloud company also constantly improving service quality. But if you have a steady load, and it's always higher than the baselines required to support a basic IaaS or PaaS deployment (load balancer monthly fees + 2 servers), maybe a FaaS isn't a good idea—but depending on how highly you value ease-of-use it might still be.
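If you want to play with the shape of those two curves yourself, here's a toy model—every number in it is made up purely for illustration, so plug in your own prices and capacities:

    // Hypothetical prices and capacities, purely to show the step-vs-line shape.
    const serverPricePerMonth = 50;          // flat cost per server
    const requestsPerSecondPerServer = 200;  // what one server can comfortably handle
    const faasPricePerMillionRequests = 5;   // FaaS cost per million requests
    const secondsPerMonth = 60 * 60 * 24 * 30;

    // IaaS/PaaS: cost rises in whole-server steps, and never drops below one server.
    const iaasCost = (rps: number) =>
      Math.max(1, Math.ceil(rps / requestsPerSecondPerServer)) * serverPricePerMonth;

    // FaaS: cost starts at zero and tracks the request volume exactly.
    const faasCost = (rps: number) =>
      ((rps * secondsPerMonth) / 1_000_000) * faasPricePerMillionRequests;

    for (const rps of [0.1, 1, 10, 100, 1000]) {
      console.log(`${rps} rps → IaaS $${iaasCost(rps).toFixed(2)} vs FaaS $${faasCost(rps).toFixed(2)}`);
    }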

So what's the best option? As always, it depends. There are no answers anywhere, other than at the end of your own evaluations. You can contact me @sudhirj if you have a specific application you want an opinion for, or for any other questions, mistakes or disagreements.