Simple AWS Spot Pricing Infrastructure is Rarely Possible

There are three ways to pay for AWS instances. The first, and the most expensive, is on-demand: ask for an instance and you get an instance. The cost per unit time is much higher than that of most other cloud services. The second approach is to purchase blocks of reserved instances, where the cost is comparable to the on-demand pricing of other cloud services. This should be the first choice for anyone in the fortunate position of being able to accurately predict usage over the next 12 to 36 months. The third option is to bid for spot instances: spare capacity that is auctioned from moment to moment to all bidders. When capacity is available at less than your bid price, the instances you request will run. When the price rises above your bid, those instances will shut down.

In practice most organizations have only a vague idea as to what their usage will be a few years from now, and always have unexpected needs on a day-to-day basis. Thus AWS infrastructure typically mixes all three approaches to payment: a core of reserved instances to serve a best guess at future capacity without waste, on-demand usage for emergencies and small experiments, and spot pricing wherever it can be made to fit. All AWS spot pricing strategies are easy to describe: try to pay less for the computational resources desired. The actual implementation is almost never as simple as setting a bid price and walking away, however. Consider the following types of constraint on an application that runs on one or more instances under one or more Auto Scaling Groups:

Must serve at least X requests per second.
Must process at least Y files per hour.
Must execute a batch task at a specific time every day.
Must hold specific data in memory for the foreseeable future.

The commonality here is the requirement that a certain number of instances must be running at a given time. If paying via spot pricing, there is no guarantee that this will happen. Instances will be taken away as soon as the spot price rises higher than the bid. The spot price market is not a true market, in that both the bid and the price are capped at ten times the on-demand price. As a result the market activity is highly degenerate - either the price is low, or the price is enormously high, and it rarely spends much time between those two extremes. Setting the bid to the maximum to ensure uptime also tends to ensure that the cost is higher than simply paying for on-demand instances; the amount saved during the periods of low spot prices will be more than offset by the periods of very high prices.
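To put rough numbers on that last claim, here is a minimal sketch of the arithmetic. It assumes the two-level price behavior described above: the spot price sits either at a low level or at the ten times on-demand cap. The low level of 20% of on-demand, the 10% share of time spent at the cap, and the function names are all made-up figures for illustration, not measured AWS data.

```python
def average_max_bid_cost(on_demand_price, low_fraction, high_multiple, hot_share):
    """Average hourly cost of an instance that bids the maximum and so never stops.

    low_fraction:  spot price in quiet periods, as a fraction of on-demand (assumed)
    high_multiple: spot price in hot periods, as a multiple of on-demand
    hot_share:     fraction of time the market spends in the hot state (assumed)
    """
    low_price = on_demand_price * low_fraction
    high_price = on_demand_price * high_multiple
    return (1 - hot_share) * low_price + hot_share * high_price


def break_even_hot_share(low_fraction, high_multiple):
    """Share of time at the high price beyond which on-demand is cheaper."""
    return (1 - low_fraction) / (high_multiple - low_fraction)


# Hypothetical figures: on-demand at 1.00 per hour, quiet spot price at 20% of
# that, hot spot price at the 10x cap, and the market hot 10% of the time.
cost = average_max_bid_cost(1.00, 0.2, 10.0, 0.10)
limit = break_even_hot_share(0.2, 10.0)
print(f"average cost per hour at maximum bid: {cost:.2f}")
print(f"on-demand is cheaper once the market is hot more than {limit:.1%} of the time")
```

Under these assumed figures the maximum bid averages out to more than the on-demand price, and it only takes the market running hot for a little over 8% of the time for that to be the case.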

The Hypothetical Unconstrained Application

The only type of application that can benefit from straightforward spot pricing for all of its instances is one that is totally unconstrained. In other words, one for which it is unimportant when it runs, when it completes its tasks, or whether it is running at any given moment in time. There are applications that fit this description, but they are few and far between. Sooner or later the spot price market for a given instance type will run hot for longer than an organization is willing to put up with the spot priced application being offline. An application exists because it must achieve some goal - otherwise why would an organization stand it up in the first place? Those goals nearly always come accompanied by one or more of the constraints on time and activity that make it hard to run under spot pricing alone.

Highly Asynchronous Applications

Many forms of application have only weak constraints on when exactly they must complete a given task. Consider queue processing for static content generation for a very large number of files that must be updated on a daily basis, for example. The application must accomplish a great deal of work, considered in total, but it doesn't really matter how long it takes to process any particular update so long as everything does in fact complete on a daily basis. Thus it is possible for instances to come and go according to circumstances; a ten minute gap doesn't much matter. This sort of application is well suited to running as a split of one spot price and one on-demand Auto Scaling Group, with a suitable metric and scaling policies to scale up the on-demand ASG when the spot price ASG loses instances, and scale it back down when spot price instances fire up again.
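The scaling logic for that split can be sketched in a few lines. This is only the decision arithmetic, under the assumption that something else (a custom metric, a scheduled job) supplies the inputs and applies the result to the on-demand ASG. The function names, the per-instance throughput, and the cap on on-demand instances are hypothetical.

```python
import math


def instances_needed(files_remaining, files_per_instance_hour, hours_remaining):
    """Smallest instance count that clears the remaining work before the daily deadline."""
    if files_remaining <= 0:
        return 0
    if hours_remaining <= 0:
        raise ValueError("deadline already passed with work outstanding")
    return math.ceil(files_remaining / (files_per_instance_hour * hours_remaining))


def on_demand_desired(target_total, spot_running, on_demand_max):
    """On-demand instances needed to cover whatever the spot price ASG is not providing."""
    shortfall = max(0, target_total - spot_running)
    return min(shortfall, on_demand_max)


# Hypothetical day: 120,000 files left, 1,000 files per instance per hour,
# 12 hours until everything must be done.
target = instances_needed(120_000, 1_000, 12)
print(target)                              # instances required in total
print(on_demand_desired(target, 7, 20))    # spot ASG lost some instances
print(on_demand_desired(target, 12, 20))   # spot ASG is covering everything
```

Because the deadline is daily rather than per file, the target can be recomputed every few minutes from the remaining backlog, which is what lets a ten minute gap pass without consequence.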

Real-time, Synchronous, or Other Time-Constrained Applications

Applications that must serve content constantly or must be running at a specific time all have some variant on the constraint that at least X instances of type Y must exist at a given moment. There is no easy or straightforward way to involve spot pricing in the core set of instances that are required for the application to meet its demands. Any attempt to do this will result in either intermittent outages or sometimes paying very high spot price costs - pick one.

It is, however, possible to accept the risk of spot pricing for spare capacity that is kept around to handle sudden demand surges, those that occur too rapidly for auto scaling to produce a new instance in time. AWS isn't fast, and it can take five to ten minutes to bring new instances online even when images are used. The spare capacity could consist, as above, of a pair of ASGs, one on-demand and one spot priced, with a metric set to scale up the on-demand instances when spot price instances terminate due to price increases. In this case, the period of risk is cut down to the five to ten minutes of time needed to bring up new instances, and that can be balanced against the likelihood of a surge in usage occurring in those gaps of coverage. Whether or not this is an acceptable trade-off is very dependent on the organization and application. In my experience, it would have to be an extremely high volume application for this to save enough money to be worth the trouble.
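One way to reason about that trade-off is to estimate how likely it is that a surge lands in one of the gaps. The sketch below assumes surges arrive independently at a constant average rate (a Poisson process), which is a simplification: real surges may well correlate with the same events that drive spot prices up. The rates and counts used are invented for illustration.

```python
import math


def probability_surge_in_gap(surges_per_hour, interruptions, gap_minutes):
    """Chance that at least one surge arrives while spare capacity is missing.

    Assumes Poisson surge arrivals and that each spot interruption leaves
    a coverage gap of gap_minutes before on-demand replacements are online.
    """
    exposed_hours = interruptions * gap_minutes / 60
    return 1 - math.exp(-surges_per_hour * exposed_hours)


# Hypothetical month: about one surge every 100 hours, 20 spot interruptions,
# and a 10 minute gap after each one.
p = probability_surge_in_gap(0.01, 20, 10)
print(f"chance of an uncovered surge this month: {p:.1%}")
```

With these made-up inputs the chance is a few percent per month. Whether that is tolerable is exactly the organization-specific judgment described above.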

Nothing is Ever Simple, However

If there is a way for spot price fluctuations over time to cause havoc and extra cost, then be assured that it will happen. Consider even a simple queue processing application with a spot price and an on-demand ASG, coupled to scaling policies to run on-demand instances to pick up the slack when there are no spot price instances. What happens when the current spot price oscillates around the bid? Spot price instances will be created and terminated rapidly, and thus on-demand instances will be created and destroyed in response, and that will happen just as rapidly as the scaling policy allows for. Whenever an on-demand instance is launched, the account is charged for an hour of use regardless of whether or not it is then immediately terminated. Thus it is quite possible to burn more money than is saved for certain classes of spot price behavior - something that is a common theme throughout this post.
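The cost of that thrashing is easy to illustrate. The sketch below assumes the billing rule described above: every on-demand launch is charged for at least one full hour, with partial hours rounded up. The run lengths are hypothetical.

```python
import math


def billed_hours(run_minutes):
    """Total hours charged for a series of on-demand launches.

    Each launch is billed for at least one hour, and partial hours round up.
    """
    return sum(max(1, math.ceil(minutes / 60)) for minutes in run_minutes)


# Spot price oscillating around the bid: six on-demand launches in one hour,
# each terminated after five minutes when spot instances come back.
thrashing = billed_hours([5] * 6)

# The alternative: one on-demand instance left running for the whole hour.
steady = billed_hours([60])

print(thrashing, steady)
```

Six five-minute runs provide half an hour of actual compute but are billed as six hours, while simply leaving one on-demand instance up for the hour costs one.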

A sensible response to this is to write the code needed to create a custom metric or scaling administrator application that is better capable of ironing out the edge cases likely to cause thrashing in on-demand instances. This is a rabbit hole of potentially infinite depth, given the nature of the system. Another approach is to create monitors and alerts, and assign the necessary support personnel to keep a close enough eye on things to adjust the bid as necessary whenever a problem of this nature occurs. In either case, the system as a whole becomes something other than simple and straightforward. The goal of an automated spot pricing strategy that runs itself and never gets into trouble is probably a mirage for most applications in most organizations, and the only real question is the degree to which the outcome is automation versus human intervention.
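As a taste of what the first step down that rabbit hole might look like, here is a minimal sketch of a damped scaling decision. It scales on-demand capacity up immediately when spot instances vanish, but refuses to scale back down until the on-demand instances have been held for most of the hour that has already been paid for. The class name, the 55 minute hold, and the idea of feeding it timestamps are all assumptions for illustration, and it ignores many of the edge cases a real implementation would run into.

```python
from dataclasses import dataclass


@dataclass
class DampedScaler:
    """Decides on-demand capacity, holding scale-downs to avoid thrashing."""

    target_total: int
    hold_seconds: int = 3300              # hypothetical: 55 minutes of the billed hour
    on_demand: int = 0
    last_scale_up: float = float("-inf")

    def decide(self, spot_running: int, now: float) -> int:
        """Return the on-demand capacity to request at time `now` (in seconds)."""
        shortfall = max(0, self.target_total - spot_running)
        if shortfall > self.on_demand:
            # Spot capacity dropped: cover the gap immediately.
            self.on_demand = shortfall
            self.last_scale_up = now
        elif shortfall < self.on_demand and now - self.last_scale_up >= self.hold_seconds:
            # Spot capacity is back and the hold period has elapsed.
            self.on_demand = shortfall
        return self.on_demand


scaler = DampedScaler(target_total=4)
print(scaler.decide(spot_running=1, now=60))     # spot lost: scale up
print(scaler.decide(spot_running=4, now=120))    # spot back, but hold
print(scaler.decide(spot_running=4, now=3360))   # hold elapsed: scale down
```

Even this simple version raises questions: any new scale-up resets the hold for every on-demand instance, and spot and on-demand instances now run side by side during the hold, which costs money of its own. That is the nature of the rabbit hole.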