The ominous opacity of the AWS bill - a cautionary tale

Frederico Ramos

Frederico has nearly 20 years of experience as a software developer and software architect. Fred has considerable exposure to Drools and rules engines.

January 15, 2019

This is a Taloflow engineer’s real account of attempting to provision a bill for a client. Does it ring a bell?

First, some background:

We were only in the first week of the month-long billing period for our client’s AWS account. Already, the account showed that they had exceeded the free-tier limit for SQS and had nearly exceeded it for CloudWatch too (approximately 85 per cent used). This was puzzling, because we hadn't run any data downloads for the client at all. In fact, all services had been down since before Christmas, when we shut them down to work on new server CloudFormation scripts.

The facts:

Naturally, we got a message from our client asking how we could have reached the free-tier limit so soon. I went into the AWS console to find out what I could. First, the billing dashboard showed no details on usage. I had no information on how many SQS messages had been processed so far this billing period, and the situation was similar with CloudWatch. The only thing posted was a notification that the limit had indeed been exceeded, along with an estimate of how far over the limit the client would be by the end of the billing period at the same usage pace. In this instance, it predicted that the client would use more than three times the free-tier limit for CloudWatch alone. Next, I tried to check the metrics. However, I could not find any metrics for CloudWatch itself, so I could not review the number of events or logs that had been ingested so far. On SQS, I could only see a handful of messages, which, all added up, should still have been well below the one-million free-tier limit. At this point, I was very unsatisfied with the sparse details, so I sent a message directly to the AWS support team.
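To see how an estimate "at the same usage pace" can land above three times the limit so early in the month, here is a minimal sketch of a straight-line projection. The 7 elapsed days and the 31-day period are assumptions taken from "first week" and a January billing period, and `project_usage` is a hypothetical helper. I do not know how AWS actually computes its forecast.

```python
def project_usage(used_fraction: float, days_elapsed: int, days_in_period: int) -> float:
    """Project end-of-period usage as a multiple of the free-tier limit.

    Assumes usage keeps growing at the same average daily pace seen so far
    (a simple linear extrapolation, not necessarily what AWS does).
    """
    if days_elapsed <= 0 or days_in_period < days_elapsed:
        raise ValueError("days_elapsed must be between 1 and days_in_period")
    daily_pace = used_fraction / days_elapsed
    return daily_pace * days_in_period


# Assumed inputs: 85% of the limit used after 7 days of a 31-day period.
projected = project_usage(0.85, 7, 31)
print(f"Projected usage: {projected:.2f}x the free-tier limit")
```

With those assumed inputs the projection comes out at roughly 3.8 times the limit, which is in line with the "more than three times" warning shown in the console.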

My initial request:

“...I checked our account billing info today and I can see that we exceeded the free-tier on number of requests for SQS and CloudWatch log ingestion. But we haven't been running our application since around December 20, so I cannot understand how we could have ever got that far on usage for those two services. Where can I find details on that usage? For SQS, I tried CloudWatch metrics but couldn’t find the message count for the queues. On CloudWatch, I tried the ‘Insights’ feature but I could only account for a handful (fewer than 100) of log events for the past week.”

I received a prompt reply. Unfortunately, it did not answer my questions, but it did shed some light on how difficult this problem is to solve: the AWS support team itself does not have the tools to extract the information. In other words, the first line of defense, internal to Amazon, has no observability. Which leaves me wondering: what hope is there for my client?
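For anyone who wants to attempt the same per-queue lookup programmatically, the sketch below sums the daily totals of a few SQS metrics through the CloudWatch `get_metric_statistics` call in boto3. The helper names and the queue name in the usage note are hypothetical. These metrics describe queue activity and are not the billed request count, so treat the result as a rough cross-check rather than a reconstruction of the bill.

```python
from datetime import datetime

# Metrics published in the AWS/SQS namespace that reflect queue activity.
METRICS = (
    "NumberOfMessagesSent",
    "NumberOfMessagesReceived",
    "NumberOfMessagesDeleted",
    "NumberOfEmptyReceives",
)


def sum_sqs_metric(cloudwatch, queue_name: str, metric_name: str,
                   start: datetime, end: datetime) -> float:
    """Sum one SQS metric for one queue over [start, end) using daily buckets."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName=metric_name,
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=start,
        EndTime=end,
        Period=86400,          # one datapoint per day
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response.get("Datapoints", []))


def queue_activity(cloudwatch, queue_name: str,
                   start: datetime, end: datetime) -> dict:
    """Return the summed value of each metric in METRICS for the queue."""
    return {
        name: sum_sqs_metric(cloudwatch, queue_name, name, start, end)
        for name in METRICS
    }
```

A possible way to call it, assuming boto3 is installed and credentials are configured: create the client with `boto3.client("cloudwatch")`, pick a start and end `datetime` covering the billing period so far, and call `queue_activity(client, "my-queue", start, end)`, where `my-queue` is a placeholder name.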

Bottom line:

As a customer, it is hard, if not impossible, to be sure that the billing information on AWS is accurate. The information is there, but inaccessible. I imagine that AWS support will have to ask the tech team to run scripts on the logs to collect the information needed to support the numbers presented on the bill. I am not sure how frequently the warning messages on the AWS console are updated, but the situation certainly demonstrates the need for real-time cost analysis. Once Taloflow is set up for this client, I will have immediate access to the cost evolution for these services. Taloflow can warn the customer about approaching free-tier limits much faster than AWS does, and it will have accurate numbers to support such a warning. Here is our documentation on Tim, the Taloflow Instance Manager.