Last month, our team published a blog post titled "How we reduced the AWS costs of our streaming data pipeline by 67%", which went viral on Hacker News (Top 5). Clearly, developers are hungry to learn about new AWS cost-saving strategies.
We’ve had a lot of questions about AWS cost optimization stemming from the original post. However, this question from Carl at Klarna inspired us to write another post:
Hi, Louis, Can't you write another blog article on how you moved Glue jobs to ECS? Would be interesting to read since we are heavy glue users at Klarna.
So without further delay, here’s how we approached the problem and made the switch.
For some quick context, we provide an AWS cost optimization solution that monitors and alerts on cost anomalies and wasted spend in real-time.
Initially, we had several benefits in mind that, we felt, justified the cost of AWS Glue.
In typical AWS fashion, several issues led to us spending several times more than expected, all while getting less out of AWS Glue than we had hoped. Considering the work the pipeline was doing, here are some of the reasons we decided it wasn't worth the cost:
Testing required running the test gateways and development endpoints. It was easier with the integrated SageMaker notebooks, but both were expensive and frequently left running by accident by our developers. 😴
Spark was, in some sense, "bastardized". There were things we simply couldn't do in AWS Glue, and we wasted a lot of time working around inherent limitations that drove up costs. For example, overriding parts of the Parquet implementation is easy in plain Spark but proved difficult in AWS Glue.
Crawling was deceptively expensive. We had to access the data frequently, so crawling was almost constant, and it took much longer to crawl our directories than we anticipated, partly because of the large number of files being created (the Parquet issues again). Our partitioning decisions also had complicated, hard-to-predict effects on our AWS costs.
There was often a long lag between requesting a job start and the job actually launching. The eventing mechanism rarely carried enough information to manage a business process (e.g., a bare "crawler started" event), and we sometimes overran our jobs because we could not tell which clients they were running for.
We were also running a Flink pipeline, mostly for historical reasons. In the early days of Taloflow, we had set up Flink for our streaming data, originally on Google Cloud Platform. When we started moving things over to AWS, we began with an AWS Glue pipeline because it frankly seemed easier at the time given our limited resources. We succumbed to AWS's siren song of "let us manage everything for you!"
I am sure many of the above issues could have been solved if we had kept pounding away at the AWS Glue learning curve. However, we began to ask ourselves what the point of going through that exercise really was.
Spark or Glue experts may say that if we had tweaked and optimized the pipelines in Glue we would have seen dramatic savings. Yes, but the point is that we reviewed things within a "total cost" framework, which includes a realistic look at the skill sets available on our dev team. We didn't have the bandwidth to carry specialists, and we didn't want to hire an outside specialist given how important the pipeline is to our core business.
We reviewed how much memory the jobs actually consumed while running in AWS Glue and did some calculations on our data flow. The largest Fargate instance allows for 30GB of memory, so we knew that if we were going to move to AWS Fargate, we had to fit within that limit.
We reviewed how long the jobs took to run with parallelism restricted to 4 vCPUs, the maximum available on AWS Fargate. These jobs tend to run on a schedule at regular intervals throughout the day rather than on demand, so throughput time was not critical for us as long as it stayed within our Quality of Service (QoS) goal. We established a guideline run time for our largest customer and used it as an outside estimate and test for our QoS.
We knew that we were not going to try to run Spark on Fargate. Has anyone tried?
We decided that some of the work would move to a Flink pipeline we were already running. It had some unused cycles and had recently been set up to scale up and down on AWS Spot Instances. Moving these pipelines was straightforward: we converted from Spark to Flink using essentially the same flow and logic.
A second set of pipelines was more computationally challenging and had been a struggle in Spark. We decided to move them to Python and Pandas using various parallelization techniques. Even though Python runs a bit slower, we felt it was fast enough for us, and since we already had a lot of Python in our data science loop, we had plenty of resources to draw from to get the port done quickly.
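As a rough illustration of what we mean by parallelization (a minimal sketch, not our production code; `process_chunk` and the column names are hypothetical), the per-client work can be split into chunks and fanned out with Python's standard `multiprocessing` module:

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical per-chunk transformation standing in for the real cost logic.
    return chunk.groupby("usage_type", as_index=False)["cost"].sum()


def process_client(df: pd.DataFrame, workers: int = 4) -> pd.DataFrame:
    """Fan a client's records out over `workers` processes, then merge the results."""
    chunks = [c for c in np.array_split(df, workers) if not c.empty]
    with mp.Pool(processes=workers) as pool:
        partials = pool.map(process_chunk, chunks)
    # Re-aggregate because the same usage_type can appear in more than one chunk.
    return (
        pd.concat(partials, ignore_index=True)
        .groupby("usage_type", as_index=False)["cost"]
        .sum()
    )


if __name__ == "__main__":  # guard required for multiprocessing on macOS/Windows
    demo = pd.DataFrame(
        {"usage_type": ["BoxUsage", "BoxUsage", "DataTransfer"], "cost": [1.0, 2.0, 0.5]}
    )
    print(process_client(demo, workers=2))
```

The default of 4 workers lines up with the 4 vCPU ceiling on Fargate mentioned above.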
There were lots of other good choices for the pipeline. It depends on the use case and organizational knowledge.
As we ported over, we reworked the pipelines to be size-sensitive. For example, the Spark pipeline would work on the entire dataset at once for a particular client, including every AWS service code monitored by our AWS cost management service.
We added an inner loop so that the Python works on one AWS service code (e.g., Amazon EC2) at a time, and we made sure memory gets cleared between service codes. With these relatively minor rewrites, we dramatically reduced memory consumption and fit on the 30GB Fargate instance! It did take several iterations to be 100% certain that all clients would safely fit.
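Here is a minimal sketch of that inner-loop idea; `load_usage`, `analyze`, and `persist` are placeholders for our actual steps, not real functions from our codebase:

```python
import gc
from typing import Callable, Iterable


def run_client_pipeline(
    client_id: str,
    service_codes: Iterable[str],
    load_usage: Callable,   # placeholder: loads one service code's records as a DataFrame
    analyze: Callable,      # placeholder: the actual cost/anomaly logic
    persist: Callable,      # placeholder: writes results out (e.g. to a database)
) -> None:
    """Process one AWS service code at a time so peak memory stays bounded."""
    for service_code in service_codes:
        df = load_usage(client_id, service_code)
        results = analyze(df)
        persist(client_id, service_code, results)

        # Drop references and force a collection so the next service code
        # starts from a clean slate before the next chunk of data is loaded.
        del df, results
        gc.collect()
```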
One benefit we picked up in the move is a much better alerting and monitoring flow. We can alert at a much finer level and are no longer restricted to AWS Glue's error codes and eventing mechanism. Logging that used to live in AWS Glue moved to CloudWatch Logs, where our other systems pick it up; we had been having issues with the AWS Glue logs in the past.
For the most part, we launch AWS Fargate instances from Lambda scripts triggered by SQS queues and CloudWatch events. We recently also began triggering Fargate via the SDK from a Business Process Model and Notation (BPMN) process server that we run.
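As a reference point, a stripped-down Lambda handler for this pattern might look like the sketch below. The cluster, task definition, subnet, and container names come from environment variables and are placeholders, not values from our actual setup:

```python
import json
import os

import boto3

ecs = boto3.client("ecs")


def handler(event, context):
    """Launch one Fargate task per SQS message (all names are placeholders)."""
    for record in event["Records"]:
        body = json.loads(record["body"])  # e.g. {"client_id": "..."}
        ecs.run_task(
            cluster=os.environ["CLUSTER_NAME"],
            launchType="FARGATE",
            taskDefinition=os.environ["TASK_DEFINITION"],
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": os.environ["SUBNET_IDS"].split(","),
                    "assignPublicIp": "DISABLED",
                }
            },
            overrides={
                "containerOverrides": [
                    {
                        "name": os.environ["CONTAINER_NAME"],
                        "environment": [
                            {"name": "CLIENT_ID", "value": body["client_id"]},
                        ],
                    }
                ]
            },
        )
```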
As an interim step, we wrote a simple "caching-launching" service that makes sure our pipeline steps are launched and monitored using a combination of the mechanisms described above.
We will move a bunch of this logic to the BPMN processes over time for visibility and easier maintenance.
We did write a simple utility that cleans up garbage AWS Fargate instances: for whatever reason, some fail or get caught between deploys. Luckily, our own service alerted us the first time this happened, when we found ourselves running a few hundred zombie Fargate instances, and we were able to address it quickly. 🧟
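The utility boils down to something like this sketch; the six-hour threshold and the stop reason are illustrative choices, not necessarily what we run in production:

```python
from datetime import datetime, timedelta, timezone

import boto3

ecs = boto3.client("ecs")
MAX_AGE = timedelta(hours=6)  # illustrative: anything older than our longest expected job


def stop_zombie_tasks(cluster: str) -> None:
    """Stop RUNNING Fargate tasks that have outlived the expected job duration."""
    task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="RUNNING").get("taskArns", [])
    if not task_arns:
        return
    for task in ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]:
        started_at = task.get("startedAt")
        if started_at and datetime.now(timezone.utc) - started_at > MAX_AGE:
            ecs.stop_task(
                cluster=cluster,
                task=task["taskArn"],
                reason="zombie task swept by cleanup utility",
            )
```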
Refactoring and running our jobs as Amazon ECS tasks on AWS Fargate costs us about one-third (roughly 67% less) of what the AWS Glue runs had cost.
Thank you for the great question that inspired this post, Carl. We invite anyone to email us with questions at [email protected] - it might inspire our next post!