The Challenge
The client’s business centered on crawling the web to collect data about millions of products. They used a custom-built crawler that saved every crawled web page into a MongoDB database. That database had grown to multiple terabytes and was costing the client close to six figures for their managed MongoDB instance.
The Solution
The client hired us to re-architect their data storage into a better solution that would reduce cost while preserving the ease of use they had with MongoDB.
After evaluating Delta Lake as a possible alternative, we built a data lake on Apache Hudi. The crawler ingested data at roughly 1 TB per hour, and that data flowed through a queue before being persisted to MongoDB.
Our Approach
Using S3 Storage: Our first step was to fork the data in the queue and persist a copy in an S3 bucket. S3 is an inexpensive way to store large amounts of data, especially compared to MongoDB, where storage and compute are coupled together. With the data in S3, the cost and resources needed to store the data are separated from the cost and resources needed to process it.
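As a rough illustration of the forking step, a dedicated queue consumer can copy each crawled page to S3 while the existing consumer continues writing to MongoDB. The sketch below assumes an SQS queue, a fanned-out copy of the messages for this consumer, and hypothetical bucket and key names; the client’s actual queue technology and message format differed in detail.

```python
# Minimal sketch of forking crawled pages from a queue into S3.
# Queue URL, bucket, and key layout are hypothetical.
import uuid
from datetime import datetime, timezone

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawler-pages-copy"  # hypothetical
RAW_BUCKET = "example-crawler-raw"  # hypothetical

def fork_messages_to_s3():
    # Long-poll the queue for a batch of crawled pages.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10
    )
    for msg in resp.get("Messages", []):
        # Partition raw objects by crawl date so downstream jobs can
        # process only the latest interval.
        day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        key = f"raw/pages/dt={day}/{uuid.uuid4()}.json"
        s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=msg["Body"].encode("utf-8"))
        # This consumer owns its own copy of the queue, so acknowledging
        # the message here does not affect the MongoDB path.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```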
Using Apache Hudi: We then used Apache Hudi to define a schema for storing all of the scraped information. This alone was an improvement: MongoDB had been so expensive that data was being transformed in the queue to reduce its dimensionality and size before it was ever stored.
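To make this concrete, a raw Hudi table can be written from PySpark with a record key and a precombine field so repeated crawls of the same page deduplicate cleanly. The sketch below uses illustrative table paths and field names, not the client’s actual schema, and assumes the job is submitted with the Hudi Spark bundle on the classpath (as on EMR).

```python
# Illustrative PySpark job that writes scraped records into a raw Hudi table.
# Table path, field names, and keys are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("raw-pages-to-hudi")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

raw_df = spark.read.json("s3://example-crawler-raw/raw/pages/")  # hypothetical bucket

hudi_options = {
    "hoodie.table.name": "raw_pages",
    "hoodie.datasource.write.recordkey.field": "page_id",       # unique page identifier
    "hoodie.datasource.write.precombine.field": "crawled_at",   # latest crawl wins
    "hoodie.datasource.write.partitionpath.field": "crawl_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    raw_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-datalake/raw_pages/")  # hypothetical lake path
)
```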
Transforming Schema: We then set up a Step Functions workflow that fired on a fixed schedule. It launched an EMR cluster running PySpark and executed a custom script that read the raw data from S3, transformed it, and wrote it into a transformed schema in our Hudi lake. That transformed schema was exposed to other teams through Athena and Redshift Spectrum.
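A simplified sketch of that transform script, under assumed names, would read the raw table, clean and flatten it, and upsert the result into the transformed table, with catalog sync enabled so Athena and Redshift Spectrum can query it. The exact Hudi sync options vary by version; the ones shown are illustrative.

```python
# Simplified sketch of the scheduled transform job (names are hypothetical).
# Step Functions launches an EMR cluster that runs a script along these lines.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("transform-pages").getOrCreate()

raw = spark.read.format("hudi").load("s3://example-datalake/raw_pages/")

# Extract and normalize the product fields needed downstream.
products = (
    raw.select(
        "page_id",
        "crawled_at",
        "crawl_date",
        F.col("product.name").alias("product_name"),
        F.col("product.price").cast("double").alias("price"),
    )
    .dropna(subset=["product_name"])
)

hudi_options = {
    "hoodie.table.name": "products",
    "hoodie.datasource.write.recordkey.field": "page_id",
    "hoodie.datasource.write.precombine.field": "crawled_at",
    "hoodie.datasource.write.partitionpath.field": "crawl_date",
    "hoodie.datasource.write.operation": "upsert",
    # Sync the table to the metastore so Athena / Redshift Spectrum can query it.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "datalake",
    "hoodie.datasource.hive_sync.table": "products",
    "hoodie.datasource.hive_sync.mode": "hms",
}

products.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3://example-datalake/products/"
)
```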
Data Availability: One of those teams was the data science team, which now had clean, transformed, and structured data readily available. Previously, with MongoDB, they had to pull data from a read node every time they needed to run one of their models, which pushed the MongoDB read replicas close to their limits.
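As a minimal sketch with assumed paths and column names, a model-training job could now read the transformed table directly from the lake rather than paging results out of MongoDB read replicas:

```python
# Minimal sketch of a model-training read against the transformed Hudi table.
# Path and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("train-price-model").getOrCreate()

products = spark.read.format("hudi").load("s3://example-datalake/products/")

# Pull only the columns and date range the model needs,
# instead of streaming whole collections out of MongoDB.
training_df = products.select("product_name", "price", "crawl_date").where(
    "crawl_date >= '2022-01-01'"
)
training_df.write.mode("overwrite").parquet("s3://example-ml/training/price_model/")
```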
Benefits & Outcomes
This solution reduced expenses from the high five figures to the mid four figures. It also let multiple teams access all of the data, going back to the company’s earliest days, without the significant extra expense of building a new system to ingest it.
The setup also unlocked new possibilities: ML models could now ingest all of the data at once rather than pulling it from MongoDB a few chunks at a time.
With storage and processing power separated, the business could scrape even more sites without worrying about the cost of adding new sites and products to the portfolio.
We were also able to manage data at the record level in the S3 data lake, which simplified Change Data Capture (CDC), streamlined streaming data ingestion, and handled data privacy use cases that require record-level deletes and updates.
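For instance, a record-level delete (such as a privacy request) can be expressed as a Hudi write using the delete operation. The sketch below uses assumed key and partition fields, which must match how the table was configured.

```python
# Sketch of a record-level delete in the Hudi table (e.g., a privacy request).
# Key, precombine, and partition fields are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delete-records").getOrCreate()

table_path = "s3://example-datalake/products/"
products = spark.read.format("hudi").load(table_path)

# Build a DataFrame containing just the records to remove.
to_delete = products.where("page_id IN ('page-123', 'page-456')")

delete_options = {
    "hoodie.table.name": "products",
    "hoodie.datasource.write.recordkey.field": "page_id",
    "hoodie.datasource.write.precombine.field": "crawled_at",
    "hoodie.datasource.write.partitionpath.field": "crawl_date",
    "hoodie.datasource.write.operation": "delete",
}

to_delete.write.format("hudi").options(**delete_options).mode("append").save(table_path)
```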