The client's core business activity was crawling the web to collect data on millions of products. They used a custom-designed crawler that saved each crawled webpage into a MongoDB database. That database had grown to multiple terabytes and was costing the client close to six figures for their managed MongoDB instance.
The Challenge
A re-architected web crawler solution with ample storage for massive data volumes and a reduced maintenance cost.
With web crawling as a major business activity, our client needed a crawler that could save data on millions of products. They already had a custom-designed crawler in place that saved each crawled webpage into a MongoDB database. The hitch: that database had grown to multiple terabytes and was costing them close to six figures for their managed MongoDB instance.
Resolution Summary
It is indeed possible to store massive amounts of data, with reduced maintenance cost as the icing on the cake.
Our team of technology experts helped the client re-architect a modern solution that stores the same data at a far lower maintenance cost, without giving up the ease of use they were accustomed to with MongoDB.
Resolution in detail
Leveraging decades of expertise from our technology specialists.
After evaluating Delta Lake as a possible alternative, we created a data lake using Apache Hudi. The crawler ingested roughly 1 TB of data per hour, placing it on a queue that then persisted it into MongoDB.
Our first step was to fork the data in the queue and persist a copy to an S3 bucket. S3 is an inexpensive way to store large amounts of data, especially compared to MongoDB, where storage and compute are coupled together. By storing the data in S3, we could separate the cost and resources needed to store the data from those needed to process it.
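The fork itself can be as simple as an extra consumer on the existing queue that copies each message to S3 before the message continues on to MongoDB. Below is a minimal sketch of that idea; the bucket name, key layout, and record fields are illustrative assumptions, not the client's actual implementation.

```python
import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "crawler-raw-pages"  # hypothetical bucket name


def fork_to_s3(page_record: dict) -> None:
    """Persist one crawled-page message to S3; the original queue consumer
    still forwards the same message to MongoDB unchanged."""
    now = datetime.now(timezone.utc)
    # Partition raw objects by crawl date so downstream jobs can read one day at a time.
    key = f"raw/dt={now:%Y-%m-%d}/{now:%H%M%S%f}.json.gz"
    s3.put_object(
        Bucket=RAW_BUCKET,
        Key=key,
        Body=gzip.compress(json.dumps(page_record).encode("utf-8")),
        ContentType="application/json",
        ContentEncoding="gzip",
    )
```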
We then used Apache Hudi to create a data schema for storing all of the scraped information. This alone was an improvement: MongoDB had been so expensive that the data was being transformed in the queue just to reduce its dimensionality and size before it was stored.
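Writing the raw records into a Hudi table on S3 comes down to choosing a record key, a precombine field, and a partition path. A minimal PySpark sketch is shown below; the paths and field names are illustrative, and the real schema followed the scraped product data.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("raw-pages-to-hudi")
    # The Hudi Spark bundle is assumed to be on the classpath (it ships with EMR).
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read the forked raw page dumps from S3.
raw_pages = spark.read.json("s3://crawler-raw-pages/raw/")

hudi_options = {
    "hoodie.table.name": "raw_pages",
    "hoodie.datasource.write.recordkey.field": "url",          # one row per page URL
    "hoodie.datasource.write.precombine.field": "crawled_at",  # keep the latest crawl
    "hoodie.datasource.write.partitionpath.field": "crawl_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    raw_pages.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://crawler-data-lake/raw_pages/")
)
```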
We then set up a Step Function that fired at a specific interval. It launched an EMR cluster running PySpark and ran a custom script that read the data from S3, transformed it, and moved it into a transformed schema in our Hudi lake. This transformed schema was exposed to other teams via Athena and Redshift Spectrum. One of those teams was the data science team, who now had clean, transformed, structured data readily available to them. Previously, while using MongoDB, they had to pull data from a read node every time they needed to run one of their models, which pushed the MongoDB read replica almost to its limit.
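The scheduled transformation script follows the same Hudi write pattern: read the raw data, reshape it, and upsert it into the transformed table that Athena and Redshift Spectrum query. The sketch below is a simplified stand-in; the column names and product fields are assumptions, and the real job handled the full scraped schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("transform-raw-pages")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read the raw Hudi table written by the ingestion step.
raw = spark.read.format("hudi").load("s3://crawler-data-lake/raw_pages/")

# Illustrative transformation: pull structured product fields out of the raw
# page payload and drop the bulky HTML before exposing the data to other teams.
products = raw.select(
    "url",
    "crawl_date",
    "crawled_at",
    F.col("product.id").alias("product_id"),
    F.col("product.title").alias("title"),
    F.col("product.price").alias("price"),
)

hudi_options = {
    "hoodie.table.name": "products",
    "hoodie.datasource.write.recordkey.field": "product_id",
    "hoodie.datasource.write.precombine.field": "crawled_at",
    "hoodie.datasource.write.partitionpath.field": "crawl_date",
    "hoodie.datasource.write.operation": "upsert",
}

(
    products.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://crawler-data-lake/products/")
)
```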
Ground-breaking business impacts
An all-in-one, reduced-cost solution with access to the full historical data at no extra investment.
Our solution reduced the client's expense from the high five figures to the mid four figures. It gave multiple teams access to ALL of the data, going back as far as the beginning of the company, without the considerable additional expense of building a new system for them to ingest it.
Feel the power of instant, whole-dataset ingestion for ML models and greater site scrapability at a decreased cost.
This setup also unlocked additional possibilities: ML models could now ingest all of the data at once rather than pulling it from MongoDB a few chunks at a time. Better yet, with storage and processing power separated, the business could scrape even more sites without worrying about the cost of adding new sites and products to the portfolio.
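For the data science team, ingesting all of the data at once amounts to pointing Spark at the transformed table instead of paging through a MongoDB read replica. A brief, illustrative sketch, reusing the assumed paths and columns from the earlier examples:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("feature-extraction")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Load the entire transformed product history in one pass; no paging through
# a MongoDB read replica, and storage scales independently of compute.
products = spark.read.format("hudi").load("s3://crawler-data-lake/products/")

# Features can now be built directly on the full dataset, e.g. the price
# history of every product ever crawled.
price_history = products.select("product_id", "crawl_date", "price")
price_history.write.mode("overwrite").parquet("s3://ml-workspace/price_history/")
```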