Modern Data Lake on AWS for CBR Sciences

DataGrokr built a modern data lake for CBR Sciences, a leading marketing analytics consulting organization that specializes in building models of customer behavior. The data lake ingests billions of transaction records from multiple third-party providers and holds roughly 45 TB of data. DataGrokr built the entire cloud infrastructure to PCI-compliant security standards. The data lake uses several AWS data services to transform and store the data, and GitLab CI/CD pipelines handle code deployment and orchestrate the data pipelines.
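To illustrate the CI/CD-driven orchestration described above, a GitLab pipeline for this kind of setup might look like the minimal sketch below. The stage names, job names, bucket, and commands are illustrative assumptions, not the client's actual configuration:

```yaml
# Hypothetical .gitlab-ci.yml sketch: test, deploy pipeline code, then trigger a data pipeline run.
stages:
  - test
  - deploy
  - run-pipeline

test:
  stage: test
  script:
    - pip install -r requirements.txt
    - pytest tests/

deploy-glue-jobs:
  stage: deploy
  script:
    # Upload Glue job scripts to S3 with the AWS CLI (bucket name is a placeholder)
    - aws s3 sync glue_jobs/ s3://example-bucket/glue-scripts/
  only:
    - main

run-pipeline:
  stage: run-pipeline
  script:
    # Kick off an (assumed) Glue job by name
    - aws glue start-job-run --job-name example-ingest-job
  when: manual
```

A setup like this keeps deployment and pipeline triggering in one place, so a merge to `main` can both ship code and, on demand, run the downstream data pipeline.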

About the Client

Our client, CBR Sciences, is a leading marketing analytics consulting organization that specializes in building models of customer behavior. They acquire data from a variety of public and proprietary sources to create customized models that serve the needs of online businesses, helping level the playing field so that small businesses can compete with larger enterprises.

Client’s need and Problem statement

CBR Sciences was looking to build a scalable, cost-effective data platform that would allow them to ingest, process, and store data from multiple sources. The platform had to be highly secure and meet PCI security standards. They also wanted modern MLOps capabilities so that data scientists and DevOps professionals could collaborate seamlessly and shorten deployment times for their models. They were looking for a solution provider with expertise not only in data engineering, but also in building their cloud environment from scratch and helping them adopt MLOps principles.

Tech Stack

Our solution and outcomes

  • We designed and developed a comprehensive data ingestion and processing platform using a variety of AWS services. S3 was chosen as the storage layer both for its low cost and because most of the data received few updates. RDS Postgres held the data that did receive updates, and that data was exported to S3 once it became stable. Data transformations were done using Glue and Athena, and the data pipelines were optimized for both cost and performance. AWS SageMaker and GitLab were used to implement MLOps practices.
  • The entire platform was built by a single scrum team of experienced data engineers and cloud engineers within 6 months. The platform's operational cost, under $10K of monthly AWS spend, was a fraction of the cost of their earlier platform, which handled far less data.
  • DataGrokr continues to provide ongoing support for the platform. We have also embedded our cloud engineers with their data scientists to help them transition to modern practices for building and deploying models.
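The Athena-based transformation step mentioned above can be sketched in Python. A common pattern is an Athena CTAS (CREATE TABLE AS SELECT) statement that rewrites raw data into partitioned, compressed Parquet on S3, which cuts both scan costs and query times. The database, table, column, and bucket names below are illustrative assumptions, not the client's actual schema:

```python
# Hypothetical sketch of an Athena CTAS transformation for an S3 data lake.
# All identifiers (database, tables, partition column, bucket) are placeholders.

def build_ctas_query(database: str, source_table: str, dest_table: str,
                     output_location: str) -> str:
    """Build an Athena CTAS statement that rewrites raw data as
    Snappy-compressed Parquet, partitioned by transaction date."""
    return f"""
    CREATE TABLE {database}.{dest_table}
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = '{output_location}',
        partitioned_by = ARRAY['txn_date']
    ) AS
    SELECT * FROM {database}.{source_table}
    """

# The statement would then be submitted via boto3, e.g.:
#   athena = boto3.client("athena")
#   athena.start_query_execution(
#       QueryString=build_ctas_query("lake", "raw_txns", "txns_parquet",
#                                    "s3://example-bucket/curated/txns/"),
#       ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
#   )
```

In practice a step like this would run inside a Glue job or a CI/CD-triggered task, with the raw data left in place on S3 and the curated Parquet tables serving downstream analytics and model training.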