Canadian Energy Client

Databricks & AWS Proof-of-Concept to improve processing of Data Engineering Workloads

Project Details

Background

A Canadian intermediate oil and natural gas producer ran a hybrid on-premises and cloud data environment for data engineering workloads. Long pipeline run times were preventing timely insights for the business, and the IT group was spending too much time maintaining existing data infrastructure. Costs had recently risen sharply, and the organization was not prepared to pay significantly more to upgrade its existing deployment beyond the current configuration. The organization sought a proof-of-concept, run on its own data, to evaluate how Databricks and AWS-native services (Glue, Redshift) compared with the current solution on price, performance, and ease of use. Data Elephant assembled a small team of AWS and Databricks experts and completed the POC in two weeks.

Challenge

Long run times for data refreshes were becoming a business issue, particularly during peak daytime hours. Adding more DW units to speed up existing ETL workloads was estimated to increase costs by 2-3x. The environment contained a wide variety of pipelines, several of which took over an hour to run; the pipeline chosen for the POC processed finance data and took more than 60 minutes. The organization’s internal data team was small, with minimal in-house cloud or Databricks skills.

Outcomes

  • POC Use Case Pipeline Selection and Testing Scenario Design

  • AWS Architecture Design and Setup, provisioning core POC services

  • Migration of POC Pipeline SQL Code and Data (“as-is” and rearchitected)

  • Detailed Performance Testing and Cost Analysis of POC in AWS-native services and in Databricks

  • Recommended Future Use Cases

  • POC Result: runtime of >60 mins in the current Synapse environment, 6 mins in AWS-native services, and 3 mins in Databricks
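The POC result above implies a minimum speedup for each target platform. A minimal sketch of that arithmetic follows, assuming a 60-minute baseline; because the Synapse runtime is reported only as ">60 mins", the computed figures are lower bounds, and the variable and function names are illustrative rather than part of the POC deliverables.

```python
# Runtimes reported in the POC result, in minutes.
# Assumption: 60 is used as the baseline because the source only states ">60 mins",
# so every speedup computed here is a lower bound, not an exact figure.
BASELINE_MIN = 60
runtimes_min = {
    "AWS-native (Glue/Redshift)": 6,
    "Databricks": 3,
}


def min_speedup(baseline_min: float, new_min: float) -> float:
    """Return how many times faster the new runtime is than the baseline."""
    return baseline_min / new_min


speedups = {
    name: min_speedup(BASELINE_MIN, minutes)
    for name, minutes in runtimes_min.items()
}

for name, factor in speedups.items():
    print(f"{name}: at least {factor:.0f}x faster than the current pipeline")
```

On these figures the finance pipeline ran at least 10x faster on AWS-native services and at least 20x faster on Databricks. Runtime alone does not capture the price comparison, which the POC assessed separately in its cost analysis.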

Why AWS? The client was looking to AWS as an alternative cloud provider for its current data pipelines. The ease of setting up a new AWS environment, of migrating existing SQL pipelines to Glue and Redshift, and of setting up Databricks on AWS drew the client to these platforms.

Why Data Elephant? Data Elephant was recommended by Bonavista’s AWS Account Executive as a flexible partner with expertise in designing and executing rapid POCs and in measuring their cost and performance. Its expertise in AWS and Databricks, as well as its experience in migrating from Synapse, enabled the team to diagnose and solve the core challenges in this POC.

POC Solution Components: S3, Redshift, Glue, Databricks on AWS (via Marketplace)