Constructing A Data Pipeline From Scratch
Untitled is frequently asked how we approach constructing a data pipeline from scratch. Through trial and error across multiple projects with complex implementations, we believe we have created a foolproof system to ensure a successful data pipeline implementation. In this post, we describe our process for designing and implementing a data pipeline.
First and foremost, we are adamant about communicating in both a technical and non-technical manner so that any stakeholder in an organization we work with has a clear understanding of the data pipeline construction process.
We know that the companies we work with may have set processes or bandwidth constraints for a variety of reasons. With that said, our job is not only to build an effective solution for your company, but also to ensure that every stakeholder’s needs and problems are addressed through the pipeline implementation.
Step 1: Discovery and Initial Consultation
The first step of any data pipeline implementation is the discovery phase. We never make assumptions when walking into a business that has reached out for our help in constructing a data pipeline from scratch. The goal of the initial consultation is to get an understanding of the current problems the organization is experiencing, and how we can architect a solution that solves these problems.
In this phase, we also identify all the data sources the organization currently uses and has access to. For this initial meeting, it normally helps to have a technical person from the organization who knows where these systems live and what their design specs are. It’s also helpful to retrieve some sample records from the data sources we identify to get a better understanding of current data hygiene.
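As a rough illustration of what checking data hygiene on sample records can look like, the sketch below counts missing values per field and flags duplicate keys. The function name `profile_records`, the field names and the sample rows are all hypothetical and are not part of any specific client engagement.

```python
from collections import Counter


def profile_records(records, key_field):
    """Summarize missing values per field and duplicate keys in sample records."""
    # Collect every field name that appears in at least one record.
    fields = sorted({name for record in records for name in record})
    missing = Counter()
    for record in records:
        for name in fields:
            value = record.get(name)
            # Treat absent fields, None and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing[name] += 1
    key_counts = Counter(record.get(key_field) for record in records)
    duplicates = sorted(
        str(key) for key, count in key_counts.items() if key is not None and count > 1
    )
    return {
        "rows": len(records),
        "missing_by_field": {name: missing[name] for name in fields},
        "duplicate_keys": duplicates,
    }


# Hypothetical sample pulled from a customer system during discovery.
sample = [
    {"customer_id": "C1", "email": "a@example.com", "region": "West"},
    {"customer_id": "C2", "email": "", "region": "East"},
    {"customer_id": "C1", "email": "a@example.com"},
]
report = profile_records(sample, "customer_id")
```

A summary like this is only a starting point for the conversation about data cleaning; the checks that matter will depend on the systems uncovered during discovery.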
Step 2: Research and Solution Design
Once we have gone through the initial consultation, we begin researching and designing the pipeline to meet the requirements of the organization. This part of the process is critical to how effective the pipeline will be, and to implementing a data pipeline from scratch without consuming vast amounts of resources and time.
Typically, we take 4-5 business days to research and design a schematic of the solution we would like to implement. This schematic provides a well-rounded idea of how data will flow through the entire organization’s pipeline.
Step 3: Recommendation and Research Around Tech Implementations
This phase of building a data pipeline from scratch is where we make recommendations and have discussions with the client regarding what new software to implement and what systems we can augment or replace. In this step, we typically pick the data warehouse with the client (if they don’t already have one) as well as the business analytics solution they would like to use as the end destination of the data.
While we generally gravitate towards the AWS environment for a data warehouse because of cost, we can work with you if Microsoft Azure, Google BigQuery or another warehouse solution is better suited to your needs. When it comes to business analytics solutions, we steer clients to choose between PowerBI and Tableau. In rare cases, some clients pick custom solutions such as Periscope Data. However, this route is a much more expensive option and we typically advise against it unless a client operates at a very large scale.
Step 4: Scoping and Proposal
Once we have come to an agreement with the client on the solutions to be implemented, augmented or replaced, along with consensus on the design specifications outlined in the schematic, we scope out a timeline of execution and create a proposal for the data pipeline. Every client’s needs are different, but our goal is to have the big pieces of the pipeline implemented within 90-120 days of receiving a signed scope of work.
We’ve noticed that if the pipeline is estimated to take longer than 120 days, the client is trying to do too much at once. Instead, we’ve found it’s better to break up the project into an MVP implementation that captures the low-hanging fruit of building the pipeline, then take care of the rest of the project in a separate scope of work.
Step 5: Break Ground and Milestone Execution
When Untitled is constructing a data pipeline from scratch, we create extremely clear and attainable milestones so the client can measure our progress along the way. This also helps to ensure the project stays on track and is delivered on time.
Once a signed proposal is returned, we make it a goal to knock out a majority of the big problems regarding the implementation in the first 45 days. This typically involves a lot of data cleaning, modifications to an organization’s data schema and getting all of the systems to drop the data in a single warehouse hosted on a platform such as AWS.
Step 6: ETL, Modeling Data and Reporting Automation
Once all of the data has been accurately dropped into a data warehouse and we have built automation around this process through extract, transform, load (ETL) scripting or ETL solution implementation, we then connect the warehouse to the business analytics solution the client has chosen. If, for example, the client picked AWS as their data warehouse and PowerBI as their analytics solution, the process would look like this:
- Ensure the data is correctly dropped and logically bucketed in AWS S3
- Use AWS Lambda to ETL the data from S3 into the database of choice, such as AWS Redshift
- Connect the Redshift server to PowerBI
- Pull through all of the data from the Redshift server into PowerBI through the query editor interface and set a schedule for how often the data should be refreshed
- Model the data into a dashboard to the specifications of the organization to display key performance indicators or information that the company needs to see on a daily, weekly and monthly basis
- Build time-sensitive reporting around the dashboard that can be sent out via automated email to key stakeholders daily, weekly (or on some other schedule) or based upon trigger events (such as forecasted inventory stockouts)
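The S3-to-Redshift step in the list above can be sketched as a Lambda-style handler that turns an S3 event notification into Redshift COPY statements. This is a minimal sketch under stated assumptions, not a description of our production code: the role ARN, the bucket and table names, and the convention that the first path segment of the object key names the target table are all hypothetical. The `execute` parameter is a stand-in for whatever actually runs the SQL against the warehouse.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical role ARN; replace with a role that Redshift is allowed to assume.
IAM_ROLE = "arn:aws:iam::123456789012:role/example-redshift-copy"


def copy_statements(event, iam_role=IAM_ROLE):
    """Build one Redshift COPY statement per CSV object in an S3 event."""
    statements = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        # Assumed convention: the first path segment names the target table.
        table = key.split("/", 1)[0]
        is_csv = key.endswith(".csv")
        safe_table = re.fullmatch(r"[a-z_][a-z0-9_]*", table) is not None
        if not is_csv or not safe_table or "'" in key:
            continue  # skip objects that do not match the convention
        statements.append(
            f"COPY {table} FROM 's3://{bucket}/{key}' "
            f"IAM_ROLE '{iam_role}' FORMAT AS CSV IGNOREHEADER 1;"
        )
    return statements


def handler(event, context, execute=print):
    """Lambda-style entry point; `execute` stands in for a database call."""
    statements = copy_statements(event)
    for sql in statements:
        execute(sql)
    return {"loaded": len(statements)}
```

Keeping the statement builder separate from the code that runs the SQL makes the bucketing convention easy to test locally before anything is deployed.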
Step 7: Support and Training
Once initial modeling is complete, we offer training and support for the data pipeline and analytics solution on-site or over the web. Our goal is to treat a data pipeline implementation not as a one-time project, but as a growing solution that scales to the needs of the organization and continues to build value and positively impact ROI for the client.
We stand by the solutions we implement and aim to use the data pipeline as a means to help our partners become more data-centric and progressive in their digital transformation process.
We hope our process of constructing a data pipeline from scratch sounds straightforward and understandable to you. If this post has sparked your curiosity and you would like to engage in an initial data pipeline consultation, please reach out to us through the contact form. A member of our data analytics team will follow up with you promptly to learn how we can help your company leverage the data sources and systems you have in place to gain a competitive edge for your organization.