Automate your ETL pipelines today
Building one ETL pipeline manually is simple. But tens or hundreds? That's when you need to automate.
Agile automation built around an enterprise ETL process
- Instantly connect to any data source without writing a single line of code
- Choose from an extensive list of pre-built connectors
- Quickly create custom connectors with our world class experts
- Convert data to the schema that your team requires
- Incorporate business logic that's unique to your business
- Generate foundational data that you can rely on
Load & Analyze
- Land data ready for analytics into any cloud destination
- Sync your data with leading business intelligence tools
- Get the clearest possible view of your opportunities
Switchboard steps in when ETL falls short
Implement a smart ETL platform
Easily create and manage multiple API calls into multiple sources. Extract only the specific, granular data you require for use. Only pull the data you need when you need it.
Drive real-time analytics
Automate with a comprehensive SaaS platform combined with world-class, dedicated customer success engineering. Non-technical business teams get to the desired dashboards, when they need it, regardless of how many data pipelines you have or how many ways you need to cut the data. Be free to focus on analytics and insights.
Ensure security and compliance
Rely on best in class controls, security and compliance practices with Switchboard, a SOC-1 and SOC-2 certified service company. Trust a single source of truth for your data with the utmost confidence.
“Switchboard enabled us to get our first data sets onboarded before new engineering resources were able to be brought in, so they could hit the ground running. It's that kind of velocity and agility that has enabled us to take advantage of the data opportunity so quickly.”
Ameen Kazerouni, Chief Data & Analytics Officer, Orangetheory FitnessRead the Full Story
ETL: The ultimate guide
ETL is a critical process for any data-driven enterprise. Switchboard automates this process and adds proprietary components to make the output ideal for analytics. If you're wondering: “what does ETL actually mean?” when it comes to applying it to your data, then read on.
In its simplest sense, 'ETL' (Extract, Transform, and Load) is a term used to describe the overall process of extracting raw data from multiple disparate sources, transforming that raw data into foundational data, and loading it into a data warehouse.
Sounds easy, right? Implementing ETL is anything but simple. When revenue and operations executives talk about data unification, we often hear the response, 'Oh, that's just ETL'. But what exactly is 'ETL'? And why shouldn't you dismiss it as a trivial process? Why is it so hard to perform manually and when should you look to automate? Let's dive into some of the complexity involved to answer all these questions and more.
'Extract' refers to the process by which the raw dataset is copied or exported from its original locations into a temporary staging repository, before it can be moved to its final destination. Extraction must be processed correctly so that data can be successfully passed to the subsequent phases.
Data unification almost always involves multiple sources. Moreover, the data can be structured or unstructured. Structured datasets comprise well-defined types and patterns that make them easy for computers to parse, such as an SQL relational database, whereas unstructured datasets are normally classified as 'everything else', such data from audio, video and social media content.
There are many factors to consider during the extraction phase, and each comes with a plethora of questions which need to be answered:
- Timing is important - When do you extract the data? What timezone are the timestamps in the dataset?
- Sometimes a source fails or is missing - Do you then re-pull data from previous days and backfill? What do you backfill and how? How often do you pull the data?
- Sometimes extraction involves API calls to a data source - Do these have a rate limit? Does the API have a quota? What happens if the API is down? Do you know how the API works? Who maintains the code used to connect to the API? Do you have engineers who know how to manage sudden or unexpected API changes?
- Security and regulatory compliance - What credentials are required to extract the data? Data extraction needs to be carried out securely - how is this security maintained? Who is responsible?
- The scale of the data - How large are the datasets? Do you have the resources to extract that much, in terms of compute, storage, and scale, but also time and engineering resources? What happens when they grow in variety or volume?
- All of these factors affect cost - Have you budgeted enough for the engineering resources involved in extracting all your data? Are you extracting too much data, leading to unnecessary warehousing costs?
All these questions - and many more - must be answered comprehensively for extraction to be successful.
During this phase, the raw data is transformed into foundational data. A series of rules are applied to the extracted dataset to aggregate and normalize it for its intended use. This involves several overarching processes, each with their own set of sub-processes:
- Data cleansing - Inconsistencies and missing values in the raw data must be resolved. This involves filtering, deduplication, validation, and authentication.
- Standardization - Formatting is applied to the data. This involves calculation, translation, and summarization, such as unit conversions and tallies.
- Verification - Unusable data and anomalies need to be flagged and removed. This involves auditing the dataset to ensure it is of sufficient quality, and that it is compliant with its use-case.
- Formatting and sorting - The data must be configured to match the schema of the intended target location. This requires sorting the dataset according to type, and placing it in tables or databases of the correct format.
- Labeling - during transformation, files should be labeled correctly and consistently to ensure the data can be loaded and used properly.
- Governance and protection - Any dataset which is subject to regulation may need to be encrypted or removed.
As with extraction, each step of the transformation phase will throw up a number of questions and complications. What are you transforming? Where are you transforming? How do you know the rules you are applying are correct? How can these rules be audited? Do you have the processing power to apply all these rules? What's the processing cost? When the extraction process changes during transformation, such as Facebook adding a new data point, how do you manage that change? This would affect scheduling. Do you return to the extraction phase to backfill and fix it? How do you track the changes made to your processes?
The scale of the dataset will also affect a number of factors: How do you process a large dataset, and how fast will it be? Who oversees validation? For example, this could be an engineer or a data analyst. This is not an exhaustive list of considerations, but a sample of the questions which need answering in the transformation phase.
The final phase of ETL involves loading the transformed data into the target destination, which is usually a data warehouse or data lake. Typically, a 'full loading' is implemented initially, which includes all data, followed by a periodic 'incremental loading' of any changes. During incremental loading, the incoming dataset is compared with existing records to determine whether unique information is available. If so, then data points are overwritten or new records are created.
While 'load' may seem like the simplest phase, there are also a myriad of questions to consider here.
- The properties of the target destination - Where are you putting the data, and who is responsible for maintenance? Is the dataset correctly cataloged?
- Support for the business team - Does the business team, who rely on the dataset, know how to use it? What happens if some of the data fails to load, or is malformed? If a new source is needed, who does the business team need to contact? For example, if they find they want to integrate Snapchat data, how do they request this? Do they need to go back to the team taking care of the extraction phase?
- Maintenance cost - Data must remain valid. Who is responsible for this and how much resource will it take?
- Archiving - What about archiving old data? Do you have a naming convention? How do you know if these processes aren't running, and who's monitoring it?
- Data governance - Is the dataset under any form of governance which requires certain rules to be followed? Is there any PII (Personally Identifiable Information)? How do you audit and regulate the data?
It's relatively straightforward and inexpensive to activate data warehouses. But all of these factors - and the scale of the data involved - mean data storage can quickly become expensive and time-consuming to maintain.
Broadly speaking, ETL tools are software platforms which help facilitate ETL processes, and there are several different types. Enterprise ETL tools are commercially supported, but are often expensive. Open source tools are free and the source code can be modified according to your needs, but they come with no support.
There are also cloud-based ETL tools, which leverage cloud technology to provide greater availability and elasticity. However, these platforms usually do not support datasets stored in other locations, so they must first be transferred to the provider's platform. Some companies even develop their own custom ETL tools, the advantage being that the resulting tool is flexible and bespoke to their needs. The disadvantage is that developing your own tool requires considerable internal resources, including testing, maintenance, and updates.
ETL is usually referred to in three stages, but in a real-world data unification scenario there are at least five ('ECTLA'):
- Extract - Capture data from disparate sources into a temporary staging location.
- Clean - Ensure the quality of the data prior to transformation through a series of standardization and normalization steps, removing any defects and invalid records.
- Transform - Process the dataset by applying a series of rules, then convert it to the correct format for the intended target destination.
- Load - Transfer the data to the target destination, which is usually a data warehouse.
- Analyze - Evaluate the finished data to gain insights for business uses.
Data integration is the process of combining datasets from different sources to provide a unified view of them. This is part of the 'extract' phase in ETL.
How you build an ETL pipeline really depends on the tools you're using. The two programming languages most commonly used for ETL are SQL and Python.
A data pipeline usually refers to a production-ready procedure that has proven to be reliable and secure, so can be used repeatedly. In contrast, ETL is a unique and bespoke event which produces a major change in the nature of the data. While data pipelines involve transferring datasets, they do not necessarily transform them. An analogy would be to compare a delivery firm with a moving company: A data pipeline is like a global courier with repeatable processes, monitoring, and scale, whereas ETL is like a moving truck used for a one-off transportation between homes.
Here are the general steps you would follow to create an ETL pipeline using Python, or any other appropriate programming language:
- Hire engineers - This may be a single engineer, or, depending on the scale of the data involved, a team of engineers.
- Procure the credentials - These are needed to access the data sources, and must be provided to your python environment.
- Write the extraction code - This is the code required to connect the various data sources.
- Produce the cleaning code - Write the code to clean and standardize the raw data.
- Write the transformation code - This applies the required transformations on the data.
- Create the loading code - Write the code to load data into the target destination.
- Perform ETL testing - Deploy the code securely and analyze the results.
- Set up monitoring and alerting - This is to manage your server's resources and the scaling of the data.
A simple, one-off, ETL pipeline can be easy to learn and implement. However, the process quickly becomes complex as you add more pipelines and sub-workflows. Building one pipeline is easy, but building 50 is difficult, so ETL rapidly becomes harder as the scale of the data increases. Moreover, the need to execute multiple factors well, such as engineering, security, governance, monitoring, and maintenance, make a real-world ETL pipeline challenging to learn in practice.
Creating ETL processes from scratch requires a great deal of coding, as well as proficiency in multiple languages and technologies, such as Dev Ops, monitoring, data warehousing, vendor management, and credential management. As the number of pipelines - and the resulting complexity - increases, so too does the need for automation to control them. However, there are many ETL tools which make this process much easier. There are also ETL service providers, who take care of the whole process for you, so you never even need to touch a line of code.
As a reliable business process, ETL absolutely can and should be automated. Writing ETL operations code is complicated and error-prone, and this becomes difficult to troubleshoot in highly complex pipelines. An automated tool allows your engineering team to easily design, execute, monitor, and test an ETL pipeline, usually via a simple drag-and-drop GUI. Why write ETL code from scratch when it's probably already been produced better elsewhere? The use of built-in templates and connectors minimizes the need for manual programming and oversight, making the ETL process much faster to create, and easier to maintain.
Catch up with the latest from Switchboard
Marketing and revenue teams can stand up analytics and AI projects 10x faster through automated data engineering platform Switchboard, the leading data engineering automation platform,…