ETL is an acronym for Extract, Transform, and Load. It refers to the process of extracting data from source systems, transforming the data in some way (manipulating it, filtering it, combining it with other sources), and finally loading the transformed data into one or more target systems.
ETL is especially prominent in data warehousing, and the term is an industry standard for data movement and transformation. An ETL process performs the following functions (illustrated in the sketch after this list):
- Extract data from homogeneous or heterogeneous data sources
- Transform data: cleansing, reformatting, standardizing, aggregating, joining with other data, and applying business logic
- Load the data into specified target tables, systems, or flat files
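To make the three phases concrete, here is a minimal sketch in Python using only the standard library. The file names, column names, and the "completed orders" business rule are hypothetical, chosen purely for illustration:

```python
import csv

def extract(path):
    """Extract: read raw rows from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse, standardize, and apply business logic."""
    out = []
    for row in rows:
        # Cleansing: skip rows with a missing customer id.
        if not row.get("customer_id"):
            continue
        # Standardization: normalize country codes to upper case.
        row["country"] = (row.get("country") or "").strip().upper()
        # Business logic (hypothetical rule): keep only completed orders.
        if row.get("status") == "completed":
            out.append(row)
    return out

def load(rows, path):
    """Load: write the transformed rows to a target flat file."""
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract("orders_raw.csv")), "orders_clean.csv")
```

In a real pipeline the target would more often be a database table than a flat file, but the extract-transform-load structure stays the same.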
ETL processes are reusable components that can be scheduled to perform data movement at particular time intervals. Run as batch processes, ETL jobs support massively parallel processing of data, and they are generally scalable and easy to maintain. ETL is typically used to move data across systems when high data volumes and complex business rules are involved.
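As a sketch of the "reusable, schedulable component" idea: the same job function can be parameterized by a batch run date and invoked by an external scheduler such as cron. The job name, date parameter, and cron line are illustrative, not a prescribed setup:

```python
import argparse
from datetime import date

def run_etl_job(run_date: date) -> None:
    """A reusable ETL job parameterized by the batch run date.

    A real implementation would extract the source partition for
    run_date, transform it, and load it into the target system.
    """
    print(f"Running ETL batch for {run_date.isoformat()}")

if __name__ == "__main__":
    # A scheduler invokes this entry point at a fixed interval, e.g.
    # cron: 0 2 * * * python etl_job.py --date $(date +\%F)
    parser = argparse.ArgumentParser()
    parser.add_argument("--date", default=date.today().isoformat(),
                        help="Batch date in YYYY-MM-DD format")
    args = parser.parse_args()
    run_etl_job(date.fromisoformat(args.date))
```

Because the job takes its run date as a parameter, the same component can be rerun for a failed day or backfilled over a historical range.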
Big Data and the ELT approach (yes, ELT)
Traditional data integration was performed with batch processing (data at rest), while big data integration can be done in real time as well as in batches. In some cases this reorders the ETL phases into ELT: the data is extracted, loaded into a distributed file system, and then transformed before being used.
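The following is a rough sketch of the ELT pattern using PySpark, assuming a running Spark environment with access to object storage and HDFS; the bucket, paths, column names, and aggregation are hypothetical. The point is the ordering: raw data lands in the distributed file system first and is transformed afterwards.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract: read raw data from the source (paths are hypothetical).
raw = spark.read.csv("s3://source-bucket/orders/*.csv", header=True)

# Load: land the raw, untransformed data in the distributed file system.
raw.write.mode("overwrite").parquet("hdfs:///landing/orders_raw")

# Transform (after loading): shape the data at processing time.
orders = (spark.read.parquet("hdfs:///landing/orders_raw")
          .withColumn("amount", F.col("amount").cast("double")))
daily = (orders
         .filter(F.col("status") == "completed")
         .groupBy("order_date")
         .agg(F.sum("amount").alias("total_amount")))
daily.write.mode("overwrite").parquet("hdfs:///warehouse/orders_daily")
```

Keeping the untransformed copy in the landing area means new transformations can be applied later without re-extracting from the source.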
Commonly used ETL tools include:
- SQL Server Integration Services (SSIS)
- Informatica PowerCenter
- IBM InfoSphere DataStage
- Syncsort DMX / DMX-h
- Oracle Data Integrator
- Oracle Warehouse Builder
- SAS Data Management
- CloverETL
Open-source solutions are available as well, such as:
- Pentaho Data Integration (Kettle)
- Talend Open Studio