ETL Uncovered

If you want to deploy a business intelligence system or analyze historical records from your information system, data quality is one of the top concerns! For a company to thrive in today's market, data consolidation and analysis are critical.

ETL is a three step process in database management and data warehousing:

  • Extract the data from various sources, which may be homogeneous or heterogeneous, into a single repository. Those sources can be relational databases, spreadsheets, XML files, CSV files, and so on.
  • Transform the data into the desired schema. This can involve mapping functions that cleanse the data before it moves into the new system, filters that trim the data down to what you actually need, or format conversions driven by rules or lookup tables (since the old system's format may differ from the new one's!).
  • Load the transformed data into the destination database, or the warehouse.
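The three steps above can be sketched as one small pipeline. This is a minimal illustration, not a real ETL tool: the CSV source, the target `orders` schema, and the cleansing rules are all hypothetical stand-ins.

```python
import csv
import io
import sqlite3

# Hypothetical CSV source, standing in for one of the "various sources".
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,3.99
"""

def extract(raw):
    """Extract: read rows out of a CSV source into dicts."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    """Transform: cleanse and reshape rows to fit the target schema."""
    return [
        (int(r["order_id"]), r["customer"].title(), float(r["amount"]))
        for r in rows
    ]

def load(rows, conn):
    """Load: insert the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory "warehouse" for the demo
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT customer, amount FROM orders").fetchall())
# [('Alice', 10.5), ('Bob', 3.99)]
```

Note how each stage is a separate function: that makes it easy to swap in a different source or target later without touching the transformation logic.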

Notice that we transform the data first and then load it into the final warehouse, in contrast with ELT (Extract, Load, Transform) tools, which first transfer the raw data into the warehouse and then transform it there for later use.

Have more than one transformation process that needs to run against the same database? No worries! You can schedule the ETL jobs as batch processes.
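One way to picture such batching: run the transformation jobs one after another in a fixed order rather than concurrently. The two jobs below (case normalization and de-duplication) are hypothetical, and note that the order you schedule them in matters.

```python
# Hypothetical transformation jobs that a scheduler (e.g. cron) would run
# as one sequential batch against the same data store.
def normalize_case(db):
    """Cleanse: lowercase every product name."""
    db["orders"] = [o.lower() for o in db["orders"]]

def dedupe_orders(db):
    """Cleanse: drop duplicate entries, keeping first occurrence."""
    db["orders"] = list(dict.fromkeys(db["orders"]))

# Normalizing BEFORE de-duplicating lets "WIDGET" and "widget" collapse.
BATCH = [normalize_case, dedupe_orders]

db = {"orders": ["WIDGET", "widget", "GADGET"]}
for job in BATCH:   # jobs run sequentially, never concurrently
    job(db)
print(db["orders"])
# ['widget', 'gadget']
```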

When do we need ETL?

Let’s take an example for better understanding. Suppose you have an online order processing system that handles the orders customers place on your eCommerce site. The system keeps orders even after they are shipped (with the status “completed”), but this can clog your database with a humongous number of old orders. For better management, data mining, and analysis, you might want to move those completed orders to a separate system that contains just them.

So, to consolidate the historical data from all these disparate sources, you set up an ETL system that moves the data from the smaller operational databases into a more meaningful long-term warehouse.
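In code, that archival ETL job might look like the sketch below. The `orders` schema, the status values, and the use of two SQLite databases are all assumptions made for illustration; a real deployment would point at your actual operational and warehouse databases.

```python
import sqlite3

# Hypothetical live order-processing database.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
live.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "completed"), (2, "pending"), (3, "completed")],
)

# Hypothetical long-term archive warehouse with the same schema.
archive = sqlite3.connect(":memory:")
archive.execute("CREATE TABLE orders (id INTEGER, status TEXT)")

# Extract the completed orders, load them into the archive, then
# purge them from the live system so it stays lean.
done = live.execute(
    "SELECT id, status FROM orders WHERE status = 'completed'"
).fetchall()
archive.executemany("INSERT INTO orders VALUES (?, ?)", done)
live.execute("DELETE FROM orders WHERE status = 'completed'")

print(live.execute("SELECT COUNT(*) FROM orders").fetchone()[0])     # 1
print(archive.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 2
```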

Based on the complexity of your working environment, you can pick a suitable ETL tool for your purpose, or, lo and behold, you can even build one of your own in a suitable programming language! Speaking of complexity, how exactly do we know how complex our environment is? Start by counting how many source systems feed your ETL system.

Next, define what kind of transformation your data requires and how difficult it is to apply that transformation so the existing data fits the target system. Lastly, design a feedback loop that constantly checks for errors and discrepancies in your ETL system. There you go! You now have your own Extract, Transform and Load!
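That feedback loop can start out as simply as a post-load reconciliation check. A minimal sketch, assuming the check compares row counts and amount totals between source and target (the sample rows and thresholds are hypothetical):

```python
# Hypothetical (id, amount) rows as read from the source and the target.
source_rows = [(1, 10.5), (2, 3.99), (3, 7.25)]
target_rows = [(1, 10.5), (2, 3.99), (3, 7.25)]

def reconcile(src, dst):
    """Return a list of discrepancy messages; empty means the load looks clean."""
    errors = []
    if len(src) != len(dst):
        errors.append(f"row count mismatch: {len(src)} vs {len(dst)}")
    # Tiny tolerance guards against floating-point rounding noise.
    if abs(sum(a for _, a in src) - sum(a for _, a in dst)) > 1e-9:
        errors.append("amount totals differ")
    return errors

print(reconcile(source_rows, target_rows))  # [] -- no discrepancies
```

Running a check like this after every load, and alerting when the list is non-empty, closes the loop described above.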
