ETL for Data Warehouses – The SSIS vs. T-SQL debate

I see many questions about using the built-in components in SSIS versus writing T-SQL code to perform dimension/fact updates, especially type 2 dimension changes (changes that track history).

In the old DTS days (SQL Server 2000), the choice was pretty simple and straight-forward. DTS (Data Transformation Services) was a great tool for controlling workflow and automating the environment, but it was terrible at performing in-memory and/or set-based operations. Add to that the fact that we had no components specifically for loading data warehouse dimensions, and your mind was made up: Control the workflow with DTS, stage the data in staging tables and create stored procedures to load the dimensions. I suppose you could call it the old-school method.

With SSIS today (SQL Server 2005/2008) the choice is much harder. The tool has improved by leaps and bounds. We now have the ability to cache lookup data, perform much more sophisticated transformations, and above all we have a component dedicated to loading dimensions. These are all very appealing and make the choice so much harder…exponentially harder if you are new to Business Intelligence or SSIS.

So the question remains: Which one should I choose for loading dimensions/facts?

From my experience developing ETL processes for a few large data warehouses, I still believe that T-SQL is the way to go (yes, call me old-school)…and here are the reasons for my argument:

  • The SSIS SCD component was developed for ease of use, but is unfortunately too rigid for most real-life scenarios. If you don’t like to write T-SQL code and only want to tweak a few component properties, the SCD task is exactly what you wanted. And you will probably learn a few things about how slowly changing dimensions work, but that’s unfortunately where it ends. If what you are trying to do is slightly beyond the built-in capabilities of the SSIS component (like adding custom flags to indicate current records etc.), you will spend a lot of time working around the limitations of the component…and probably end up writing a lot of T-SQL code anyways.
  • Row-by-row (RBAR) processing just won’t cut it. Run a trace while processing your dimension through the SSIS component and you will see what I mean. It works pretty well for anything less than a thousand rows, but doesn’t scale beyond that. Most data warehouses are extremely large, and processing a row at a time is far from good enough if you have the power of set-based operations right at your fingertips.
  • Change management is so much easier with stored procedures (and yes, I prefer stored procedures over native T-SQL code in SSIS). Without 3rd-party tools it is almost impossible to see exactly what has changed in an SSIS package. In my opinion at least, SSIS packages require a lot more attention before changes can be deployed…simply because there are so many properties and tasks that need to be reviewed. Well documented stored procedures are more visible/transparent (for lack of a better word), and do not require SSIS packages to be redeployed when changed.
  • Staging data is still a good idea. Troubleshooting ETL failures is a nightmare if you don’t stage your data, especially if you don’t know the source system well or if you do not have direct access to it. The benefit of having a local reference to your source data is also significant. You could still stage the data and use the SCD task, but it seems like many people choose not to when going that route.

 

The above are some of the basic reasons I prefer to control the workflow with SSIS, but use stored procedure calls within the packages to perform all the work when loading dimensions & facts. You may argue that I am oversimplifying things, that you should use a combination of tasks (cached lookup transforms etc.) to get the most out of SSIS and the fact that certain operations could be done while the data is still in memory.

That is of course true, and proves the fact that this debate is too complicated and diverse to conclude within the confines of a single (non short-story) blog post. There are too many scenarios to cover, but the one thing I can say with certainty is that the T-SQL approach has never left me in a dead-end.

My final word of advice in terms of this debate is this: Be sensible. Know what SSIS does well and use that to your advantage. But don’t use SSIS components as an easy way out. The fact that a component is there doesn’t mean it should be used.

T-SQL is the fundamental language of SQL Server, and was designed to perform well with sets of data. Learn how to use it, and you will reap the benefits for years to come.

As a side note, I’ve just finished setting up my schedule for the PASS Summit next month. There will be a session on this topic, so please attend if you want to get some other points of view.

ETL for Data Warehouses – The SSIS vs. T-SQL debate