ETL for Data Warehouses – The SSIS vs. T-SQL debate

I see many questions about using the built-in components in SSIS versus writing T-SQL code to perform dimension/fact updates, especially type 2 dimension changes (changes that track history).

In the old DTS days (SQL Server 2000), the choice was pretty simple and straight-forward. DTS (Data Transformation Services) was a great tool for controlling workflow and automating the environment, but it was terrible at performing in-memory and/or set-based operations. Add to that the fact that we had no components specifically for loading data warehouse dimensions, and your mind was made up: Control the workflow with DTS, stage the data in staging tables and create stored procedures to load the dimensions. I suppose you could call it the old-school method.

With SSIS today (SQL Server 2005/2008) the choice is much harder. The tool has improved by leaps and bounds. We now have the ability to cache lookup data, perform much more sophisticated transformations, and above all we have a component dedicated to loading dimensions. These are all very appealing and make the choice so much harder…exponentially harder if you are new to Business Intelligence or SSIS.

So the question remains: Which one should I choose for loading dimensions/facts?

From my experience developing ETL processes for a few large data warehouses, I still believe that T-SQL is the way to go (yes, call me old-school)…and here are the reasons for my argument:

  • The SSIS SCD component was developed for ease of use, but is unfortunately too rigid for most real-life scenarios. If you don’t like to write T-SQL code and only want to tweak a few component properties, the SCD task is exactly what you wanted. And you will probably learn a few things about how slowly changing dimensions work, but that’s unfortunately where it ends. If what you are trying to do is slightly beyond the built-in capabilities of the SSIS component (like adding custom flags to indicate current records etc.), you will spend a lot of time working around the limitations of the component…and probably end up writing a lot of T-SQL code anyways.
  • Row-by-row (RBAR) processing just won’t cut it. Run a trace while processing your dimension through the SSIS component and you will see what I mean. It works pretty well for anything less than a thousand rows, but doesn’t scale beyond that. Most data warehouses are extremely large, and processing a row at a time is far from good enough if you have the power of set-based operations right at your fingertips.
  • Change management is so much easier with stored procedures (and yes, I prefer stored procedures over native T-SQL code in SSIS). Without 3rd-party tools it is almost impossible to see exactly what has changed in an SSIS package. In my opinion at least, SSIS packages require a lot more attention before changes can be deployed…simply because there are so many properties and tasks that need to be reviewed. Well documented stored procedures are more visible/transparent (for lack of a better word), and do not require SSIS packages to be redeployed when changed.
  • Staging data is still a good idea. Troubleshooting ETL failures is a nightmare if you don’t stage your data, especially if you don’t know the source system well or if you do not have direct access to it. The benefit of having a local reference to your source data is also significant. You could still stage the data and use the SCD task, but it seems like many people choose not to when going that route.

 

The above are some of the basic reasons I prefer to control the workflow with SSIS, but use stored procedure calls within the packages to perform all the work when loading dimensions & facts. You may argue that I am oversimplifying things, that you should use a combination of tasks (cached lookup transforms etc.) to get the most out of SSIS and the fact that certain operations could be done while the data is still in memory.

That is of course true, and proves the fact that this debate is too complicated and diverse to conclude within the confines of a single (non short-story) blog post. There are too many scenarios to cover, but the one thing I can say with certainty is that the T-SQL approach has never left me in a dead-end.

My final word of advice in terms of this debate is this: Be sensible. Know what SSIS does well and use that to your advantage. But don’t use SSIS components as an easy way out. The fact that a component is there doesn’t mean it should be used.

T-SQL is the fundamental language of SQL Server, and was designed to perform well with sets of data. Learn how to use it, and you will reap the benefits for years to come.

As a side note, I’ve just finished setting up my schedule for the PASS Summit next month. There will be a session on this topic, so please attend if you want to get some other points of view.

Advertisements
ETL for Data Warehouses – The SSIS vs. T-SQL debate

9 thoughts on “ETL for Data Warehouses – The SSIS vs. T-SQL debate

  1. How about just using SQL agent jobs and stored procs and skipping SSIS altogether? I suppose you can’t really script cube processing in a stored proc though right?

    1. Good question. SSIS has additional functionality which comes in very handy when developing ETL processes.

      Some of these functions not easily recreated with SQL Agent are:

      1. The ability to create complex workflows. It’s not always as simple as redirecting when an error occurs.
      2. Comprehensive error handling, by means of the event handlers.
      3. The integration of .NET scripts into the workflow, which gives you a lot of flexibility.
      4. With SSIS you can run tasks in parallel.
      5. File system operations are much easier with the built-in tasks in SSIS.
      6. Variables and configurations give you the ability to make SSIS packages portable and dynamic, something which is not easy to replicate in SQL Agent.

      I firmly believe that the best tool for the job should be used in all cases. SQL Agent is a good scheduling tool with limited workflow capability…but it is definitely not an ETL or workflow tool.

      1. Good answer! I appreciate you going to the trouble of laying out those reasons.

        I’ve worked in ssrs mostly and less in seas and just starting to do some said work now. I guess I’ve seen some very bad DWs. One I am working with is part of MS Dynamics, and in 10 months I’ve watched it go from taking 4 hours to taking 8. Without really having looked at the packages I don’t know why but you know, set based SQL that is engine friendly should scale.
        From your article I might guess there are some SCD tasks that are not set based and causing the longer running time. I wonder if unchecked it might end up taking more than 24 hours to run. Amusing for a daily job.

        The consultants we outsource the said work to are scratching their heads over the reason too. It’s not like the DB or the DW are getting massive either.

        I might try and dissect some of the packages and see for myself. You’ve inspired me!

      2. Thanks for the kind words.

        Generally, the use of the SCD component in SSIS is highly discouraged. It does not scale well nor does it perform set-based operations. My suggestion before even looking at the SSIS packages: Update statistics and defrag indexes. I cannot tell you how many data warehouses I’ve seen that neglected maintenance…and that is often part of the reason for the poor performance.

  2. Good article :>). I prefer combination of both T-SQL and SSIS. Usually we run both database instance and ETL on the same box and it is always preferred to use set based queries. But since SSIS provided these transformation components they have to use it :(. I have worked with people on both sides where people just use SSIS transformation for everything which could have been done with T-SQL. I have seen ETLs built which looks like a market…

    I believe it’s always comes down to what each individual likes instead of using the best tool for the tasks.

    1. Thanks for the feedback. Personal preference (and skills for that matter) definitely plays a significant role in the way we develop solutions, and for the most part I don’t think that’s a problem. To use a component or transformation just because it exists, even if it performs badly is where the problem lies in my opinion. I guess a lot of it can be attributed to inexperience or ignorance…or both.

  3. pc says:

    I have worked in DW for many years and I totally agree with using TSQL for processing data on the same server. Using ETL tool such as Informatica and SSIS for all the transformation/validation/cleaning..etc for complicated process/logic will create a huge diagram in SSIS which is difficult to debug and maintain.
    Performance: Set/Row. For example: I converted an Informatica process from 4 hours to less than 20 minute using TSQL
    Functinality: Besides slow SCD performance in SSIS, TSQL provides more functionalities such as Partition Over, StdDev..etc in TSQL to manipulate data that I don’t think that SSIS can handle. I use both in my current project.
    Maintainabliltiy: In my current SSIS package, there are @10-15 task in average of which 6-7 tasks are used for audit and logging and the rest of tasks are to call stored procedures. If I have to convert all processes in stored procedure to SSIS tasks, it will end up over 40-50 tasks in SSIS. I will get dizzy at looking at 40-50 tasks in SSIS before I do any work. 🙂
    Having said this, SSI S will be a better choice for cross-server process or file system/FTP interface process ..etc

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s