SSIS Lookup or T-SQL Join

Background

In BI solution, it is quite usual to fetch the dimension surrogate key(s) from the dimension table(s) into the fact table to satisfy the requirement of star-schema design in data warehouse system. And the possible approaches to achive this basically could be either to choose “Lookup” component in SSIS or T-SQL “JOIN”. This paper examines both approaches from some different prespecetives. And it suggests what is the best practice to use SSIS lookup in ETL solution.

 

Tow Approaches Comparison

Overview

The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.

TSQL Join, generally speaking, serves the exactly same functionality to extend the columns in projection list horizontally.

Capabilities

Both methods can meet the basic requirement in terms of the most common feature–fetching the surrogate keys from the dimension tables.

However, in some very special cases, it only drops into one not the other. SSIS can well handle even the reference dataset coming from a heterogeneous database like text file, DB2 etc. On the other hand, Neelesh points out that TSQL can cope with more complicated business logic especially in the case of dependency between two datasets.

  

Performance

As both of the implementaion can satisfy the most situations, then the question is which one can help us gain better performance. Although it really depends on many factors and could be pretty complicated, some thumb rules most are on the same page and we can follow or we need to reconsider use instead of  only by your preference are listed in below table.

 

 

SSIS

SQL Join

Isolated and Powerful SSIS Server

R

 

DB server shared for multiple applications

R

 

Key Lookup crossing DB instances

HR

 

Error out No-Match rows  

R

 

Re-using dimension tables by catch connection

HR

 

SSIS buffer cannot hold all dimension data(>1M rows)

 

R

Blocking(or say “asynchronous”) components used like “Sort”, “aggregation”

 

R

Data loading based on the complicated dependency

 

HR

 

(R: Recommended; HR: Highly Recommended)

 

Ease of development/maintenance

More often than not, it is believed the source table for the fact table locates in the stageDB and dimension tables should be in the EDW DB. If using JOIN, what you have to do is to hard code the DB name or say to use corss-DB join. On the other hand, via lookup in SSIS solution, you can easily configure the DB connections to solve this problem, which makes the design more loose-coupled and open to extension for the case DB name changing or even DB migrate to the other server.

Some may think, however, if the business logic change, SQL join can save you re-deploy the package. What they need to do is to apply the updated SQL JOIN(if it is a proc in DB). This is the credits that SQL Join(or say procedure) has.

  

Other Considerations

In addition to the points underscoring above, there might be some other concerns.Local regulation is one of them. Dallas Team contributes a good example that the customer “XXX” requests to wrap the business logic outside of SSIS and put more logics close to DB layer as much as possible. The other  is like the developers’ skill set. If developers are quite competitive in T-SQL and limited knowledge in SSIS lookup, SQL Join is an option under the circumstance of no performcen spike.

 

Recommendation

As a BI solution, SSIS Lookup is the first choice with the exception of performance issues or other strong concern like customers’ company standard.

The other piece of reminding is, if you choose to use SQL JOIN finanly, that seriously conside to index well on the source table and conside the indexing in your whole ETL solution compreshensively. Drop them when loading data and put them back when joining.

Best Practice When Using “Lookup” in SSIS

#1, “Full Catch” mode can dramatically improve performance.(The only disadvantage by this way is case-sensitive, which might not be fit for your ETL real situation.)

How to achieve this for looking up range value?—Using an assistant view to flat the reference data  then still choose the way of “Full Catch”

 

 

#2, SORT ahead before lookup does NOT help.

Putting sort(order by) in TSQL or using “SORT” component does NOT bring us benefit. On the other hand, it  blocks the data flow.

 

 

#3, Do NOT output the unnecessary columns to downstream.(what are uncessary columns? Once key fetch, the join column in fact source side more often than not does NOT need to flow to downstream again unless it is used for next lookup or partially introduced into calculation logic for the downsteam) By this way, Less memory is held up by the ETL execution tree(thread).

 

#4, Using TSQL to fetch reference dataset instead of to choose a table a view

 

 

#5, Setup the option for no match as “Redirect rows to no match output” and handle these “no match” either assign a dummy key or error out.

 

 

#6, Cautious to use “Catch Connection”, by which, the performance of ETL could be upgraded once again however it can NOT reflect the new values in the persisted catahced file.If your dimension table values keep still and multiple fact tables need to lookup this one, “Catch Connection” is your friend.

MSDN “How to” topic http://msdn.microsoft.com/en-us/library/bb895289.aspx

 

 

Reference Documents

#1, Inequality joins, Asynchronous transformations and Lookups : SSIS

http://sqlblog.com/blogs/jamie_thomson/archive/2010/05/14/inequality-joins-asynchronous-transformations-and-lookups-ssis.aspx

#2, SSIS: Asynchronous and synchronous data flow components

http://consultingblogs.emc.com/jamiethomson/archive/2005/02/11/SSIS_3A00_-Asynchronous-and-synchronous-data-flow-components.aspx

 

#3, When to use T-SQL and SSIS

http://www.jamesserra.com/archive/2011/08/when-to-use-t-sql-or-ssis-for-etl/

 

Published by

Admin

focusing DB, Big Data and BI tech.

2 thoughts on “SSIS Lookup or T-SQL Join”

Leave a reply to Tieme Cancel reply