Skip to content
May 16, 2012

Let’s GO

Going back one year, when I worked in China, the scene we argued on the exact use of “GO” in TSQL is still clear till today.

This is a tiny corner of SQL Server statement however I reckon most of us may not have a solid understanding on this. Here I would use this post to clarify the GO statement in TSQL.

  • “GO” is to seperate the statements into batches.
  • Every batch is a set of commands submitted to SQL server DB engine.
  • Furthermore, Batches determine variable scope and name resolution.

(not easy to understand, maybe easy, think about you delcare one variable in the beginnig, which is repeatedly used in all this page. If inserting any row “GO” in the middle, which means you announce to finish the scope of the variable using onwards.)   

Well, Let’s “GO”. Take yours aways.

 

March 13, 2012

SSIS Lookup or T-SQL Join

Background

In BI solution, it is quite usual to fetch the dimension surrogate key(s) from the dimension table(s) into the fact table to satisfy the requirement of star-schema design in data warehouse system. And the possible approaches to achive this basically could be either to choose “Lookup” component in SSIS or T-SQL “JOIN”. This paper examines both approaches from some different prespecetives. And it suggests what is the best practice to use SSIS lookup in ETL solution.

 

Tow Approaches Comparison

Overview

The Lookup transformation performs lookups by joining data in input columns with columns in a reference dataset. You use the lookup to access additional information in a related table that is based on values in common columns.

TSQL Join, generally speaking, serves the exactly same functionality to extend the columns in projection list horizontally.

Capabilities

Both methods can meet the basic requirement in terms of the most common feature–fetching the surrogate keys from the dimension tables.

However, in some very special cases, it only drops into one not the other. SSIS can well handle even the reference dataset coming from a heterogeneous database like text file, DB2 etc. On the other hand, Neelesh points out that TSQL can cope with more complicated business logic especially in the case of dependency between two datasets.

  

Performance

As both of the implementaion can satisfy the most situations, then the question is which one can help us gain better performance. Although it really depends on many factors and could be pretty complicated, some thumb rules most are on the same page and we can follow or we need to reconsider use instead of  only by your preference are listed in below table.

 

 

SSIS

SQL Join

Isolated and Powerful SSIS Server

R

 

DB server shared for multiple applications

R

 

Key Lookup crossing DB instances

HR

 

Error out No-Match rows  

R

 

Re-using dimension tables by catch connection

HR

 

SSIS buffer cannot hold all dimension data(>1M rows)

 

R

Blocking(or say “asynchronous”) components used like “Sort”, “aggregation”

 

R

Data loading based on the complicated dependency

 

HR

 

(R: Recommended; HR: Highly Recommended)

 

Ease of development/maintenance

More often than not, it is believed the source table for the fact table locates in the stageDB and dimension tables should be in the EDW DB. If using JOIN, what you have to do is to hard code the DB name or say to use corss-DB join. On the other hand, via lookup in SSIS solution, you can easily configure the DB connections to solve this problem, which makes the design more loose-coupled and open to extension for the case DB name changing or even DB migrate to the other server.

Some may think, however, if the business logic change, SQL join can save you re-deploy the package. What they need to do is to apply the updated SQL JOIN(if it is a proc in DB). This is the credits that SQL Join(or say procedure) has.

  

Other Considerations

In addition to the points underscoring above, there might be some other concerns.Local regulation is one of them. Dallas Team contributes a good example that the customer “XXX” requests to wrap the business logic outside of SSIS and put more logics close to DB layer as much as possible. The other  is like the developers’ skill set. If developers are quite competitive in T-SQL and limited knowledge in SSIS lookup, SQL Join is an option under the circumstance of no performcen spike.

 

Recommendation

As a BI solution, SSIS Lookup is the first choice with the exception of performance issues or other strong concern like customers’ company standard.

The other piece of reminding is, if you choose to use SQL JOIN finanly, that seriously conside to index well on the source table and conside the indexing in your whole ETL solution compreshensively. Drop them when loading data and put them back when joining.

Best Practice When Using “Lookup” in SSIS

#1, “Full Catch” mode can dramatically improve performance.(The only disadvantage by this way is case-sensitive, which might not be fit for your ETL real situation.)

How to achieve this for looking up range value?—Using an assistant view to flat the reference data  then still choose the way of “Full Catch”

 

 

#2, SORT ahead before lookup does NOT help.

Putting sort(order by) in TSQL or using “SORT” component does NOT bring us benefit. On the other hand, it  blocks the data flow.

 

 

#3, Do NOT output the unnecessary columns to downstream.(what are uncessary columns? Once key fetch, the join column in fact source side more often than not does NOT need to flow to downstream again unless it is used for next lookup or partially introduced into calculation logic for the downsteam) By this way, Less memory is held up by the ETL execution tree(thread).

 

#4, Using TSQL to fetch reference dataset instead of to choose a table a view

 

 

#5, Setup the option for no match as “Redirect rows to no match output” and handle these “no match” either assign a dummy key or error out.

 

 

#6, Cautious to use “Catch Connection”, by which, the performance of ETL could be upgraded once again however it can NOT reflect the new values in the persisted catahced file.If your dimension table values keep still and multiple fact tables need to lookup this one, “Catch Connection” is your friend.

MSDN “How to” topic http://msdn.microsoft.com/en-us/library/bb895289.aspx

 

 

Reference Documents

#1, Inequality joins, Asynchronous transformations and Lookups : SSIS

http://sqlblog.com/blogs/jamie_thomson/archive/2010/05/14/inequality-joins-asynchronous-transformations-and-lookups-ssis.aspx

#2, SSIS: Asynchronous and synchronous data flow components

http://consultingblogs.emc.com/jamiethomson/archive/2005/02/11/SSIS_3A00_-Asynchronous-and-synchronous-data-flow-components.aspx

 

#3, When to use T-SQL and SSIS

http://www.jamesserra.com/archive/2011/08/when-to-use-t-sql-or-ssis-for-etl/

 

February 1, 2012

Debate on Null or NOT Null in data warehouse system

This is always an interesting topic in DB design. Today, I would present my point of view of NULL or NOT null in OLAP system(Here basically I mean the syetem of data warehouse instead of OLTP).

If you have reviewed some threads online about this debate, you might tend to believe or tend to be convinced that NULL values should be forbidden to appear in EDW.

Personally, It depends however. Before I discuss furthur, I would seperate the scenarios into four buckets. First, foreign key in fact table; Secondly, attributes in fact table; Next, the measures in fact table; and the last not the least, the attributes in dimension table.

For the 1st case, foreign keys in fact table, if they can not match a value in the dimension tables, which value assigned to them is most appropriate? My answer, -1 or -2, whatever you predefined in your dimension. It is generally agreed on this point for the vast majority of DWH developers.

Secondly, attributes in fact tables. I prefer “NULL” for this kin dof cases. Imagine,  for manufactory EDW, if the order is shipped, this order basicaly has a delivery number, however, the tarcking information of the delivery information is not the focusing for the manufactory company or they dont have this ability to hold this tracking information(No tin their system) like which date the goods arrives in Nevada and which date leaves US. Meanwhile, we still wanna put this nformation for this order for some purposes. If you pull all shipped orders into EDW, ideally this value should NOT be a NULL. Yet, more often than not, if you carry over all orders which are not shipped , they dont ahve this values, which is understandable. So, to me, I would assign NULL to this case. 

 The third one, which is the most important one in my mind, how to assign a valid value to the measure in the fact table. Someone raised a very good example about the topic(check out this http://kimballgroup.forumotion.net/t277-null-values-in-facts-yes-or-no) It talks about the grades for the courses. Some courses has from 1 to 10 grades but the other ones only count as Pass/Failed. So the question is that which value will be assigned to the column “grades” if they are the Pass/Fail courses. Once agin, my short answer is NULL. Most would argue, if 1 to 10 is a valid value, why not 0 or -1? The reason is that this is a measure column, 0 or -1 any certainty value misleads the aggregation(Not only thinking about SUM but others like AVG).  Try the query below, I believe it will give you a more comfortable reason to believe that NULL has its values .

with E as
(
select a=5
union all
select a=6
union all
select a=null
union all
select a=7
)
select AVG(a) from E

 ;with E as
(
select a=5
union all
select a=6
union all
select a=0
union all
select a=7
)
select AVG(a) from E

The last one is about the attribute in dimension tables. Honestly, I dont wanna offer an arbitariry answer right now. More professionals lean to use the “NA” or something to substitute the “NULL” here for the cnditions if this value is not maintained in source system. This makes sense to me if you think NULL in this case would make business people confusing when using the CUBE. Hum, yes, maybe yes. But I am not totally against using NULL in this case

 

 

 

 

December 30, 2011

Get the Major;Minor;Build Number for ETL packages

I was assigned one task to maintain all version numbers for our 90 packages. Possible?! I doubted it until I found this way.

Think about this, SSIS package is a XML-based file and SQL server works tightly with XML type. This is the basical thought.
By the way , if you deploy all packages into SQL server intead of file system you can fetch the version too however it is not the one in BIDS.

Back! How to fetch the version numbers from the packages deployed in the file system. Here you go,

sp_configure 'show advanced options', 1;
reconfigure;
GO
sp_configure 'xp_cmdshell', 1;
reconfigure;
GO

use tempdb
go
DECLARE @Path VARCHAR(2000);
SET @Path = 'C:\Sandbox\*.dtsx';

DECLARE @MyFiles TABLE (MyID INT IDENTITY(1,1) PRIMARY KEY, FullPath VARCHAR(2000));
DECLARE @CommandLine VARCHAR(4000) ;

SELECT @CommandLine =LEFT('dir "' + @Path + '" /A-D /B /S ',4000);
INSERT INTO @MyFiles (FullPath)
EXECUTE xp_cmdshell @CommandLine;
DELETE
FROM @MyFiles
WHERE FullPath IS NULL
OR FullPath='File Not Found'
OR FullPath = 'The system cannot find the path specified.'
OR FullPath = 'The system cannot find the file specified.';

IF EXISTS (select * from sys.tables where name = N'pkgStats')
DROP TABLE pkgStats;
CREATE TABLE pkgStats(
PackagePath varchar(900) NOT NULL PRIMARY KEY
, PackageXML XML NOT NULL
);

DECLARE @FullPath varchar(2000);
DECLARE file_cursor CURSOR
FOR SELECT FullPath FROM @MyFiles;
OPEN file_cursor
FETCH NEXT FROM file_cursor INTO @FullPath;
WHILE @@FETCH_STATUS = 0
BEGIN
declare @sql nvarchar(max);
SET @sql = '
INSERT pkgStats (PackagePath,PackageXML)
select ''@FullPath'' as PackagePath
, cast(BulkColumn as XML) as PackageXML
from openrowset(bulk ''@FullPath'',
single_blob) as pkgColumn';
SELECT @sql = REPLACE(@sql, '@FullPath', @FullPath);
EXEC sp_executesql @sql;

FETCH NEXT FROM file_cursor INTO @FullPath;
END
CLOSE file_cursor;
DEALLOCATE file_cursor;

SELECT SUBSTRING(PackagePath,LEN(PackagePath) - CHARINDEX('\',REVERSE(PackagePath),0)+2,LEN(PackagePath)) AS PackageName
, PackagePath
, CreatorName
, PackageFormatVersion
, PackageType
, PackageDescription
, VersionMajor
, VersionMinor
, VersionBuild
, VersionGUID
, COUNT(*) AS NumberOfTasks
FROM (
select PackagePath
, PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''CreatorName''][1]','nvarchar(500)') AS CreatorName
, CAST(PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''PackageFormatVersion''][1]','varchar(3)') AS smallint) AS PackageFormatVersion
, CAST(PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
DTS:Executable[1]/@DTS:ExecutableType[1]','varchar(50)') AS varchar(50)) AS PackageType
, PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''Description''][1]','nvarchar(2000)') AS PackageDescription
, CAST(PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''VersionMajor''][1]','varchar(3)') AS smallint) AS VersionMajor
, CAST(PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''VersionMinor''][1]','varchar(3)') AS smallint) AS VersionMinor
, CAST(PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''VersionBuild''][1]','varchar(3)') AS smallint) AS VersionBuild
, CAST(PackageXML.value('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
/DTS:Executable[1]/DTS:Property[@DTS:Name=''VersionGUID''][1]','char(38)') AS char(38)) AS VersionGUID
, PackageXML
from pkgStats
) p
CROSS APPLY p.PackageXML.nodes('declare namespace DTS="www.microsoft.com/SqlServer/Dts";
//DTS:Executable[@DTS:ExecutableType!=''STOCK:SEQUENCE''
and @DTS:ExecutableType!=''STOCK:FORLOOP''
and @DTS:ExecutableType!=''STOCK:FOREACHLOOP''
and not(contains(@DTS:ExecutableType,''.Package.''))]') Pkg(props)
GROUP BY PackagePath
, CreatorName
, PackageFormatVersion
, PackageType
, PackageDescription
, VersionMajor
, VersionMinor
, VersionBuild
, VersionGUID;

December 30, 2011

SSIS-Lookup-Partial Catch Sucks!

It is obvious that I am lazy. Not any update on my blog in the past one month.

Well, here I only wanna remind all SSIS developer to take considerration very seriously to use the “partial catch” in Lookup component in SSIS.

My expierence learnt from the project informs me this fact. It costs 9 hours to finsh loading 12 milion records to go through some 40 lookup component, one of which uses “partial catch”. By contrast, if changing to full catch mode, it only takes 40 miinutes! Amazing~~it is true. I am extremely surprised by this.

Technically, why I would try the “partial catch”? Here is the case. There is a dimension table with a low boundary(min value) and a high boundary(max value) conbined as a business key. As it goes, you wanna fetch the surrogate key for your fatc data which flows into the pipeline. So what can we do? if using faull catch, you lose the chance to modify the TSQL and you can not do a range compare. One of options is to use the partial catch then modif y the TSQL something like this select * from (select * from...) where ?>=minValue and ?=O.MinNumberOfDays and H.num<=O.MaxNumberOfDaysHowever, this is not good option for the enterprise data warehouse.

Alternatively, it is highly suggested to use the full catch. How? using CTE or a “number” table to exapnd the range as row by row. Looks like below TSQL.(Personaly, I like the way of the table one.)
select num=convert(int,num ), OrderAging_Key from
(select top 10000 rOW_NUMBER() over (order by O1.object_id)-2000 as 'Num'
from sys.objects O1 , sys.objects O2, sys.objects O3, sys.objects O4) as H inner join dbo.Dim_OrderAging O
on H.num>=O.MinNumberOfDays and H.num<=O.MaxNumberOfDays

Hope this thread helps.

November 21, 2011

Dynamic ETL Framework

If you have bunch of tables to copy from application Source DB(s) to landing place(or say ODS layer),even you need to move the data from ODS to stage and stage to DWH. This framework may help you some to a certain degree.

The purpose of this framework is to mitigate the work in creating almost-duplicate ETL packages. Instead, you can have more time to smooth the business requirement.

Take a closer look. Supposing there are a1, a2,…an tables waiting for you to load them into ODS. Normally, you are requested to create one package for all or you create one package for one table accordingly to load the data from source to target. If this is the reality you are experiencing, check out this framework.

The idea of framework is easy to understand. If we have a table to maintain the source table name and its destination place/name even source columns to destination columns, then this dynamic framework can go through every table mapping and load the data according to your configuration in the config table. What you gain is to save the time in developing/unittesting/maintaining the packages. What you need to focus isto understand the business requirement and configure the mapping relationship in the right place and tell the framework which loading strategy you wanna implement flush-fill or SCD I…

Here are two possible ways of implementation. One is to use the SQL server procedures, which is pretty SQL intensive. The other is to leverage the SSIS interface in .Net framework to dynamic generate the packages and execute them.

The first version(SQL intensive, single thread) is in coding and will be released as estimated on Nov. 25th, 2011.
Tables schema attached. Feel free to check this out.

July 24, 2011

Move DB files

For some reasons, you may need to change DB file location from one to another.
As i can think of, there are at least 3 wways.
1.detach DB, move files, attach
2.backup/restore(indicate a new location when restoring)
3.This is a way I prefer to use but seems it does not have a formal name.
scripts for your reference
use USSut003
go
–tell SQL server, you wanna change the physical place for one file(could be log or data)
ALTER DATABASE USSut003
MODIFY FILE (NAME = ‘USSut003_PG_DAT_1′, FILENAME = ‘J:\BIILAND\MSSQL\DATA\USSut003.MDF’)

use master
go

ALTER DATABASE USSut003
SET offline

–Physical move data from one place to another

ALTER DATABASE USSut003
SET online

All done.

Follow

Get every new post delivered to your Inbox.