Debate on Null or NOT Null in data warehouse system

This is always an interesting topic in DB design. Today, I would present my point of view of NULL or NOT null in OLAP system(Here basically I mean the syetem of data warehouse instead of OLTP).

If you have reviewed some threads online about this debate, you might tend to believe or tend to be convinced that NULL values should be forbidden to appear in EDW.

Personally, It depends however. Before I discuss furthur, I would seperate the scenarios into four buckets. First, foreign key in fact table; Secondly, attributes in fact table; Next, the measures in fact table; and the last not the least, the attributes in dimension table.

For the 1st case, foreign keys in fact table, if they can not match a value in the dimension tables, which value assigned to them is most appropriate? My answer, -1 or -2, whatever you predefined in your dimension. It is generally agreed on this point for the vast majority of DWH developers.

Secondly, attributes in fact tables. I prefer “NULL” for this kin dof cases. Imagine,  for manufactory EDW, if the order is shipped, this order basicaly has a delivery number, however, the tarcking information of the delivery information is not the focusing for the manufactory company or they dont have this ability to hold this tracking information(No tin their system) like which date the goods arrives in Nevada and which date leaves US. Meanwhile, we still wanna put this nformation for this order for some purposes. If you pull all shipped orders into EDW, ideally this value should NOT be a NULL. Yet, more often than not, if you carry over all orders which are not shipped , they dont ahve this values, which is understandable. So, to me, I would assign NULL to this case. 

 The third one, which is the most important one in my mind, how to assign a valid value to the measure in the fact table. Someone raised a very good example about the topic(check out this http://kimballgroup.forumotion.net/t277-null-values-in-facts-yes-or-no) It talks about the grades for the courses. Some courses has from 1 to 10 grades but the other ones only count as Pass/Failed. So the question is that which value will be assigned to the column “grades” if they are the Pass/Fail courses. Once agin, my short answer is NULL. Most would argue, if 1 to 10 is a valid value, why not 0 or -1? The reason is that this is a measure column, 0 or -1 any certainty value misleads the aggregation(Not only thinking about SUM but others like AVG).  Try the query below, I believe it will give you a more comfortable reason to believe that NULL has its values .

with E as
(
select a=5
union all
select a=6
union all
select a=null
union all
select a=7
)
select AVG(a) from E

 ;with E as
(
select a=5
union all
select a=6
union all
select a=0
union all
select a=7
)
select AVG(a) from E

The last one is about the attribute in dimension tables. Honestly, I dont wanna offer an arbitariry answer right now. More professionals lean to use the “NA” or something to substitute the “NULL” here for the cnditions if this value is not maintained in source system. This makes sense to me if you think NULL in this case would make business people confusing when using the CUBE. Hum, yes, maybe yes. But I am not totally against using NULL in this case

 

 

 

 

Advertisements