Column Store Index in SQL2012 Part(I)

Introduction on SQL Server 2012 “ColumnStore” Index(Part I)
Overview
Is “ColumnStore” a new idea? No, it has been introduced even back to 1970s. Is this new for SQL Server Database? Yes, it is newly introduced into SQL Server database engine in this version (SQL Server 2012). This new idea combining with other enhancements in algorithm and execution plan can dramatically accelerate the queries in data warehouse. Furthermore, this technology can make the whole business intelligence solutions Microsoft provides more attractive.
Concept of “ColumnStore” Index
“ColumnStore” is the concept comparing with the traditional storage mode “row-store”, which stores data horizontally or say row by row. However, “ColumnStore” is the idea to store the data on the pages allocated vertically or say column by column.
For example, you have the table with data as below.
SalesOrderNumber ProductKey OrderDateKey CustomerKey PromotionKey
SO43697 310 20010701 21768 1
SO43698 346 20010701 28389 1
SO43699 346 20010701 25863 1
SO43700 336 20010701 14501 1
SO43701 346 20010701 11003 1
SO43702 311 20010702 27645 1
SO43703 310 20010702 16624 1
SO43706 312 20010703 27621 1
The way using row-store, it goes like the figure 1 (Simplified)

(Figure1: Row-Store)
On the contrary, the “ColumnStore” looks like below (Simplified)

(Figure2: Column-Store)
This idea is not that complicated and now we are thinking what this simple idea can bring us.
How it helps on EDW queries?
Notice or not? If storing data vertically, namely “ColumnStore”, there are much more duplicated values on every single page than the way of “RowStore”. Based on this, as part of “ColumnStore” implementation, compression is hired. As a result, one page with the same size 8K can store much more data relatively comparing with “Row-Store”. By doing so, I/O writes and I/O reads will greatly decrease, which eventually improve the performance.
Let’s think about a typical data warehouse query below.
select S.StoreName,P.ProductName, Sales=Sum(SalesAmount)
from [dbo].[FactOnlineSales] F
inner join dimStore S on F.StoreKey=S.StoreKey
inner join dimProduct P on F.ProductKey=P.ProductKey
where P.className=’Economy’ and S.OpenDate>=’2004-01-01′
group by S.StoreName,P.ProductName
order by Sum(SalesAmount)
(Table1: EDW Query)
This typical EDW query answers the business question that describes as “show me the sales amount breaking down by store name and productname on the condition that product’s classname is ‘Economy’ and the store opendate is greater than the year of 2004”. If using the “Row-store”, probably DB engine has to scan all the pages from the beginning to the end to filter out the rows which doesn’t meet the condition. On the other hand, if “column store” plugged in this table, it is enough to scan the pages, on which hold the data of columns “StoreKey, productKey, salesAmount” (Let’s focus on the fact table because normally we have less concern on the dimension tables). Therefore, when SQL Server copes with this query, it only needs to pop up a couple of “Green” pages from disk to memory as shown in figure3 and the overhead scanning on “Blue” pages can be saved.

(Figure3: Fewer Pages Scanning)
In addition to scanning fewer pages, if using the column store index on the table, the execution engine also improves on the parallelism algorithm and introduces a new execution element “Batch Hash Table Build”(The detailed discussion will be covered in future topics).
Performance Illustration
In this section, the query performance will be compared and examined under the cases with “ColumnStore” index, common index and no index. All testing and data collecting are based on my laptop (Intel Core i7 @2.8 GHz; 4G RAM; 64 Bit Win7). The tables are shown as below diagram, in which the fact table (FactOnlineSales) holds 12.6 Million rows.

(Figure 4: EDW Diagram)
First of all, Let’s create the “ColumnStore” index on the fact table “FactOnlineSales”. Creating “ColumnStore” index is nothing special but one more keyword “Columnstore” is needed as below.
–It runs in 1’29s on my machine
CREATE NONCLUSTERED COLUMNSTORE INDEX [NonClusteredColumnStoreIndex_Demo]
ON [dbo].[FactOnlineSales]
(
[OnlineSalesKey],
[DateKey],
[StoreKey],
[ProductKey],
[PromotionKey],
[CurrencyKey],
[CustomerKey],
[SalesOrderNumber],
[SalesOrderLineNumber],
[SalesQuantity],
[SalesAmount],
[ReturnQuantity],
[ReturnAmount],
[DiscountQuantity],
[DiscountAmount],
[TotalCost],
[UnitCost],
[UnitPrice],
–[ETLLoadID],
–[LoadDate],
[UpdateDate]
)WITH (DROP_EXISTING = OFF) ON [PRIMARY]
(Table2: Creating “ColumnStore” Index)
As you have noticed that the “columnstore” index can cover all the columns, which could potentially be introduced to answer the business questions, but it is still suggested to knock off the columns that business users don’t care for example, “LoadDate”, very probably your business users have no interested to analyze the sales sliced by “loaddate”. Removing this kind of columns out of your “ColumnStore” index not only saves index space but also reduces the index-creating duration. The duration we create this column store index based on 12.6 Million rows lasts 90 seconds.
First of all, before we run any query, let’s clear the SQL Server Catch by the query in table3 and turn on the IO reads message printing by the query in table4, which can help us fair enough to compare two queries not only by the duration indicator but also the “reads”.
–Always clean up the catch/buffer
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
(Table3: Cleanup the catch/buffer)
–Open IO read/show execution Plan
set statistics IO on
(Table4: Turn on IO Reads message)
After this setting up, let’s run the query in the table5. As the “columnstore” index is on place, at this time, the query running can well leverage the “columnstore” index, which will be shown to you by switching on the execution plan in the next section.
–Duration: =’2004-01-01′
group by S.StoreName,P.ProductName
order by Sum(SalesAmount)

(Table5: Query with “ColumnStore” index)
It is critical that we notice this query runs less than 1 second and the reads totally hit around 6.4K bytes. Moving on, run this query again by ignoring the “ColumnStore” as below.
Duration: 14s
(2643 row(s) affected)
Table ‘DimStore’. Scan count 5, logical reads 28, physical reads 1, read-ahead reads 3, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘DimProduct’. Scan count 5, logical reads 127, physical reads 1, read-ahead reads 43, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘FactOnlineSales’. Scan count 5, logical reads 44310, physical reads 1, read-ahead reads 44017, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
(1 row(s) affected)
*/
select S.StoreName,P.ProductName, Sales=Sum(SalesAmount)
from [dbo].[FactOnlineSales] F
inner join dimStore S on F.StoreKey=S.StoreKey
inner join dimProduct P on F.ProductKey=P.ProductKey
where P.className=’Economy’ and S.OpenDate>=’2004-01-01′
group by S.StoreName,P.ProductName
order by Sum(SalesAmount)
option(ignore_nonclustered_columnstore_index)
(Table6: Query ignoring “ColumnStore” index)
Surprisingly, it is notable that running the same query ignoring the “ColumnStore” index makes the reads up to 44K bytes and the duration increases dramatically to 14 seconds. It is greatly apparent how the “ColumnStore” index helps to accelerate the typical data warehouse queries. Now, you may remind me whether creating the common indexes on the columns ”StoreKey” and “productKey” helps or not. The answer is quite clear “NO” because the common indexes cannot help for low selective query. I also did the test and data shows the IO reads go around 43K bytes and duration is nothing better at all.
Thus, below table can well wrap up the performance illustration comparing the different index design strategies we did. The whitepaper Microsoft published points out, in their test, the performance with the “ColumnStore” index can help to speed up to 455 times in some cases.

My Results Reads Duration
ColumnStore 6271 0.342 s
No ColumnStore (Or common indexes) 44310 12.377 s
Speedup >7X >36X
(Table7: Performance Illustration)
As I mentioned before, some enhanced execution elements(Batch hash Table Build and Enhanced parallelism) are introduced combining with the existence of the “ColumnStore”, which shows in figure5.

Conclusion
In this paper (Part I), it has been proved that the “ColumnStore” index can largely improve the data warehouse queries performance, which makes the whole business intelligence solution more attractive for end users and saves developers productivity greatly on performance tuning. In the coming part II, it will cover what typical scenarios we should consider to use the “ColumnStore” indexes, what cases we might need to avoid using and also , which is incredibly important, what is the consideration on loading data if there is a “ColumnStore” index enabled on the table.