Friday, 22 February 2013

Data Warehousing Approaches and Data Warehouse Design Approaches


Data Warehouse Design Approaches



Data warehouse design is one of the key techniques in building a data warehouse. Choosing the right design approach can save the project time and cost. Two data warehouse design approaches are popular.

Bottom-Up Design:

In the bottom-up design approach, the data marts are created first to provide reporting capability. A data mart addresses a single business area such as Sales or Finance. These data marts are then integrated to build a complete data warehouse. The integration of data marts is implemented using the data warehouse bus architecture. In the bus architecture, a dimension is shared between the facts in two or more data marts; these dimensions are called conformed dimensions. The conformed dimensions are integrated across the data marts, and the data warehouse is built from them.
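To make the bus architecture concrete, here is a minimal T-SQL sketch of a conformed DimCustomer dimension shared by a hypothetical Sales mart and a hypothetical Finance mart (all table and column names are illustrative):

-- A conformed customer dimension: one table, shared by both marts.
CREATE TABLE dbo.DimCustomer (
    CustomerKey  INT IDENTITY(1,1) PRIMARY KEY,
    CustomerCode NVARCHAR(20)  NOT NULL,   -- business key from the source system
    CustomerName NVARCHAR(100) NOT NULL
);

-- Fact table of the Sales data mart.
CREATE TABLE dbo.FactSales (
    SalesKey    INT   IDENTITY(1,1) PRIMARY KEY,
    CustomerKey INT   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    SalesAmount MONEY NOT NULL
);

-- Fact table of the Finance data mart, referencing the same dimension,
-- so results from both marts can be combined on CustomerKey.
CREATE TABLE dbo.FactInvoice (
    InvoiceKey    INT   IDENTITY(1,1) PRIMARY KEY,
    CustomerKey   INT   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    InvoiceAmount MONEY NOT NULL
);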

Advantages of bottom-up design are:
  • This model contains consistent data marts and these data marts can be delivered quickly.
  • As the data marts are created first, reports can be generated quickly.
  • The data warehouse can be extended easily to accommodate new business units; it is just a matter of creating new data marts and integrating them with the other data marts.

Disadvantages of bottom-up design are:
  • The positions of the data warehouse and the data marts are reversed in the bottom-up design approach: the data marts are built before the enterprise-wide data warehouse.

Top-Down Design:

In the top-down design approach, the data warehouse is built first. The data marts are then created from the data warehouse.

Advantages of top-down design are:
  • Provides consistent dimensional views of data across data marts, as all data marts are loaded from the data warehouse.
  • This approach is robust against business changes. Creating a new data mart from the data warehouse is very easy.

Disadvantages of top-down design are:
  • This methodology is inflexible to changing departmental needs during the implementation phase.
  • It represents a very large project and the cost of implementing the project is significant.

Approaches of Data Warehousing:

The requirements of a Data Warehouse solution continually change during its lifetime. For some organizations you can expect more changes than for others; think of maturity and the history of organizational changes: mergers, acquisitions, etc.
You should adopt a solution type that will yield the best value taking these things into account.

Data Warehouse Solutions

When confronted with a new data warehouse solution that I am going to design or redesign, I will typically choose between one of the following four approaches:
  1. One-Shot Data Warehouse
  2. Keeping History in Dimensions
  3. Historical / Persistent Staging Area
  4. Data Vault
In this blog post I’ll briefly dive into these approaches and discuss some of the implementation issues.
The choice for a specific approach depends on a number of circumstances, notably:
  • The expected total size of the solution
  • The number of source systems and how the information in them overlaps
  • The average expected life-time of source systems: source system volatility
  • Current user requirements
  • Expected number of changes in user requirements: requirements volatility

1. One-Shot Data Warehouse (small- and mid-sized solution)

Current computing power will typically enable the complete build (from scratch) of a data warehouse solution for small- and mid-sized solutions within the typical nightly maintenance window. This can be a very efficient solution with a high return on investment. Some people call it the destroy-and-rebuild approach, since you remove all previous data from the data warehouse before rebuilding it.
An obvious disadvantage of this approach is that there is no tracking of changes in the source system: if a customer moves from London to Paris, all the historical sales will be reported as Paris sales. Another disadvantage is the sheer amount of data that is transferred every night; that can accumulate to an amount that cannot be loaded within the nightly maintenance window.
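A minimal T-SQL sketch of the destroy-and-rebuild idea, with hypothetical table names; note that the reload always picks up the customer's current attributes, which is exactly the London/Paris problem described above:

-- Destroy and rebuild: remove all previous data and reload completely each night.
TRUNCATE TABLE dbo.SalesReport;

INSERT INTO dbo.SalesReport (OrderDate, CustomerCity, SalesAmount)
SELECT s.OrderDate,
       c.City,          -- always the customer's *current* city, no history
       s.SalesAmount
FROM   dbo.Stg_Sales     AS s
JOIN   dbo.Stg_Customers AS c ON c.CustomerCode = s.CustomerCode;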

2. Keeping History in Dimensions

The problem of keeping track of history has been a major issue in data warehousing. In Ralph Kimball's theories, Slowly Changing Dimensions play an important role. In his books he describes several patterns for handling history in dimensions.

Type 1: Overwrite history

For some attributes the business isn't interested in keeping historical information. This is most appropriate when correcting previous errors, or in other situations where there is no business value in keeping track of history, for example when a person's name changes due to marriage. The rows that were imported earlier into the dimension table are corrected with an UPDATE statement.
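A minimal sketch of such a Type 1 correction, assuming a hypothetical DimCustomer dimension keyed on a CustomerCode business key:

-- Type 1: overwrite the attribute in place; no history is kept.
UPDATE dbo.DimCustomer
SET    CustomerName = 'Jane Smith-Jones'   -- corrected / new name
WHERE  CustomerCode = 'C-1001';            -- business key of the customer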

Type 2: Keep history

In this pattern history is preserved because a new record is entered for every change in a dimension attribute. The old row is marked as inactive and an end date is added to the row. Any new fact table records that are inserted after this dimension row has changed will link to the newly added row.
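A minimal sketch of a Type 2 change, assuming the same hypothetical DimCustomer dimension extended with StartDate, EndDate and IsActive tracking columns:

-- Type 2: end-date the current row and insert a new version.
DECLARE @Today DATE = CAST(GETDATE() AS DATE);

UPDATE dbo.DimCustomer
SET    EndDate  = @Today,
       IsActive = 0
WHERE  CustomerCode = 'C-1001'
  AND  IsActive = 1;                             -- mark the old row inactive

INSERT INTO dbo.DimCustomer (CustomerCode, City, StartDate, EndDate, IsActive)
VALUES ('C-1001', 'Paris', @Today, NULL, 1);     -- new current row; facts loaded
                                                 -- from now on link to this row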

Other types

Type 1 and Type 2 are the most commonly used. However there are some alternatives. They are described in this Wikipedia article.

Other advantages

Besides the tracking of history, another major advantage is that you only need the changed and new rows (the delta rows) from your source system, which reduces the time needed to load the data into your data warehouse. However, getting only these delta rows can be challenging.

3. Historical / Persistent Staging Area

In this approach you'll typically get a copy of all relevant source tables and add temporal information in a 'StartDate' and an 'EndDate' column. Again you'll only process the delta rows, loading new and changed rows, and whenever a row has changed, you end-date the old row.
Based upon this historical staging area you can adopt approach 1 or 2 to load the data into your dimensional model.
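A sketch of such a delta load, assuming the source rows arrive in a transient table and the persistent table carries StartDate/EndDate columns (all names are hypothetical):

DECLARE @LoadDate DATETIME2 = SYSDATETIME();

-- End-date the currently open rows that have changed in the source.
UPDATE h
SET    h.EndDate = @LoadDate
FROM   dbo.Hist_Customer    AS h
JOIN   dbo.Stg_CustomerLoad AS s
       ON s.CustomerCode = h.CustomerCode
WHERE  h.EndDate IS NULL
  AND  (s.CustomerName <> h.CustomerName OR s.City <> h.City);

-- Insert brand-new rows and the new versions of the changed rows.
INSERT INTO dbo.Hist_Customer (CustomerCode, CustomerName, City, StartDate, EndDate)
SELECT s.CustomerCode, s.CustomerName, s.City, @LoadDate, NULL
FROM   dbo.Stg_CustomerLoad AS s
LEFT   JOIN dbo.Hist_Customer AS h
       ON h.CustomerCode = s.CustomerCode AND h.EndDate IS NULL
WHERE  h.CustomerCode IS NULL;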

What are the reasons for having a persistent staging area?

Well, first of all it could be a demand from auditors or data governance initiatives, possibly driven by external regulations (Sarbanes-Oxley, Basel I, Basel II, HIPAA, etc.).
A second reason, which can be tied to the first, has to do with keeping control of, and reporting on, data quality issues in the source systems and thus:
  • Identify possible improvements in processes. (e.g. same data entered twice)
  • Increase / decrease confidence in the data, information and decisions
The third reason has to do with agility: the ability to respond to changes rapidly. Small changes, like changing a dimension attribute from Type 1 to Type 2 or adding additional attributes to a dimension, but also large changes like big organizational changes, mergers, or new ERP implementations. With a persistent staging area it is far easier to respond to these changes and rebuild the data marts (with history), resulting in quicker, better reporting after such a change.

4. Data Vault

The Data Vault is a special implementation of a persistent staging area. Source tables are split by column, separating the structural items (the business keys and the associations between them, stored in hubs and links) from the descriptive attributes (stored in satellites).
The same advantages therefore apply to the Data Vault as well. The power of the Data Vault comes from the integration on business keys. This approach is especially suitable if you have source systems with overlapping information and/or if source systems change frequently over time.
This wiki article by Ronald Kunenberg gives a nice overview of the Data Vault Methodology with a lot of additional references.
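As a rough illustration of that split, here is a minimal sketch of a hub and a satellite for a customer, with illustrative names and columns (a link table, holding the associations between the business keys of two or more hubs, would follow the same pattern):

-- Hub: only the business key plus load metadata.
CREATE TABLE dbo.HubCustomer (
    CustomerHK   INT IDENTITY(1,1) PRIMARY KEY,   -- surrogate hub key
    CustomerBK   NVARCHAR(20)  NOT NULL UNIQUE,   -- business key
    LoadDate     DATETIME2     NOT NULL,
    RecordSource NVARCHAR(50)  NOT NULL
);

-- Satellite: the descriptive attributes, historized by LoadDate.
CREATE TABLE dbo.SatCustomer (
    CustomerHK   INT           NOT NULL REFERENCES dbo.HubCustomer (CustomerHK),
    LoadDate     DATETIME2     NOT NULL,
    RecordSource NVARCHAR(50)  NOT NULL,
    CustomerName NVARCHAR(100) NULL,
    City         NVARCHAR(100) NULL,
    PRIMARY KEY (CustomerHK, LoadDate)
);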

Conclusion


In this article I discussed some aspects that you should consider when you choose an approach for (re)designing your Data Warehouse. This table summarizes these aspects:

Aspect                                          Applies to approach
Simple, fast solution                           1
Only process delta rows                         2, 3, 4
Keep track of historical information            2, 3, 4
Solve auditing / data governance issues         3, 4
Control over data quality in source systems     3, 4
Easily adapt to changing user requirements      3, 4
Source systems with overlapping information     4
Frequently changing source systems              4

SQL Server, SSIS, SSAS and SSRS on ONE Server


Understanding the difference between SSIS (SQL Server Integration Services), SSAS (SQL Server Analysis Services) and SSRS (SQL Server Reporting Services).


SQL Server 2008 R2 Services Availability in Editions
One key consideration and difference is that not all services come with the different editions of Microsoft SQL Server 2008 R2. The six SQL Server editions for SQL Server 2008 R2 are Datacenter, Enterprise, Standard, Web, Workgroup and Express. Integration Services is included in the Datacenter and Enterprise editions. Reporting Services is included in all editions of SQL Server 2008 R2; however, full reporting services are limited in the Standard, Web, Workgroup and Express editions. Lastly, Analysis Services is available in the Datacenter, Enterprise and Standard editions, with advanced analytical functions in the Enterprise and Datacenter editions only. The Standard edition does not include SQL Server PowerPivot for SharePoint, which is part of the basic Analysis Services package.

SQL Server Integration Services

The SQL Server Integration Services (SSIS) is the data-warehousing arm of the SQL Server 2008 R2 suite -- equipped with superior Extract, Transform and Load (ETL) capabilities. It provides the vehicle for moving data from one data source to another and changing the data, if necessary. The three components in the SSIS platform for starting the data integration process are the Import and Export Wizard, SSIS Designer and SSIS API Programming. The Import and Export Wizard simply transfers data from source to destination but does not include data transformation capabilities. The SSIS Designer is an integrated component of the Business Intelligence Development Studio used for the development and maintenance of integration services packages. The SSIS API programming module allows you to code SSIS packages using any number of programming languages.

SQL Server Reporting Services

The SQL Server Reporting Services (SSRS) is a framework of reporting mechanisms such as the Report Builder, Report Designer, Report Manager and Report Server that work together through a Web interface to enable the development of concise interactive reporting solutions in print or Web format. Report Builder and Report Designer are two SSRS components for generating reports. Report Builder is a simple solution for the information worker or business user to create quick reports without the need to understand the core structure of the data. The Report Designer is a tool for developers because it supports more complex custom report development. Using this tool requires an understanding of the Business Intelligence Development Studio shell in Visual Studio and the underlying structure of the data. According to Microsoft, the Report Server is the core process engine in SSRS that manages the processing and delivery of reports using processors. The Report Manager is an administrative tool that controls reporting services through a Web interface.

SQL Server Analysis Services

The SQL Server Analysis Services, or SSAS, is a multidimensional analysis tool that features Online Analytical Processing, powerful data mining capabilities, and deeper dimensions to business information within a relational database. Multidimensional analysis is an OLAP technique that produces the ability to analyze large quantities of data by storing data in axes and cells instead of the traditional relational two-dimensional view in rows and columns. SSAS places predictive analytic capabilities in the hands of information workers by creating an instant connection to backend data using familiar applications such as Microsoft Excel and SharePoint for analysis, visual presentation and collaboration.

When SQL Server, SSIS, SSAS and SSRS are on ONE Server:

Best practice dictates that we use a separate server for each of these SQL Server Services. And this seems logical because otherwise these services will compete over server resources. In IT we call this competition: contention.

However there are some great reasons to put these services on one box:
  • Licensing: SQL Server licensing can be expensive, and you need licenses for every server on which a service runs.
  • Better resource utilization: Fewer servers mean less power usage and lower maintenance and monitoring costs.
  • Sometimes the network is the problem, as a lot of data moves from the SQL engine to SSIS or SSAS, resulting in network congestion. If the services run on the same machine, SQL Server uses the Shared Memory Protocol, which is faster and leads to less network congestion.
In a scenario with all SQL/BI Services on one server we need some strategies to diminish the described contention. This is of course a big subject and I can only touch the basics in this post and give you some tips.

TIP 1: Limit the software and services on the server

This seems logical, but I have been to several sites where developers remote desktop into the production server to do maintenance and other jobs. This is certainly not a best practice. It is better not to install any client tools on the server and to use a separate development/test server for maintenance.

TIP 2: Get an overview of your typical daily workload

The following is a typical daily workload mapped to the services needed to perform it:
  • From 07:00 to 23:00 is the extended working-day window. Some users start early, others end late: all are extensively opening SSRS reports and querying SSAS cubes.
  • From 23:00 to 01:00 backups are running for all (source) systems. This is the period in which we do our own SSAS and SQL Server backups.
  • From 01:00 to 07:00 is our process window.
In this period:
  • we use SSIS to land data from our source systems into the staging database
  • we use SSIS to load data from our staging database into the data warehouse
  • we’ll process the dimensions and cubes of our SSAS databases.
  • we warm the cache of our SSAS database and start distributing reports using SSRS subscriptions and SSIS for large Excel reports.
With this knowledge we have a clear understanding of potential contention issues. Furthermore, we can schedule services to stop and start on an as-needed basis, using the operating system commands NET START and NET STOP.
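As a hedged example, a SQL Server Agent job step could wrap these commands in xp_cmdshell (which must be enabled); the quoted service display name below is an assumption and depends on your instance and version:

-- Stop Reporting Services while it is not needed during the process window.
EXEC master.dbo.xp_cmdshell 'NET STOP "SQL Server Reporting Services (MSSQLSERVER)"';
-- ...later, before the SSRS subscriptions start, bring it back up.
EXEC master.dbo.xp_cmdshell 'NET START "SQL Server Reporting Services (MSSQLSERVER)"';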

TIP 3 Set Maximum Memory Setting for SQL Server

With the max server memory setting you can place a limit on the amount of buffer pool memory used by SQL Server. This leads to the question: to what amount should we limit the buffer pool memory?
During our extended working-day window we will probably not benefit from a large buffer pool: most queries will be answered by Analysis Services. However, in the process window we will benefit from a larger buffer pool. We can change this property during the day and schedule these changes using the sp_configure system stored procedure (a sketch follows the list below). Based on 24 GB of RAM, reserve 4 GB for the operating system and:
  • During working day window set max server memory to 4 GB, reserve 8 GB for SSAS and 8 GB for SSRS.
  • During the process window set max server memory to 10 GB, reserving 10GB for SSIS and 8GB for SSAS.
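A sketch of the scheduled change, using the sp_configure option 'max server memory (MB)' and the GB values from the 24 GB example above; each script can be run from its own SQL Server Agent job:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- At the start of the working-day window: limit SQL Server to 4 GB.
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;

-- At the start of the process window: raise the limit to 10 GB.
EXEC sp_configure 'max server memory (MB)', 10240;
RECONFIGURE;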

TIP 4 Set SSAS Memory Settings

The SSAS memory settings are available in the msmdsrv.ini file and in the properties window of the server. If a value is between 0 and 100, it is interpreted as a percentage of total available physical memory; above 100 it is interpreted as bytes. Change these settings to:
  • Memory\HardMemoryLimit: from the default of 0 to 32
  • Memory\TotalMemoryLimit: from the default of 80 to 28
  • Memory\LowMemoryLimit: from the default of 65 to 24
This means that Analysis Services will start freeing up memory once it has reached the LowMemoryLimit threshold of 24% of physical memory. This process gets more aggressive as it reaches the other thresholds.
While you’re at it change some other memory properties:
  • OLAP\Process\BufferMemoryLimit from 60 to 20
  • OLAP\Process\AggregationMemoryLimitMax from 80 to 14
  • OLAP\Process\AggregationMemoryLimitMin from 10 to 4

TIP 5 Set SSRS Memory Settings

In the RSReportServer.config file add the WorkingSetMaximum property and set it to 8GB (for our 24GB example):
<WorkingSetMaximum>8000000</WorkingSetMaximum>
Restart the reporting services service, so other memory related properties that are based on WorkingSetMaximum get set.

TIP 6 Monitor and adapt the memory setting

Use the settings in tips 3, 4 and 5 as a starting point, but monitor memory usage and change these properties to match your workload.

TIP 7 Use Windows System Resource Manager to restrict CPU utilization

With Windows System Resource Manager (WSRM) you can create custom resource allocation policies to restrict resources (CPU, RAM) to an application. You can map your workload windows to policies and define these policies e.g.:
  • policy ProcessData which runs when Analysis Services is processing data, set the CPU percentage for SQL to 45% and 45% for SSAS
  • policy ProcessIndex which runs when Analysis Services is processing aggregations, set the CPU percentage for SQL to 10% and 80% for SSAS
Next you can setup calendar rules to apply the different policies to the correct time frames.
However, this has the drawback that you have two calendar schemes, one in SQL Server Agent and one in WSRM, that you have to keep in sync. Furthermore, it is possible that there aren't any CPU contention issues at all. My advice would be to monitor CPU usage and plan resource allocation policies, if necessary, based on the monitoring results.

TIP 8 Don’t write to the same disk you’re reading from

In the daily process window we will typically load large amounts of data into our data warehouse:
  • We load data from the source system in a staging database.
  • From the staging database we load the data in the data warehouse.
  • From the data warehouse we load the Analysis Services cubes.
In this scenario, put the data warehouse database files on a different drive than the staging database and the Analysis Services cubes.

Conclusion

In this post I focused on diminishing contention issues when using one server for your BI solution running SQL Server, Analysis Services, Integration Services and Reporting Services. The key is understanding your workload and the roles the different applications play, as well as the properties you can tweak to diminish contention.

Thursday, 21 February 2013

SQL Server Integration Services (SSIS) Development Environment

SQL Server Integration Services (SSIS) is a tool that we use to perform ETL operations; i.e. extract, transform and load data.  While ETL processing is common in data warehousing (DW) applications, SSIS is by no means limited to just DW; e.g. when you create a Maintenance Plan using SQL Server Management Studio (SSMS) an SSIS package is created.  At a high level, SSIS provides the ability to:
  • retrieve data from just about any source
  • perform various transformations on the data; e.g. convert from one type to another, convert to uppercase or lowercase, perform calculations, etc.
  • load data into just about any source
  • define a workflow
The first version of SSIS was released with SQL Server 2005.  SSIS is a replacement for Data Transformation Services (DTS) which was available with SQL Server 7.0 and SQL Server 2000.  SSIS builds on the capabilities introduced with DTS.
In this tutorial we will step through a number of topics that you need to understand in order to successfully build an SSIS package.  Our high level outline is as follows:
  • Creating SSIS packages with SQL Server Management Studio (SSMS)
  • Business Intelligence Development Studio (BIDS)
  • Creating a simple SSIS package in BIDS
  • Deploying SSIS packages
  • Executing SSIS packages
This article is a bare bones introduction to SQL Server Integration Services (SSIS), with an emphasis on the SSIS development environment. SSIS is Microsoft's full feature extract-transform-load (ETL) tool in SQL Server 2005, 2008, and R2.  SSIS is used to extract data from any of a myriad of sources, such as SQL Server databases, flat files, Excel files, Oracle and DB2 databases, etc. It  performs data manipulation and transformation on the data and delivers it to one or more data destinations.

While SSIS is part of the SQL Server suite of tools, and is very adept at handling SQL Server-to-SQL Server ETL operations, it easily handles many other sources and destinations for which you have a provider. Potential applications include scrubbing and exporting Excel data for output to network directories, or third-party database ETL transformations (e.g., Oracle-to-Oracle data loads). Let's create a new SSIS package and take a brief look at the major components and features of the SSIS development environment.


Visual Studio vs. BIDS

The SSIS development environment can be installed during a SQL Server installation or independently, i.e., you don't have to have the SQL Server engine installed to develop SSIS packages. The SSIS development environment consists of the Business Intelligence Development Studio (BIDS), a full-fledged Visual Studio installation with components specific to SSIS and other elements comprising SQL Server business intelligence. If you already have Visual Studio 2008 installed on your computer, the SSIS installation will install a separate BIDS application and add components to your existing Visual Studio. You can develop in either environment, as they are identical.

Important Panes

Like all files in Visual Studio, all SSIS packages are part of a VS project, which is in turn part of a solution. To create a new project and a new SSIS package:
  1. Start BIDS or Visual Studio
  2. Select File > New > Project
  3. Select "Integration Services Project" from Business Intelligence Projects
  4. Name the project and select the location. Click OK.
(Image: New Project Window in Visual Studio/BIDS)
The new project will display a new, blank package. Your most valuable windows are:
  • Toolbox - Contains the elements you can add to your package. The contents of the Toolbox change depending on what main window tab you're in.
  • Main Window - The development window. Each tab shows a different level or focus of the development. Today we are only interested in two of those tabs.
    • Control Flow tab - Contains control flow elements. Control flow is anything that moves in the package - anything that causes you to go from one step to another. Examples include moving or deleting a file, creating indexes, etc.
    • Data Flow tab - Data flow is anything that physically moves data. All sources, destinations, and transformations are included in data flow.
  • Connection Managers - In SSIS, there is a single connection manager for each connection.  When you create a data source, you assign it a connection manager. After that, if the connection manager changes, all data sources that use it are affected as well. That wasn't the case in DTS (the ETL tool prior to SQL Server 2005); if you had to change a connection, you had to go to each and every task and change it there individually.
  • Solution Explorer - This displays the project elements - the package itself, data sources, other packages in the project, etc.
  • Properties - This is a context sensitive menu that displays the properties for whatever is currently selected, from the package itself down to individual components.
In the Solution Explorer, right-click and rename your package to something descriptive, like "MyFirstPackage.dtsx".  Answer Yes to the popup window "Do you want to rename the package object as well?"  You can also add existing packages to your project, or import packages from the file system or from a SQL Server.
(Image: SSIS Main Window in Visual Studio/BIDS)

Tabs: Control Flow vs. Data Flow

In DTS (the SQL Server 2000 predecessor to SSIS), control flow and data flow were the same thing, which caused a lot of problems - especially with sequencing tasks. In SSIS, control flow and data flow are viewed and controlled separately, using two tabbed windows. Control flow is anything that controls or moves in the package - anything that causes you to go from one step to another. A few examples include loops (like For and For Each), scripting and FTP, sending mail, creating indexes, or data profiling.  Notice that "Data Flow Task" is an element in the Control Flow's Toolbox pane. Data flow is anything that physically moves and manipulates the data in the SSIS pipeline (as opposed to running a SQL command that runs on the SQL Server). Data sources, destinations, and transformations in the data flow allow you to:
  • pull data from one or many disparate sources
  • convert data types
  • clean up "dirty" data
  • split or merge data from one or more sources
  • and much more.

FAQ - SSIS Interview Questions


What is SQL Server Integration Services (SSIS)?

  • SQL Server Integration Services (SSIS) is a component of SQL Server 2005 and later versions. SSIS is an enterprise-scale ETL (Extraction, Transformation and Load) tool which allows you to develop data integration and workflow solutions. Apart from data integration, SSIS can be used to define workflows to automate updating multi-dimensional cubes and to automate maintenance tasks for SQL Server databases.
  • It is a platform for integration and workflow applications, known as a fast and flexible tool for data extraction, transformation, and loading (ETL). The tool may also be used to automate maintenance of SQL Server databases and multidimensional data sets.

How does SSIS differ from DTS?

  • SSIS is a successor to DTS (Data Transformation Services) and has been completely re-written from scratch to overcome the limitations of DTS which was available in SQL Server 2000 and earlier versions. A significant improvement is the segregation of the control/work flow from the data flow and the ability to use a buffer/memory oriented architecture for data flows and transformations which improve performance.
Data Transformation Services               SQL Server Integration Services
Limited Error Handling                     Complex and powerful Error Handling
Message Boxes in ActiveX Scripts           Message Boxes in .NET Scripting
No Deployment Wizard                       Interactive Deployment Wizard
Limited Set of Transformations             Good number of Transformations
No BI functionality                        Complete BI Integration


What is a workflow in SSIS ?
  • A workflow is a set of instructions that specifies to the program executor how to execute the tasks and containers within an SSIS package.


What is the Control Flow?

  • When you start working with SSIS, you first create a package which is nothing but a collection of tasks or package components.  The control flow allows you to order the workflow, so you can ensure tasks/components get executed in the appropriate order.


What is the Data Flow Engine?

  • The Data Flow Engine, also called the SSIS pipeline engine, is responsible for managing the flow of data from the source to the destination and performing transformations (lookups, data cleansing etc.).  Data flow uses memory oriented architecture, called buffers, during the data flow and transformations which allows it to execute extremely fast. This means the SSIS pipeline engine pulls data from the source, stores it in buffers (in-memory), does the requested transformations in the buffers and writes to the destination. The benefit is that it provides the fastest transformation as it happens in memory and we don't need to stage the data for transformations in most cases.


What is a Transformation?

  •  A transformation simply means bringing the data into a desired format. For example, you are pulling data from the source and want to ensure only distinct records are written to the destination, so duplicates are removed. Another example is if you have master/reference data and want to pull only related data from the source, and hence you need some sort of lookup. There are around 30 transformation tasks available, and this can be extended further with custom-built tasks if needed.


What is a Task?

  • A task is very much like a method of any programming language which represents or carries out an individual unit of work. There are broadly two categories of tasks in SSIS, Control Flow tasks and Database Maintenance tasks. All Control Flow tasks are operational in nature except Data Flow tasks. Although there are around 30 control flow tasks which you can use in your package you can also develop your own custom tasks with your choice of .NET programming language.

What is a Precedence Constraint and what types of Precedence Constraint are there?

  • SSIS allows you to place as many tasks as you want in the control flow. You can connect all these tasks using connectors called Precedence Constraints. Precedence Constraints allow you to define the logical sequence of tasks in the order they should be executed. You can also specify a condition to be evaluated before the next task in the flow is executed.
  • These are the types of precedence constraints; the evaluation can be based on a constraint, an expression, or both:
    • Success (next task will be executed only when the last task completed successfully) or
    • Failure (next task will be executed only when the last task failed) or
    • Complete (the next task will be executed no matter whether the last task completed or failed).

What is a container and how many types of containers are there?

  • A container is a logical grouping of tasks which allows you to manage the scope of the tasks together.
  • These are the types of containers in SSIS:
    • Sequence Container - Used for grouping logically related tasks together
    • For Loop Container - Used when you want to have repeating flow in package
    • For Each Loop Container - Used for enumerating each object in a collection; for example a record set or a list of files.
  • Apart from the above mentioned containers, there is one more container called the Task Host Container which is not visible from the IDE, but every task is contained in it (the default container for all the tasks).

What are variables and what is variable scope?

  • A variable is used to store values. There are basically two types of variables, System Variable (like ErrorCode, ErrorDescription, PackageName etc) whose values you can use but cannot change and User Variable which you create, assign values and read as needed. A variable can hold a value of the data type you have chosen when you defined the variable.
  • Variables can have a different scope depending on where they are defined. For example you can have package-level variables which are accessible to all the tasks in the package, and there can also be container-level variables which are accessible only to those tasks that are within the container.
How many different sources and destinations have you used?
  •  It is very common to get all kinds of sources, so the more the person has worked with, the better. Common ones are SQL Server, CSV/TXT flat files, Excel, Access, Oracle and MySQL, but also Salesforce and web data scraping.

What configuration options have you used?
  •  This is an important one. Configuration should always be dynamic and is usually done using XML configuration files, environment variables and/or a SQL Server table holding all configurations.

How do you apply business rules in SSIS (transformations: specific calculations but also cleansing)?
  •  Some people use SSIS only to extract data and then rely on stored procedures only; they are usually missing the power of SSIS, which allows you to create "a flow" and apply certain rules at each step. This greatly simplifies the ETL process, and simplicity is very good.

 How do you quickly load data into a SQL Server table?
  •  Use the Fast Load option. This option is not set by default, so most developers know this answer, as otherwise the load is very slow.

Can you give an example of handling data quality issues?
  •  Data quality is almost always a problem and SSIS handles it very well. Examples include importing customers from different sources where the customer name can be duplicated. For instance, you can have as the company name: SQL Server Business Intelligence, but also SQL Server BI, SQL Server BI LTD, SQL Server BI Limited, or intelligence (with one l). There are different ways to handle it. A robust but time-consuming approach is to create a table with all possible scenarios and update it after each load. You can also use fuzzy grouping, which is usually easy to implement and will usually make very good decisions, but it is not 100% accurate, so this approach has to be justified. Other typical quality issues are nulls (missing values), outliers (dates like 2999, or typos like 50000 instead of 5000, which is especially important if someone is adjusting a value to get a bigger bonus) and incorrect addresses; these are either corrected during ETL, ignored, redirected for further manual updates, or they fail the package, which for big processes is usually not practised.

 When to use Stored Procedures?
  •  This was one of the questions requested in the comments (at the bottom of the page). This one is very important but also tricky. All SSIS developers have a SQL Server background, and that is sometimes not very good if they use a SQL rather than an SSIS approach.
Let's start with when you typically use SPs. This is for preparing tables (truncate), audit tasks (usually part of SSIS framework), getting configuration values for loops and a few other general tasks.
During the ETL extract you usually write simple SQL because the data comes from other sources, and over-complicating the query (for example making it dynamic) is usually not a good choice, because any change typically affects the package, which then has to be updated as well.
During the transformation phase (business rules, cleaning, core work) you should use transformation tasks, not stored procedures! There are loads of tasks that make the package much easier to develop, and another very important reason is readability, which matters a great deal for other people who need to change the package and obviously reduces the risk of making errors. Performance is usually very good with SSIS, as it is a memory/flow-based approach. So when should you use stored procedures for transformations? If you don't have strong SSIS developers, or if you have performance reasons to do so. In some cases SPs can be much faster (usually this only applies to very large datasets). Most important is to have reasons why one approach is better for the situation.

What is your approach for ETL with data warehouses (how many packages do you develop during a typical load, etc.)?
  • This is a rather generic question. A typical approach (for me) when building ETL is to have one package per source that extracts the data, applies source-specific transformations (lookups, business rules, cleaning) and loads it into a staging table. Then a package does a simple merge from staging to the data warehouse (stored procedure), or a package takes data from staging and performs extra work before loading it into the data warehouse. I prefer the first one, and because of this approach I occasionally consider having an extract phase (as well as a staging phase), which gives me more flexibility with transformations (per source) and makes it simpler to follow (not everything in one go). To summarize, you usually have one package per source and one package per data warehouse destination table. Other approaches may be valid as well, so ask for the reasoning.

Tuesday, 19 February 2013

SSIS - Top 10 SQL Server Integration Services Best Practices

“Does your system need to scale beyond 4.5 million sales transaction rows per second?” 

SQL Server Integration Services is a high performance Extract-Transform-Load (ETL) platform that scales to the most extreme environments. And as documented in SSIS ETL world record performance, SQL Server Integration Services can process at the scale of 4.5 million sales transaction rows per second.


1.  SSIS is an in-memory pipeline, so ensure that all transformations occur in memory.
The purpose of having Integration Services within the SQL Server feature set is to provide a flexible, robust pipeline that can efficiently perform row-by-row calculations and parse data all in memory.

While the extract and load phases of the pipeline will touch disk (read and write respectively), the transformation itself should process in memory. If transformations spill to disk (for example with large sort operations), you will see a big performance degradation. Construct your packages to partition and filter data so that all transformations fit in memory.

A great way to check if your packages are staying within memory is to review the SSIS performance counter Buffers spooled, which has an initial value of 0; above 0 is an indication that the engine has started swapping to disk. 

2.  Plan for capacity by understanding resource utilization.
SQL Server Integration Services is designed to process large amounts of data row by row in memory with high speed. Because of this, it is important to understand resource utilization, i.e., the CPU, memory, I/O, and network utilization of your packages.

CPU Bound
Seek to understand how much CPU is being used by Integration Services and how much CPU is being used overall by SQL Server while Integration Services is running. This latter point is especially important if you have SQL Server and SSIS on the same box, because if there is a resource contention between these two, it is SQL Server that will typically win – resulting in disk spilling from Integration Services, which slows transformation speed. 

The perfmon counter that is of primary interest to you is Process / % Processor Time (Total). Measure this counter for both sqlservr.exe and dtexec.exe. If SSIS is not able to drive close to 100% CPU load, this may be indicative of: 
  • Application contention: For example, SQL Server is taking on more processor resources, making them unavailable to SSIS.
  • Hardware contention: A common scenario is that you have suboptimal disk I/O or not enough memory to handle the amount of data being processed.
  • Design limitation: The design of your SSIS package is not making use of parallelism, and/or the package uses too many single-threaded tasks.

Network Bound
SSIS moves data as fast as your network is able to handle it. Because of this, it is important to understand your network topology and ensure that the path between your source and target have both low latency and high throughput. 

The following Network perfmon counters can help you tune your topology:
  • Network Interface / Current Bandwidth: This counter provides an estimate of current bandwidth.
  • Network Interface / Bytes Total / sec: The rate at which bytes are sent and received over each network adapter.
  • Network Interface / Transfers/sec: Tells how many network transfers per second are occurring. If it is approaching 40,000 IOPs, then get another NIC card and use teaming between the NIC cards.
These counters enable you to analyze how close you are to the maximum bandwidth of the system. Understanding this will allow you to plan capacity appropriately whether by using gigabit network adapters, increasing the number of NIC cards per server, or creating separate network addresses specifically for ETL traffic.

I/O Bound
If you ensure that Integration Services is minimally writing to disk, SSIS will only hit the disk when it reads from the source and writes to the target. But if your I/O is slow, reading and especially writing can create a bottleneck.

Remember that an I/O system is not only specified by its size ( “I need 10 TB”) – but also by its sustainable speed (“I want 20,000 IOPs”).

Memory bound
A very important question that you need to answer when using Integration Services is: “How much memory does my package use?”

The key counters for Integration Services and SQL Server are:
  • Process / Private Bytes (DTEXEC.exe) – The amount of memory currently in use by Integration Services. This memory cannot be shared with other processes.
  • Process / Working Set (DTEXEC.exe) – The total amount of allocated memory by Integration Services.
  • SQL Server: Memory Manager / Total Server Memory: The total amount of memory allocated by SQL Server. Because SQL Server has another way to allocate memory using the AWE API, this counter is the best indicator of total memory used by SQL Server. To understand SQL Server memory allocations better, refer to Slava Ok's weblog.
  • Memory / Page Reads / sec – Represents the total memory pressure on the system. If this consistently goes above 500, the system is under memory pressure.
3.  Baseline source system extract speed.
Understand your source system and how fast you extract from it. After all, Integration Services cannot be tuned beyond the speed of your source – i.e., you cannot transform data faster than you can read it.
Measure the speed of the source system by creating a very simple package that reads data from your source with a destination of "Row Count":
Execute the package from the command line (DTEXEC) and measure the time it took for it to complete its task. Use the Integration Services log output to get an accurate calculation of the time. You want to calculate rows per second:
Rows / sec = Row Count / Time
Based on this value, you now know the maximum number of rows per second you can read from the source – this is also the roof on how fast you can transform your data. To increase this Rows / sec calculation, you can do the following:

  • Improve drivers and driver configurations: Make sure you are using the most up-to-date driver configurations for your network, data source, and disk I/O. Often the default network drivers on your server are not configured optimally for the network stack, which results in performance degradations when there are a high number of throughput requests. Note that for 64-bit systems, at design time you may be loading 32-bit drivers; ensure that at run time you are using 64-bit drivers.
     
  • Start multiple connections: To overcome limitations of drivers, you can try to start multiple connections to your data source. As long as the source can handle many concurrent connections, you may see an increase in throughput if you start several extracts at once. If concurrency is causing locking or blocking issues, consider partitioning the source and having your packages read from different partitions to more evenly distribute the load.
     
  • Use multiple NIC cards: If the network is your bottleneck and you’ve already ensured that you’re using gigabit network cards and routers, then a potential solution is to use multiple NIC cards per server. Note that you will have to be careful when you configure multiple NIC environments; otherwise you will have network conflicts.

    4.  Optimize the SQL data source, lookup transformations, and destination.
    When you execute SQL statements within Integration Services (for example via the Data access mode option of a source), whether to read a source, to perform a lookup transformation, or to change tables, some standard optimizations significantly help performance:
    •  Use the NOLOCK or TABLOCK hints to remove locking overhead.
    • To optimize memory usage, SELECT only the columns you actually need. If you SELECT all columns from a table (e.g., SELECT * FROM) you will needlessly use memory and bandwidth to store and retrieve columns that do not get used.
       
    • If possible, perform your datetime conversions at your source or target databases, as it is more expensive to perform within Integration Services.
       
    • In SQL Server 2008 Integration Services, there is a new feature of the shared lookup cache. When using parallel pipelines (see points #8 and #10 below), it provides a high-speed, shared cache.
       
    • If Integration Services and SQL Server run on the same server, use the SQL Server destination instead of the OLE DB destination to improve performance.
       
    • Commit size 0 is fastest on heap bulk targets, because only one transaction is committed. If you cannot use 0, use the highest possible value of commit size to reduce the overhead of multiple-batch writing.  Commit size = 0 is a bad idea if inserting into a Btree – because all incoming rows must be sorted at once into the target Btree—and if your memory is limited, you are likely to spill.  Batchsize = 0 is ideal for inserting into a heap. For an indexed destination, I recommend testing between 100,000 and 1,000,000 as batch size.
       
    • Use a commit size of <5000 to avoid lock escalation when inserting; note that in SQL Server 2008 you can now enable/disable lock escalation at the object level, but use this wisely.
       
    • Heap inserts are typically faster than using a clustered index. This means that you may want to drop indexes and rebuild if you are changing a large part of the destination table; you will want to test your inserts both by keeping indexes in place and by dropping all indexes and rebuilding, to validate.
       
    • Use partitions and the partition SWITCH command; i.e., load a work table that contains a single partition and SWITCH it into the main table after you build the indexes and put the constraints on (see the sketch below).
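    As a rough sketch of a few of the points above (all object names are hypothetical): a lean source query that selects only the needed columns with a locking hint, followed by a partition SWITCH of a prepared work table into the main fact table.

    -- Source query for the SSIS data source: only the needed columns,
    -- with a hint to avoid locking overhead (acceptable for reporting loads).
    SELECT OrderDate, CustomerCode, SalesAmount
    FROM   dbo.SourceSales WITH (NOLOCK);

    -- After the work table has been loaded, indexed and constrained,
    -- switch it into partition 5 of the main table (a metadata-only operation).
    ALTER TABLE dbo.FactSales_Work
        SWITCH TO dbo.FactSales PARTITION 5;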

    5.  Tune your network.
    A key network property is the packet size of your connection. By default this value is set to 4,096 bytes. This means a new network packet must be assembled for every 4 KB of data. As noted in SqlConnection.PacketSize Property in the .NET Framework Class Library, increasing the packet size will improve performance because fewer network read and write operations are required to transfer a large data set.
    If your system is transactional in nature, with many small data size read/writes, lowering the value will improve performance.

    Since Integration Services is all about moving large amounts of data, you want to minimize the network overhead. This means that the value 32K (32767) is the fastest option. While it is possible to configure the network packet size on a server level using sp_configure, you should not do this; the database administrator may have reasons to use a different server setting than 32K. Instead, override the server setting in the connection manager.
    Another network tuning technique is to use network affinity at the operating system level. At high throughputs, you can sometimes improve performance this way.

    For the network itself, you may want to work with your network specialists to enable jumbo frames to increase the default payload of 1,500 bytes to 9,000 bytes. By enabling jumbo frames, you will further decrease the number of network operations required to move large data sets.

    6.  Use data types – yes, back to data types! –wisely.
    Of all the points on this top 10 list, this is perhaps the most obvious. Yet, it is such an important point that it needs to be made separately. Follow these guidelines:
    •  Make data types as narrow as possible so you will allocate less memory for your transformation.
       
    • Do not perform excessive casting of data types – it will only degrade performance. Match your data types to the source or destination and explicitly specify the necessary data type casting.
       
    • Watch precision issues when using the money, float, and decimal types. Also, be aware that money is faster than decimal, and money has fewer precision considerations than float.
    7.  Change the design.
    There are some things that Integration Services does well – and other tasks where using another tool is more efficient. Your tool choice should be based on what is most efficient and on a true understanding of the problem. To help with that choice, consider the following points:
    •  Do not sort within Integration Services unless it is absolutely necessary. In order to perform a sort, Integration Services allocates the memory space of the entire data set that needs to be transformed. If possible, presort the data before it goes into the pipeline. If you must sort data, try your best to sort only small data sets in the pipeline. Instead of using Integration Services for sorting, use an SQL statement with ORDER BY to sort large data sets in the database – mark the output as sorted by changing the Integration Services pipeline metadata on the data source.
       
    • There are times where using Transact-SQL will be faster than processing the data in SSIS. As a general rule, any and all set-based operations will perform faster in Transact-SQL because the problem can be transformed into a relational (domain and tuple) algebra formulation that SQL Server is optimized to resolve. Also, the SQL Server optimizer will automatically apply high parallelism and memory management to the set-based operation – an operation you may have to perform yourself if you are using Integration Services. Typical set-based operations include:
       
      • Set-based UPDATE statements - which are far more efficient than row-by-row OLE DB calls.
         
      • Aggregation calculations such as GROUP BY and SUM. These are typically also calculated faster using Transact-SQL instead of in-memory calculations by a pipeline.
         
    • Delta detection is the technique where you change existing rows in the target table instead of reloading the table. To perform delta detection, you can use a change detection mechanism such as the new SQL Server 2008 Change Data Capture (CDC) functionality. If such functionality is not available, you need to do the delta detection by comparing the source input with the target table (a set-based sketch follows below). This can be a very costly operation requiring the maintenance of special indexes and checksums just for this purpose. Often, it is fastest to just reload the target table. A rule of thumb is that if the target table has changed by >10%, it is often faster to simply reload than to perform the logic of delta detection.
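    For illustration, a set-based, comparison-style delta detection UPDATE in Transact-SQL might look like this (object names are hypothetical); a single statement replaces row-by-row OLE DB calls.

    -- Update only the target rows whose attributes differ from the source.
    UPDATE t
    SET    t.CustomerName = s.CustomerName,
           t.City         = s.City
    FROM   dbo.DimCustomer  AS t
    JOIN   dbo.Stg_Customer AS s
           ON s.CustomerCode = t.CustomerCode
    WHERE  t.CustomerName <> s.CustomerName
       OR  t.City         <> s.City;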
    8.  Partition the problem.
    One of the main tenets of scalable computing is to partition problems into smaller, more manageable chunks. This allows you to more easily handle the size of the problem and make use of running parallel processes in order to solve the problem faster.

    For ETL designs, you will want to partition your source data into smaller chunks of equal size. This latter point is important because if you have chunks of different sizes, you will end up waiting for one process to complete its task. For example, with four processes executed on partitions of equal size, the four processes will finish processing January 2008 at the same time and then together continue to process February 2008. But with partitions of different sizes, the first three processes will finish and then wait for the fourth process, which takes much longer. The total run time will be dominated by the largest chunk.

    To create ranges of equal-sized partitions, use time period and/or dimensions (such as geography) as your mechanism to partition. If your primary key is an incremental value such as an IDENTITY or another increasing value, you can use a modulo function. If you do not have any good partition columns, create a hash of the value of the rows and partition based on the hash value.
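    A sketch of modulo-based partitioning on an incremental key; @PartitionCount and @PartitionNumber would be passed in as package parameters, and all names are hypothetical.

    DECLARE @PartitionCount  INT = 4;
    DECLARE @PartitionNumber INT = 0;   -- 0..3, one value per package execution

    -- Each execution of the package reads only its own slice of the source.
    SELECT OrderID, OrderDate, CustomerCode, SalesAmount
    FROM   dbo.SourceSales
    WHERE  OrderID % @PartitionCount = @PartitionNumber;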
    Some other partitioning tips:
    • Use partitioning on your target table. This way you will be able to run multiple versions of the same package, in parallel, that insert data into different partitions of the same table. When using partitioning, the SWITCH statement is your friend. It not only increases parallel load speeds, but also allows you to efficiently transfer data. Please refer to the SQL Server Books Online article Transferring Data Efficiently by Using Partition Switching for more information.
       
    • As implied above, you should design your package to take a parameter specifying which partition it should work on. This way, you can have multiple executions of the same package, all with different parameter and partition values, so you can take advantage of parallelism to complete the task faster.
       
    • From the command line, you can run multiple executions by using the “START” command.
    9.  Minimize logged operations.
    When you insert data into your target SQL Server database, use minimally logged operations if possible. When data is inserted into the database in fully logged mode, the log will grow quickly because each row entering the table also goes into the log.

    Therefore, when designing Integration Services packages, consider the following:
    • Try to perform your data flows in bulk mode instead of row by row. By doing this in bulk mode, you will minimize the number of entries that are added to the log file. This reduction will improve the underlying disk I/O for other inserts and will minimize the bottleneck created by writing to the log.
       
    • If you need to perform delete operations, organize your data in a way that lets you TRUNCATE the table instead of running a DELETE. The latter places an entry in the log for each row deleted, whereas the former simply removes all of the data in the table with a small log entry representing the fact that the TRUNCATE occurred. Contrary to popular belief, a TRUNCATE statement can participate in a transaction (a sketch follows this list).
       
    • Use the SWITCH statement and partitioning. If partitions need to be moved around, you can use the SWITCH statement (to switch in a new partition or switch out the oldest partition), which is a minimally logged statement.
       
    • Be careful when using DML statements; if you mix in DML statements within your INSERT statements, minimum logging is suppressed.
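    A rough sketch of the bulk-mode and TRUNCATE points above, assuming a heap destination and a simple or bulk-logged recovery model (all names are hypothetical):

    -- Bulk insert with a table lock, which allows minimal logging on a heap.
    INSERT INTO dbo.FactSales WITH (TABLOCK)
            (OrderDate, CustomerCode, SalesAmount)
    SELECT  OrderDate, CustomerCode, SalesAmount
    FROM    dbo.Stg_Sales;

    -- Clear the staging table with a small log entry instead of a row-by-row DELETE.
    TRUNCATE TABLE dbo.Stg_Sales;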
    10.  Schedule and distribute it correctly.
    After your problem has been chunked into manageable sizes, you must consider where and when these chunks should be executed. The goal is to avoid one long running task dominating the total time of the ETL flow.

    A good way to handle execution is to create a priority queue for your package and then execute multiple instances of the same package (with different partition parameter values). The queue can simply be a SQL Server table. Each package should include a simple loop in the control flow:
    1.  Pick a relevant chunk from the queue:
      1. "Relevant" means that it has not already been processed and that all chunks it depends on have already run.
      2. If no item is returned from the queue, exit the package.
    2. Perform the work required on the chunk.
    3. Mark the chunk as “done” in the queue.
    4. Return to the start of the loop.
    Picking an item from the queue and marking it as "done" (steps 1 and 3 above) can be implemented as a stored procedure, for example.
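    A sketch of what such a queue table and "pick the next chunk" procedure could look like (names and columns are hypothetical):

    CREATE TABLE dbo.EtlQueue (
        ChunkId      INT IDENTITY(1,1) PRIMARY KEY,
        PartitionKey NVARCHAR(50) NOT NULL,      -- e.g. '2008-01' or a hash bucket
        Status       TINYINT NOT NULL DEFAULT 0  -- 0 = pending, 1 = in progress, 2 = done
    );
    GO

    CREATE PROCEDURE dbo.GetNextChunk
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Atomically claim one pending chunk and return it to the calling package;
        -- READPAST lets parallel package instances skip rows already claimed.
        UPDATE TOP (1) q
        SET    q.Status = 1
        OUTPUT inserted.ChunkId, inserted.PartitionKey
        FROM   dbo.EtlQueue AS q WITH (READPAST, ROWLOCK, UPDLOCK)
        WHERE  q.Status = 0;
    END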

    The queue acts as a central control and coordination mechanism, determining the order of execution and ensuring that no two packages work on the same chunk of data. Once you have the queue in place, you can simply start multiple copies of DTEXEC to increase parallelism.