# Tuesday, June 30, 2009

Whilst doing some design work today for a customer project, I realised there is a set of principles I try to adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete, but this is what I have so far.

Minimise IO

This is a general data processing principle. Disk performance and, to a lesser extent, network performance usually determine the overall processing speed, so reducing the amount of IO in a solution will increase performance.

Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.

Prefer Sequential IO to Random IO

Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) appears when procedural-style programming is used. The sign to look out for is a SQL statement that modifies or returns only a few rows but is executed repeatedly.

Watch out for hidden random IO. For example, if you are reading from one table and writing to another in a sequential manner, disk access will still be random if both tables are stored on the same spindles.

Avoid data flow components that pool data

Data flow components work on batches of data called buffers. In most instances buffers are modified in place and passed downstream. Some components, such as "Sort", cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low-memory situations). This increased memory pressure will affect performance.
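As a rough analogy only (this is plain C#, not the SSIS pipeline API, and the class and method names are made up for illustration), the difference between a streaming step and a pooling step can be sketched with iterators. The first method hands each row on as soon as it is processed; the second cannot emit anything until it has read the whole input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class PipelineSketch
{
    // Streaming step: each row is transformed and passed on immediately,
    // so only one row is held at a time.
    public static IEnumerable<int> Doubled(IEnumerable<int> rows)
    {
        foreach (var row in rows)
            yield return row * 2;
    }

    // Pooling step: nothing can be emitted until every input row has been
    // read, so the whole stream sits in memory first.
    public static IEnumerable<int> Sorted(IEnumerable<int> rows)
    {
        var all = rows.ToList();   // holds the entire input
        all.Sort();
        return all;
    }

    static void Main()
    {
        var source = new[] { 3, 1, 2 };
        Console.WriteLine(string.Join(",", Doubled(source)));          // 6,2,4
        Console.WriteLine(string.Join(",", Sorted(Doubled(source))));  // 2,4,6
    }
}
```

The memory cost of the second method grows with the size of the input, which is the same pressure a pooling data flow component puts on the buffer pool.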

Sometimes SQL is the better solution

Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.

SSIS does not handle hierarchical data well

Integration Services is a tabular data processing system. Buffers are tabular, and the components and associated APIs are tabular. Consequently it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component, but its output is a collection of tabular data streams that need to be joined to make sense.

Execute SSIS close to where you wish to write your data

Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise over a network protocol. In particular, when writing data to a local SQL Server instance, SSIS automatically uses the Shared Memory transport for direct inter-process transfer.

Don't mess with the data flow metadata at runtime

It's very difficult to do this anyway, but it is worth mentioning that SSIS gets its stellar performance from being able to set up a data flow at runtime, safe in the knowledge that buffers are of a fixed format and component dependencies will not change.

The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS APIs and not attempt to write the package XML directly.

This posting is provided "AS IS" with no warranties, and confers no rights.
posted on Tuesday, June 30, 2009 7:23:10 PM (GMT Daylight Time, UTC+01:00)
# Thursday, April 16, 2009

Every data warehouse needs a date dimension, and at some point it needs to be populated. Most people use some sort of SQL script that loops through the dates and adds rows to the destination table, but this is pretty slow to execute. You might even try cross joining year, month and day temporary tables to produce a set-based solution, but don’t forget to filter out the invalid dates (such as 30 February).
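To make the cross join idea concrete, here is a minimal sketch of the same logic written as a LINQ query rather than as SQL against temporary tables. The class and method names are hypothetical; the point is only the filter that discards day numbers a given month does not have.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CrossJoinDates
{
    // Cross join years x months x days, then drop combinations that are not real dates.
    public static IEnumerable<DateTime> DatesInYears(int firstYear, int yearCount)
    {
        return
            from year in Enumerable.Range(firstYear, yearCount)
            from month in Enumerable.Range(1, 12)
            from day in Enumerable.Range(1, 31)
            where day <= DateTime.DaysInMonth(year, month)   // filters out e.g. 30 February
            select new DateTime(year, month, day);
    }

    static void Main()
    {
        Console.WriteLine(DatesInYears(2008, 1).Count());  // 366 (leap year)
        Console.WriteLine(DatesInYears(2009, 1).Count());  // 365
    }
}
```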

I prefer to fill my date tables by generating the correct stream of values from a SQL Server Integration Services script source component. This has a number of benefits:

  • It executes very quickly
  • The data can be bulk loaded
  • CultureInfo supplies the correct translations of day and month names
  • It is easy to add custom columns such as fiscal years and quarters
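As a sketch of the last two points, the helpers below show how day and month names can be read from a CultureInfo and how a fiscal year and quarter might be derived. The helper names and the fiscal rule (a fiscal year named after the calendar year in which it ends) are assumptions for illustration; your company's rule may well differ.

```csharp
using System;
using System.Globalization;

static class DateAttributes
{
    // Day and month names come from the culture, so no lookup tables are needed.
    public static string DayName(DateTime date, CultureInfo culture)
    {
        return culture.DateTimeFormat.GetDayName(date.DayOfWeek);
    }

    public static string MonthName(DateTime date, CultureInfo culture)
    {
        return culture.DateTimeFormat.GetMonthName(date.Month);
    }

    // Assumed fiscal rule: the fiscal year starts on the first day of
    // fiscalStartMonth and is named after the calendar year in which it ends.
    public static int FiscalYear(DateTime date, int fiscalStartMonth)
    {
        if (fiscalStartMonth == 1) return date.Year;
        return date.Month >= fiscalStartMonth ? date.Year + 1 : date.Year;
    }

    // Quarter (1 to 4) within the fiscal year.
    public static int FiscalQuarter(DateTime date, int fiscalStartMonth)
    {
        int monthOfFiscalYear = (date.Month - fiscalStartMonth + 12) % 12;  // 0 to 11
        return monthOfFiscalYear / 3 + 1;
    }

    static void Main()
    {
        var date = new DateTime(2009, 6, 30);
        Console.WriteLine(DayName(date, CultureInfo.InvariantCulture));    // Tuesday
        Console.WriteLine(MonthName(date, CultureInfo.InvariantCulture));  // June
        Console.WriteLine(FiscalYear(date, 7));     // 2009
        Console.WriteLine(FiscalQuarter(date, 7));  // 4
    }
}
```

Passing a different culture, for example one created with new CultureInfo("fr-FR"), returns the names in that language where the runtime has the culture data available.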

I haven’t wrapped this in a pre-compiled component as it is so easy to do in script form. Also, I haven’t got around to generalizing the fiscal date offsets for different companies, so they usually have to be custom coded.

Script Component Type Dialog

First drop a “Script Component” onto your Data Flow.

Select “Source” as the Script Component Type and click OK.

Then double-click the newly added component to edit the properties.

Note that you need to add the correct output columns before adding the script or else it won’t compile.


Script Source Outputs

I’ve renamed the output here to “Dates” to help further down the Data Flow.

Click the “Add Column” button to add new columns as shown here. Note that I’ve also changed the data type of each column to match my source table. This requires casts in the script, but it’s easier than conversions in the data pipeline.

Finally, go back to the script part of the dialog and click the “Edit Script” button to launch Visual Studio Tools for Applications (VSTA).

In the resulting window, add your code to generate the date stream to the CreateNewOutputRows() function.
The general form is:

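var output = this.DatesBuffer;  // Get the output buffer

// Example range only; substitute the dates your dimension needs
var date = new DateTime(2000, 1, 1);
var endDate = new DateTime(2020, 12, 31);

while (date <= endDate)  // Loop through your dates
{
    output.AddRow();

    // Set the various column values, e.g.
    // (add a cast if the output column type is narrower than int)
    output.CalendarYear = date.Year;

    // Increment the date
    date = date.AddDays(1);
}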

The full script is in the attached sample package where I’ve also added a script destination that does nothing with the data. Attach a data viewer to see what output is generated.

Date Data

From here you can manipulate the data, and pipe it to your dimension table from within the pipeline.

DateSourceSample.zip (27.08 KB)

posted on Thursday, April 16, 2009 7:52:34 PM (GMT Daylight Time, UTC+01:00)
# Tuesday, February 24, 2009

One of the big problems I have with some new customers is the knowledge gap between where they are and where they need to be in order to be successful. Given that people can read much faster than they can absorb information from a presentation, it makes sense to have a reading list.

I’ve previously blogged about some Analysis Services resources, but new and better content is appearing all the time. In addition to that article, here are my current reading recommendations for anyone planning a BI or Analysis Services project. Know this lot backwards and you should have a good head start.

Everyone

SQL Performance Tuning with Waits and Queues

Analysis Services Performance Guide

Identifying and Resolving MDX Query Performance Bottlenecks in SQL Server 2005 Analysis Services

The Data Loading Performance Guide

Report Server Catalog Best Practices

Reporting Services Performance Optimizations

New Data Warehouse Scalability Features in SQL Server 2008

Scaling Up Your Data Warehouse with SQL Server 2008

Architects and Developers

OLAP Design Best Practices for Analysis Services 2005

Best Practices for Data Warehousing with SQL Server 2008

Analysis Services Processing Best Practices

Top 10 SQL Server Integration Services Best Practices

Top 10 Best Practices for Building a Large Scale Relational Data Warehouse

Operations and Support

Storage Top 10 Best Practices

SQL Server Best Practices

IIS Performance Tuning

Resolving Common Connectivity Issues in SQL Server 2005 Analysis Services Connectivity Scenarios

In fact, pretty much anything on the SQLCAT website is pure gold.

posted on Tuesday, February 24, 2009 10:15:39 AM (GMT Standard Time, UTC+00:00)
# Monday, December 22, 2008

There are no hard and fast rules but the goal is to reduce the time taken to extract data from a source system and reduce the amount of work you have to do with the extracted data. The numbers quoted here are the ones I use as a starting point but you need to measure to determine the best values.

Don't do an incremental extract if:

  • There isn't much data in the source table (less than 100k rows)
  • There is enough change in the source table to require that you read most of it each time (for example if more than half the rows change between extracts)
  • The data in the source table is used for periodic snapshots (for example a balance sheet) and you need to track how a table changes at particular points in time

Do an incremental extract if:

  • There is a lot of data in the source table
  • Rows are only ever added to the source table (i.e. rows are not updated)
  • You need to track each and every change to a source row
  • The source data is updated a number of times before being closed and once closed is never updated again (also known as an accumulating snapshot)

In general, dimension tables match the first set of rules and are not extracted incrementally, whereas fact tables normally match the second set of rules.
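The rules above can be sketched as a simple decision function. This is only an illustration: the function name and parameters are hypothetical, the thresholds are the starting-point numbers quoted above, and it covers only the rules that can be expressed as a measurable number or flag. Requirements such as tracking every change to a source row still need a human decision.

```csharp
using System;

static class ExtractStrategy
{
    // Starting-point thresholds from the text; measure and adjust for your system.
    const long SmallTableRows = 100000;
    const double HighChangeRatio = 0.5;

    public static bool UseIncrementalExtract(long rowCount, double changedRatio, bool periodicSnapshot)
    {
        if (rowCount < SmallTableRows) return false;       // not much data: extract it all
        if (changedRatio > HighChangeRatio) return false;  // most rows must be read anyway
        if (periodicSnapshot) return false;                // point-in-time copies of the table are needed
        return true;                                       // large table with limited change
    }

    static void Main()
    {
        Console.WriteLine(UseIncrementalExtract(50000, 0.01, false));     // False
        Console.WriteLine(UseIncrementalExtract(20000000, 0.02, false));  // True
        Console.WriteLine(UseIncrementalExtract(20000000, 0.80, false));  // False
    }
}
```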

posted on Monday, December 22, 2008 6:51:04 PM (GMT Standard Time, UTC+00:00)
# Saturday, December 18, 2004

By way of Euan Garden:

Hey guys, how about a nice article on automated SSIS deployment for us ISVs that need to write installers for our products?

Also, don't forget to check out Darren and Allan's SQLDTS.com site reincarnated as SQLIS.com - SQL Server Integration Services on the Web.

posted on Saturday, December 18, 2004 2:35:34 PM (GMT Standard Time, UTC+00:00)