# Monday, January 10, 2005
« Merry Christmas, Chanukah, Ramadan, Yule... | Main | Page Rank »

I’ve been a bit lazy with blogging recently so I wanted to talk about a question that came up in the office today. The question was about the utility and performance of unique identifier columns as primary keys.

A unique identifier or GUID is a 16 byte data type; four times the size of an integer normally used for primary keys. This means that any index built using a GUID will be larger and contain less records per page. I managed to find this thread on p2p.wrox.com which seems to indicate there are some special index optimisations (cascading GUID blocks) in use for GUIDs. This is unconfirmed though so maybe someone from the SQL Server team can provide a definitive answer.

The size of the datatype can add up when used in data warehouse fact tables because each fact may have a large number of dimensions. For this reason I would probably stick to integer dimension keys.

GUIDs, by their very nature, are non-sequential which completely screws clustered indexes. There is a new sequential generation algorithm in Yukon which solves the issue although I can’t think of a reason why you would use a clustered index on a GUID because they are better suited for range queries. At Exony we hardly ever create a clustered index on the primary key as it’s normally reserved for datetime or date key columns.

The above seems quite negative but I think there are some good points to remember. There are some definite ease of use considerations with GUIDs when compared to IDENTITY columns:

  • Identities need to be reseeded in replicated systems to ensure inserts don’t clash
  • For GUID columns, you can either supply a value or rely on a default safe in the knowledge you won’t clash values
  • GUIDs can be generated at build time and embedded in SQL or load files simplifying installation code
  • GUIDs can be generated on remote systems or in higher system layers without issue

To end I think the general ease of GUID use outweighs any performance degradation in normal operations, although probably not in data warehouses. I see that the SQL Server team use GUIDs for primary keys (see the Reporting Services database for details) so there can’t be too much wrong with them.

This posting is provided "AS IS" with no warranties, and confers no rights.
posted on Monday, January 10, 2005 9:57:21 PM (GMT Standard Time, UTC+00:00)  #    Comments [3] Trackback
Related posts:
Do you have great business intelligence skills?
Integration Services Design Principals
Physical Data Warehouse Design
Analysis Services Essential Reading
When should you do an incremental extract?
Post TechReady and Popfly Invites
Tuesday, January 11, 2005 1:00:09 AM (GMT Standard Time, UTC+00:00)
If you don't need replication, I'd probably use ints. But if you're using replication, definitely use a GUID for two reasons.

1. Identity management doesn't work great all the time. I ran into problems where the identity range management couldn't keep up with the inserts causing all sorts of ugly problems.

2. if you set up replication, you'll get a rowguid column added to your tables anyways.

When setting a guid as primary key, don't forget to set it as the rowguid column too, or you'll have two guids in your table.
Tuesday, January 11, 2005 10:09:12 AM (GMT Standard Time, UTC+00:00)
I can't comment on point 1 and point 2 may true for merge replication.

Transactional replication, on the other hand, does not work as described because the msrepl_tran_version column added by the replication system is not suitable for a primary key - it's value is updated continually by replication. We learned this the hard way when we tried to put an index on the column and screwed perf.

James.
Wednesday, January 12, 2005 1:02:30 AM (GMT Standard Time, UTC+00:00)
Hmmm, I don't recall the msrepl_tran_version column. I'll have to refresh my memory on that one. We had a setup with both merge and transactional repl.
Comments are closed.