Checksum and Distributed Query Performance

By Bill Wunder

 

checksum is an interesting, even if little discussed, SQL Server 2000 function. Like the related binary_checksum, it is touted as a way to improve performance by creating a computed column as an alternative to indexing on long character columns.

 

As the documentation explains, if you create an index on the checksum of a column or column list you materialize an integer hash index on the character data. The benefit is that you can then do an equality search of the character data on the integer column. For wide character columns in large tables this benefit can be significant. It is important to note a few limitations of such indexes. First, the checksum index should never be mistaken or misused as a unique index. There will always be a statistically small possibility that the same checksum integer represents two or more very different character strings. Similarly, the checksum value may not be a particularly useful clustering key when the character data is volatile, since the checksum is updated whenever the character data is updated. Finally, the level of abstraction of the checksum column, as with so many advanced SQL Server features, can result in more problems than it solves for the relatively unsophisticated query writer. A range query of the character data in a multimillion row table supported only by the hash index, for example, could see very long execution times. Still, checksum has some interesting possibilities, not only in the context of a hashing index, but also in comparing columns, rows, and even entire tables.
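To make the pattern concrete, here is a minimal sketch of the hashing index technique; the table and column names are hypothetical, not from any real system. The integer hash column is computed and indexed, the query seeks on the hash, and the character column is still compared to weed out hash collisions:

-- Hypothetical table with a long character column we want to search by equality.
create table dbo.Document
    (DocumentId int identity(1,1) primary key,
     DocumentPath varchar(900) not null,
     PathHash as checksum(DocumentPath)) -- computed integer hash of the character column
go

-- Index the narrow integer hash instead of the wide character column.
create index ix_Document_PathHash on dbo.Document (PathHash)
go

-- Equality search: seek on the hash, then recheck the character value
-- because two different strings can produce the same checksum.
select DocumentId, DocumentPath
from dbo.Document
where PathHash = checksum('\\fileserver\share\contract.doc')
  and DocumentPath = '\\fileserver\share\contract.doc'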

 

As you may be able to see already, I’d suggest that any checksum hashing index strategy be approached with care and thorough testing. It’s going to be one of those possibilities that is effective in only a limited number of situations. Careful analysis will be needed to determine how long the character column(s) must be and how many rows a table might hold before the computed column is worthwhile. As with any good indexing strategy, a complete understanding of all the queries that the data and access paths must support will be necessary to make good decisions. Since the resulting hash index will lack transparency to the typical developer, there is even a pretty good possibility that the computed column index could cause some surprising - read that poorly performing - query plans. Not to mention the trade-offs that might arise if full text indexing is a piece of the puzzle.

 

Using checksum as a tool to expedite data comparisons is a more obvious opportunity to put the function to work quickly and with surprising ease. In our shop we recently found such an opportunity in the process of building out two redundant and ostensibly independent data centers.

 

Because our data changes in many ways, we elected not to go with SQL Server replication. We weren’t comfortable with the customization, latency, heterogeneous data import failure/recovery vagaries, and snapshot synchronization issues we would face in order to implement native SQL Server replication. Instead we’ve rolled our own application level synchronization tactics. Data feeds are processed independently at each data center. OLTP activity is replicated at the application layer, and the replication processing is fully disconnected from the ACID properties of the OLTP processing. Since our purpose in running redundant data centers mandates full application operability in the event one data center becomes unusable, it made sense to make sure each location had all the infrastructure necessary to stand on its own. Adding the burden of SQL replication would have meant an additional layer of support and additional failover complexity.

 

The vast majority of our data changes are the result of disparate data feeds. Some feeds come as files of completely new data, some come as deltas, some as new data sets supplemented by deltas, and some as “real time streams” that may require inserts, updates, or deletes as appropriate. And to reiterate, this data needed to be assimilated by each data center independently.

 

We also have a few types of OLTP activity changing the data sets. One is a document management system for process and storage tracking of the large body of feed data. Another supports changes that individual users can make to the way they choose to use the feed data; typically this includes login, the many settings under the umbrella of preferences, and notification or alert components. Yet another is for application level administrators who may need to make changes for a specific group of users. Typically both individual user changes and administrator class changes occur via the Internet. This is important because it means some care is necessary to ensure that the user or administrator is not subjected to point-in-time concurrency anomalies, such as might occur if a change is applied at one data center and the next request from that user reaches the other data center before the change has fully propagated.

 

It is beyond my purpose here to get too caught up in the details of the applications. Suffice it to say that our solution includes a layer of checks to make sure the data is consistent between data centers. We’ve been able to put checksum - more specifically binary_checksum and checksum_agg rather than the plain vanilla checksum function - to work helping us check for data drift between our active/active data centers. Using checksum has proven to be many times faster than the character column comparisons that would otherwise be necessary.
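In its simplest form the drift check is a single aggregate hash per table, computed independently at each data center and then compared as two integers. Here is a minimal sketch against a hypothetical table; as explained below, this shortcut only works when the table has no text or image columns and no columns that must be excluded:

-- Run the same statement at each data center and compare the two integers.
-- binary_checksum(*) hashes every column of each row; checksum_agg rolls the
-- row hashes up into one value for the whole result set.
select checksum_agg(binary_checksum(*)) as TableCheckSum
from dbo.UserPreference                  -- hypothetical table name
where LastModified >= '20040101'         -- optional filter, also hypothetical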

 

Why binary_checksum? Because binary_checksum will produce different results for “Bill” and “bill”, while checksum - at least under our case-insensitive collation - will not recognize the difference.
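A quick way to see the difference for yourself, assuming a case-insensitive collation such as the common SQL_Latin1_General_CP1_CI_AS default:

select checksum('Bill')        as checksum_mixed,
       checksum('bill')        as checksum_lower,        -- same value as above
       binary_checksum('Bill') as binary_checksum_mixed,
       binary_checksum('bill') as binary_checksum_lower  -- different value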

 

Some data, such as file system paths and identity values, we may want to exclude from the comparison because they will always differ between the data centers. Other data, such as text and image columns, are “noncomparable” - I love that word - and must also be excluded from checksum comparisons to avoid fatal errors. For these reasons we couldn’t use a generic binary_checksum(*) query. Instead we needed to build column lists to feed to binary_checksum by way of a dynamically created execution string. The dynamic string inserts into an already existing temp table. (In the example a second column in the temp table is defaulted to 0. Later we will update that column with the checksum_agg result from the other data center.) This provides an integer representation of the columns we want to compare. Then we compute a checksum_agg over all rows of the table for comparison with an integer resolved in the same way at the other data center. Note that we also support a where clause through the @Filter variable. The where clause allows us to check only rows that fall in a date range or meet any other identifying criteria. Using the dynamic string in an EXEC() keeps us from failing if the total length of the string exceeds the 4000 or 8000 character limit of a single variable:

 

 

exec('insert #TableCheckSum' +
     ' select checksum_agg(binary_checksum(' + @CompareColumnList + ')), 0' +
     ' from [' + @lnkServerName1 + '].[' +
     @DatabaseName + '].[' +
     @SchemaName + '].[' +
     @TableName + '] t ' + @Filter)
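For completeness, here is a rough sketch of the setup the dynamic insert above assumes. The temp table layout and the column list logic are not shown here, so treat the names and exclusion rules as illustrative only: the temp table holds one data center's aggregate plus a slot for the other's, and the column list skips identity columns and the noncomparable text, ntext, and image types (application-specific exclusions such as file system path columns would be handled the same way).

-- Temp table the dynamic insert targets. The second column defaults to 0 and
-- is updated later with the checksum_agg result from the other data center.
create table #TableCheckSum
    (CheckSum1 int null,
     CheckSum2 int not null default 0)

-- Build the comparable column list from the SQL Server 2000 system tables.
-- @TableName is assumed to be set already, as in the procedure above.
declare @CompareColumnList varchar(8000)
select @CompareColumnList = isnull(@CompareColumnList + ', ', '')
                            + '[' + c.name + ']'
from dbo.syscolumns c
where c.id = object_id(@TableName)
  and columnproperty(c.id, c.name, 'IsIdentity') = 0
  and type_name(c.xtype) not in ('text', 'ntext', 'image')
order by c.colid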

 

 

The ability to accept more than 4000 or 8000 characters in the dynamic exec string has its place. However, for tables with millions of rows or more, this construct causes the binary_checksum to be calculated on the local server rather than on the server where the data lives. That means we move all the data across the network just to get a single-valued result. In those cases we found it much more efficient to dynamically build an OPENQUERY() string and execute it. This meant making sure the length of the dynamic string did not exceed the 4000 character limit allowed by OPENQUERY(). For queries that fit within that limitation we can realize undeniable performance gains with a little dynamic SQL gymnastics:

 

      select @statement =
            ' select @Checksum = SumOfCheckSum' +
            ' from OPENQUERY([' + @linkedServerName +
            '], ''select checksum_agg(binary_checksum(' +
            @CompareColumnList + ')) as SumOfCheckSum' +
            ' from [' + @DatabaseName + '].[' +
            @SchemaName + '].[' +
            @TableName + '] t ' + @Filter + ''')'

      exec sp_executesql
            @statement
            , N'@Checksum int output'
            , @Checksum output
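Once sp_executesql returns, @Checksum holds the remote aggregate, and flagging drift is a matter of comparing it with the value captured earlier. A short sketch, assuming the #TableCheckSum layout outlined above:

-- Record the other data center's result and report any mismatch.
update #TableCheckSum set CheckSum2 = @Checksum

if exists (select 1 from #TableCheckSum where CheckSum1 <> CheckSum2)
    print 'Possible data drift detected in table [' + @TableName + ']'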

 

 

As you can see, with distributed queries having your cake and eating it too is not simple, but if necessary it can be done. Truthfully, it would probably be a more elegant and scalable solution to submit the dynamically generated query from a "real" client using ADO or a .NET Data Adapter into a disconnected recordset rather than a temp table. The main advantage of the linked server approach in the situation presented here is the very short time required to get a useful data drift tool into operation.

Hopefully this gives you a feel for some of the cool stuff possible with T-SQL’s checksum functionality in SQL Server 2000, some ideas about how you can optimize distributed query processing, and even a sense of when distributed query processing might be a good tactic. I’d be interested in hearing about other ways linked servers are serving you well, and especially any linked server optimization tips I might be able to pass along to other readers.

Bill