Checksum and Distributed Query Performance
By Bill Wunder
checksum is an interesting, even if little discussed, SQL Server 2000 function. The function, like the related binary_checksum, touts itself as a way to improve performance by creating a computed column as an alternative to indexing on long character columns.
As the documentation explains, if you create an index on the checksum of a column or column list, you materialize an integer hash index over the character data. The benefit is that you can then do an equality search of the character data on the integer column. For wide character columns in large tables this benefit can be significant. It is important to note a few limitations of such indexes. First, the checksum index should never be mistaken or misused as a unique index. There will always be a statistically small possibility that the same checksum integer will represent two or more very different character strings. Similarly, the checksum value may not be a particularly useful clustering key when the character data is volatile, since the checksum changes whenever the character data is updated. Finally, the level of abstraction of the checksum column, as with so many advanced SQL Server features, can create more problems than it solves for the relatively unsophisticated query writer. A range query of the character data, for example, gets no help from the hash index, so in a multimillion row table it could see very long execution times.
Still, checksum has some interesting possibilities, not only in
the context of a hashing index, but also in comparing columns,
rows, and even entire tables.
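To make the hashing index idea concrete, here is a minimal sketch of a computed checksum column and the collision-safe equality search that goes with it. The table and column names are illustrative only, not from any real schema:

-- Illustrative only: hash index over a wide character column.
create table dbo.WideThing (
    WideThingId int identity(1,1) primary key,
    LongDescription varchar(2000) not null,
    -- computed integer hash of the wide column
    LongDescriptionHash as checksum(LongDescription)
)

create index ix_WideThing_LongDescriptionHash on dbo.WideThing (LongDescriptionHash)

-- Equality search: seek the narrow integer index first, then re-check the
-- character value to weed out the rare hash collision.
select  WideThingId
from    dbo.WideThing
where   LongDescriptionHash = checksum('some long descriptive text')
  and   LongDescription = 'some long descriptive text'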
As you may be able to see already,
I’d suggest that any checksum hashing index strategy be
approached with care and thorough testing. It’s going to be one
of those possibilities that is effective in only a limited
number of situations. Careful analysis will be needed to
determine how long the character column(s) must be and how many
rows a table might hold before the computed column is
worthwhile. As with any good indexing strategy, a complete understanding of all the queries that the data and access paths must support will be necessary to make good decisions. Since the resulting hash index will lack transparency to the typical developer, there is even a pretty good possibility that the computed column index could cause some surprising (read: poorly performing) query plans, not to mention the trade-offs that might arise if full text indexing is a piece of the puzzle.
Using checksum as a tool to expedite data comparisons is a more obvious way to put checksum to work quickly and with surprising ease. In our shop we recently found just such an opportunity in the process of building out two redundant and ostensibly independent data centers.
Because our data changes in many
ways, we elected not to go with SQL Server replication. We
weren’t comfortable with the customization, latency,
heterogeneous data import failure/recovery vagaries, and
snapshot synchronization issues we faced in order to implement a
native SQL Server replication. Instead we’ve rolled our own
application level synchronization tactics. Data Feeds are
processed independently at each data center. OLTP activity is
replicated at the application layer and the replication
processing is fully disconnected from the ACID properties of the
OLTP processing. Since our purpose in building redundant data centers mandates full application operability should one data center become unusable, it made sense to make sure each location had all the infrastructure necessary to stand on its own. Adding the burden of SQL replication would have meant an additional layer of support and additional failover complexity.
The vast majority of our data
changes are the result of disparate data feeds. Some feeds come
as files of completely new data, some come as deltas, some as
new data sets supplemented by deltas, and some as “real time
streams” that may require inserts, updates, or deletes as
appropriate. And to reiterate, this data needed to be
assimilated by each data center independently.
We also have a few types of OLTP
activity changing the data sets. One is a document management
system for process and storage tracking of the large body of
feed data. Another type supports changes that individual users
can make to the way they choose to use the feed data. Typically
this can include login, many settings under the umbrella of
preferences, and notification or alert components. Yet another
is for application level administrators that may need to make
changes for a specific group of users. Typically both individual
user changes and administrator class changes occur via the
Internet. This is important because it means that some care is necessary to ensure that the user or administrator is not subjected to any point-in-time concurrency anomalies, such as might occur if a change is applied at one data center and the next request from that user hits the other data center before the change has fully propagated.
It is beyond my purpose here to get too caught up in the details of the applications. Suffice it to say that our solution includes a layer of checks to make sure the data is consistent between data centers. We’ve been able to put checksum – more specifically, binary_checksum and checksum_agg rather than the plain vanilla checksum function – to work checking for data drift between our active/active data centers. Using checksum has proven to be many times faster than the character column comparisons that would otherwise be necessary.
Why binary_checksum? Because binary_checksum will produce a different result for “Bill” and “bill”, while checksum – at least with our case-insensitive collation – will not recognize the difference.
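A quick query shows the difference; the behavior described in the comments assumes a case-insensitive collation like ours:

-- checksum ignores case under a case-insensitive collation;
-- binary_checksum compares the underlying bytes.
select  checksum('Bill')        as checksum_Bill,
        checksum('bill')        as checksum_bill,        -- same value as the line above
        binary_checksum('Bill') as binary_checksum_Bill,
        binary_checksum('bill') as binary_checksum_bill  -- different value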
Some data, such as file system paths and identity values, we may want to exclude from the comparison because they will always be different. Other data, such as text and image columns, is “noncomparable” – I love that word – and also must be excluded from checksum comparisons to avoid fatal errors. For these reasons we couldn’t use a generic binary_checksum(*) query. Instead we needed to build column lists to feed to binary_checksum by way of a dynamically created execution string. The dynamic string inserts into an already existing temp table. (In the example a second column in the temp table is defaulted to 0. Later we will update that column with the checksum_agg result from the other data center.) This provides an integer representation of the columns we want to compare. Then we compute a checksum_agg over all rows of the table for comparison with an integer resolved in the same way at the other data center. Note that we also support a where clause through the @Filter variable. The where clause allows us to check only rows that fall in a date range or that meet any other identifying criteria. Using the dynamic string in an EXEC() keeps us from failing if the total length of the string exceeds what a single variable could hold:
exec('insert #TableCheckSum' +
     ' select checksum_agg(binary_checksum(' + @CompareColumnList + ')) , 0' +
     ' from [' + @lnkServerName1 + '].[' + @DatabaseName + '].[' +
     @SchemaName + '].[' + @TableName + '] t ' + @Filter)
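For reference, here is a hedged sketch of how the supporting pieces above might be prepared. The temp table column names and the exclusion rules are my assumptions based on the description, and the sketch assumes the local database carries the same schema as the remote table:

-- Assumed preparation of the pieces used above; names are illustrative only.
declare @SchemaName sysname, @TableName sysname, @CompareColumnList varchar(8000)
select  @SchemaName = 'dbo', @TableName = 'SomeFeedTable'

-- two columns: one for each data center's checksum_agg result
create table #TableCheckSum (CheckSum1 int null, CheckSum2 int not null default 0)

-- comma separated, bracketed column list, skipping noncomparable text/image
-- columns and identity columns that will always differ between data centers
select  @CompareColumnList = isnull(@CompareColumnList + ', ', '') + '[' + COLUMN_NAME + ']'
from    INFORMATION_SCHEMA.COLUMNS
where   TABLE_SCHEMA = @SchemaName
  and   TABLE_NAME   = @TableName
  and   DATA_TYPE not in ('text', 'ntext', 'image')
  and   COLUMNPROPERTY(OBJECT_ID(TABLE_SCHEMA + '.' + TABLE_NAME),
                       COLUMN_NAME, 'IsIdentity') = 0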
EXEC()’s ability to accept more than 4000 or 8000 characters in the dynamic string has its place. However, for tables with millions of rows or more, the four-part-name construct above causes the binary_checksum to be calculated on the local server rather than on the server where the data lives. That means we move all the data across the network just to get a single-valued result. In those cases we found it much more efficient to dynamically build an OPENQUERY() string and execute it, which in turn meant making sure the length of the dynamic string did not exceed the 4000 character limit allowed by OPENQUERY(). For queries that fit within that limit we can realize undeniable performance gains with a little dynamic SQL gymnastics:
select @statement =
    ' select @Checksum = SumOfCheckSum' +
    ' from OPENQUERY([' + @linkedServerName + '],' +
    ' ''select checksum_agg(binary_checksum(' + @CompareColumnList + ')) as SumOfCheckSum' +
    ' from [' + @DatabaseName + '].[' + @SchemaName + '].[' + @TableName + '] t ' + @Filter + ''')'

exec sp_executesql @statement, N'@Checksum int output', @Checksum output
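To close the loop, here is a quick sketch of the comparison itself, using the temp table column names assumed in the earlier sketch: record the result from the other data center and flag any drift.

-- Assumed final step: store the remote checksum_agg and compare the two sides.
update #TableCheckSum set CheckSum2 = @Checksum

if exists ( select 1 from #TableCheckSum
            where isnull(CheckSum1, 0) <> isnull(CheckSum2, 0) )
    print 'Data drift detected in [' + @SchemaName + '].[' + @TableName + ']'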
As you can see, with distributed queries
having your cake and eating it too is not simple, but
if necessary it can be done. Truthfully, it would probably be a
more elegant and scalable solution to submit the dynamically
generated query from a "real" client using ADO or a .Net Data
Adapter into a disconnected recordset rather than the temp
table. The main advantage of the linked server approach in the
situation presented here is the very short time required to get
a useful data drift tool into operation.
Hopefully this gives you a feel for some of the cool stuff possible with T-SQL’s checksum functionality in SQL Server 2000, some ideas about how to optimize distributed query processing, and a sense of when distributed query processing might be a good tactic. I’d be interested in hearing about other ways linked servers are serving you well, and especially any linked server optimization tips I might be able to pass along to other readers.
Bill