Retrofitting Table Level Delete/Archive Strategies - Unbounded Table Growth by Design
by Bill Wunder
In this discussion we will look at some data maintenance techniques for logging type tables. This will not be a discussion about the SQL Server transaction log device that records all transactions and the database modifications made by each transaction and is used by SQL Server for recoverability and replication. Instead, we'll be examining a particular type of application table that records events at the application level. Typically a logging table sees a relatively high volume of inserts and is designed so that rows are always added to the end of the data set. Optimally, the data set is ordered by the clustering key of the table to achieve this end. The main benefit of this design is that new rows do not suffer the overhead of b-tree split and promote processing at insert time. Logged data is sometimes referenced later by the application, but ideally it is not changed once it is inserted into the logging table. In practice it is far too common for a logging table to be used to track the state of a main "thing" or entity described in the initial logged event. Likewise it is far too common for an application design to consider only rows being added to a logging table and to omit any consideration for managing the unbounded growth of the data set. In this discussion we'll try to cover some ways to accommodate these table and application design deficiencies without the need to rewrite applications solely for the purpose of being able to maintain the data.
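To make the pattern concrete, here is a minimal sketch of the sort of logging table I have in mind. The SampleLog name reappears in the purge examples later in this discussion; the particular columns shown here are hypothetical, and the ever increasing identity value serves as the clustering key so that new rows always land at the end of the table.

-- A minimal sketch of a typical logging table (hypothetical columns).
-- The ever increasing identity value is the clustered primary key, so
-- each insert goes to the end of the table and avoids page splits.
create table SampleLog (
    SampleLogId    int identity(1,1) not null,
    CompletionDate datetime not null default (getdate()),
    SampleData     varchar(100) null,
    constraint pk_SampleLog primary key clustered (SampleLogId)
)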
Examples of events that might get logged include DNA samples entering a university's research database, shipments sent from a clearing warehouse, story requests at an e-magazine, pump bearing temperature measurements for a water treatment plant, or even the access per user to an interesting web page. I'll be talking in more detail about these examples later in the discussion. First let's consider some general software development practices and begin to understand why and how managing the size and growth of logging tables can become problematic.
I rarely read about people having problems trying to maintain logging table growth using a datetime column index. At the same time I keep seeing applications in my shop choking on that very activity. Admittedly, part of the problem is that we wait until we have tens of GBs of data in a table before someone decides to try to schedule a job or execute a hastily thrown together one-off purge from Query Analyzer that is supposed to keep just the last six months, or two weeks, or only those rows with a Boolean value set to the incomplete state, or whatever comfort zone of history makes sense for a given logging table. Interestingly, the reasons for keeping n days or weeks or months or years of data, or only a particular state, aren't always well founded. Still, elusive business rules and questionable reasons notwithstanding, the impact on the system tends to be fairly predictable. The delete/archive processing is slower than expected, and the contention inflicted on an application as it attempts to slip additional records into the logging table during the longer than expected delete/archive operation is much worse than anticipated. In the worst cases, particularly when logging volume is high and the number of rows to be deleted is large, primary application queries will time out waiting for the delete operation to complete, or a hastily concocted ad hoc purge will irreversibly remove the wrong rows, or some other surprising calamity will befall the application as a direct result of trying to remove the presumably unneeded rows.
For a date based delete, for
example, we might typically see a first iteration approach such
as:
declare @Cutoff datetime
set @Cutoff = getdate() - 180

delete SampleLog
where CompletionDate < @Cutoff
And then, after the discovery that the delete operation is still executing hours later, holding a table lock along with many more granular locks and thereby preventing all inserts, the developer may kill the job - or complain to a DBA that there is "something wrong with the SQL Server" and the DBA kills the job. Then the developer goes back to the cutting room and returns with something like:
declare @Cutoff datetime
set @Cutoff = getdate() - 180

set rowcount 1000

delete SampleLog
where CompletionDate < @Cutoff

while @@rowcount = 1000
    delete SampleLog
    where CompletionDate < @Cutoff

set rowcount 0
Usually the second attempt also fails: the system is brought to its knees by severe disk queuing, or the delete simply cannot remove rows as fast as they are inserted, or the log device fills a disk in Full Recovery mode, or another of the myriad "gotchas" gets us, and we have to come up with yet another alternative. In most cases the situation is somewhat critical in the moment, and it may even be the middle of the night, so a truncate table issued on the log table can bring some temporary relief if the available disk space has been exhausted, or a server restart can get a server that has stopped responding to respond again, or maybe even a combination of a table truncation and a service restart is needed. Sooner or later, though, it becomes necessary to come up with a solution that is elegant, preserves the desired log data, and gets the aged rows off the server.
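For completeness, the emergency stopgap mentioned above amounts to nothing more elaborate than the statement below, shown against the same hypothetical SampleLog table. It completes quickly because the truncation is minimally logged, but it discards every row and will fail if the table is referenced by a foreign key or published for replication, so it is strictly a last resort.

-- Emergency relief only: discards ALL rows in the logging table.
-- Minimally logged, so it completes quickly and frees space without
-- swamping the transaction log, but nothing is preserved.
truncate table SampleLog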
One class of tactic that has proven flexible and effective for me in retrofitting some discipline into applications with unbounded logging table growth built into the design is some variation of locally partitioning the logging table. There doesn't seem to be a single one-size-fits-all partitioning strategy because logging table data usage is so different from application to application and even from logging table to logging table. The range of partitioning possibilities runs from the classic fully partitioned view, sketched briefly below, down to a legitimate regularly scheduled table truncation. A good understanding of the possibilities and a complete consideration of a logging table's usage are key to unlocking the right solution.
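By way of a preview, and assuming the same hypothetical SampleLog columns split into hypothetical monthly member tables, the fully partitioned view end of that range might look something like the sketch below: one member table per month and a UNION ALL view presenting them as a single logical table. The variations and tradeoffs are the subject of the next installment.

-- A minimal sketch of a local partitioned view over hypothetical
-- monthly member tables. The CHECK constraint on each member table
-- lets the optimizer touch only the relevant month, and aging out
-- old data becomes a matter of dropping (or bcp'ing out and then
-- dropping) the oldest member table and recreating the view.
create table SampleLog200401 (
    SampleLogId    int not null,
    CompletionDate datetime not null
        check (CompletionDate >= '20040101' and CompletionDate < '20040201'),
    SampleData     varchar(100) null,
    constraint pk_SampleLog200401
        primary key clustered (CompletionDate, SampleLogId)
)

create table SampleLog200402 (
    SampleLogId    int not null,
    CompletionDate datetime not null
        check (CompletionDate >= '20040201' and CompletionDate < '20040301'),
    SampleData     varchar(100) null,
    constraint pk_SampleLog200402
        primary key clustered (CompletionDate, SampleLogId)
)
go

create view SampleLogView
as
select SampleLogId, CompletionDate, SampleData from SampleLog200401
union all
select SampleLogId, CompletionDate, SampleData from SampleLog200402
go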
My first preference for planning delete/archive strategies is to do so when the application or application subsystem is still in design. Of course I frequently hear the farfetched - even if somewhat founded in reality - argument that the system with gigabytes of log data is still in design. Reluctant as I am to wander into other people's fantasies, I won't delve too deeply into that position. Suffice it to say that when I say planning delete/archive in the design phase I mean empirically deciding how much log data to keep, exploiting the features of the architecture to support the delete/archive strategy, and actually testing the delete/archive processing in the integrated system testing environment with a specific test plan element to measure design compatibility and contention issues between the delete/archive process and the application. Designing for unbounded growth of data that does not need to be kept indefinitely is akin to planning a trip to Mars and holding off on how to achieve the return trip until you get to the red planet. The truth is I have yet to work in a shop that considers delete/archive a legitimate aspect of application development. (That is to say, I know many developers that are now permanent Martian residents.) I have, however, over and over again, lived through the production crises and failures that come from such denial in the software lifecycle. It's an interesting, even if frustrating, problem. My hope in sharing some solutions here is to alleviate some of the frustration for you if you find yourself having to fight such problems of inadequate planning and design, and also to provide some apologetics for proper design fundamentals. The latter owing to the fact that shoehorning a delete/archive strategy onto an application, as we must do in these cases, is somewhat complicated, potentially fragile, and always a more difficult path than thinking about application needs and realities during the early design phases of a project. Or, to complete my metaphor, we are much more likely to have the resources to get home from Mars if we plan the whole trip before the launch date.
Since I'm describing only a retrofit to an existing application in this discussion, I want to be clear that one of the objectives of such an initiative needs to be to limit the changes needed in the application logic as much as possible. I also want to be clear that I am making no certification as to the legitimacy of the logging table data in this discussion. At the point of disruption to production it is incumbent upon the DBA to first do what is necessary to stabilize the production environment and then - if necessary and appropriate - to invite a consideration toward redesign of the application or subsystem. If we were designing holistically and a priori the possibilities would not be so limited, but in this discussion, where the application context is a pre-existing and in situ given (i.e. already in production), it is of the utmost importance that we do not introduce monumental application changes in order to provide an effective delete/archive process. It almost seems a little silly at one level - don't you agree? - to have to take this narrow approach, but in the context of regression testing and iterative development there is clearly an imperative not to create a new application solely so the application can effectively rid itself of unneeded data.
With that said, I also want to state that it may not always be possible to completely avoid a rewrite when delete/archive has been ignored at the front end of the development process. It is even more likely that a retrofit delete/archive can and will degrade other application behaviors. Adding delete/archive capabilities by definition involves touching the fundamental underpinnings of a data model, and therefore compromises are unavoidable. Knowing when to step back and alter an application design and when to forge ahead with a delete/archive retrofit can be treacherous and deserves the same care and consideration as any other software development iteration. There should always be checkpoints and collaboration with any software revision - just as these are crucial in the initial design - to keep from ending up with an unworkable solution.
Building from this groundwork, in my next installment I'll move into a discussion that attempts to classify logging tables, and then I'll propose some variations of data partitioning and suggest some tools and tricks that might be used to effectively support a delete/archive approach for each of those general classifications. And since we're all stuck at least 35 million allegorical miles from home until we can get this worked out, I'm hoping you find the time to check it out.
Bill