Retrofitting Table Level
Delete/Archive Strategies - Classifications of Logging Tables
By
Bill Wunder
Welcome back for the second part
of a discussion concerning remedial delete/archive methods for
logging type tables that feature untenable and unbounded growth
due to poor or inadequate application design. If you didn't
catch part one I
encourage you to give it a read before you continue. In that
piece I defined a logging table as a genre of data container
with a high volume of inserts that is well organized on its
clustering key. Then I explored the problems and pitfalls
brought on by otherwise successful application designs that hit
the wall due to unbounded logging table growth, the possibility
of using data partitioning to address the problems, and the
advantages of better design efforts over the need to implement a
delete/archive strategy after an application is deeply involved
in the production environment. At this point it seems apropos
to continue that discussion with a breakdown of some of the
general types of logging tables. The value in classifying
logging tables is that the general type can guide us to a
suitable partitioning design and rollover process.
Wholly Unnecessary
Sometimes the application does
not directly use the logged data. The rows are thrown into a
table "just in case". "Just in case" an ad hoc audit becomes
necessary or "just in case" a need to debug application
behaviors arises or "just in case" some other unknown, undefined
or speculative purpose suddenly becomes known, defined, or
meaningful. If there truly is no defined purpose for the data
and the business and/or development staff is unable to see its
way clear to eliminate the processing cycles and storage
required to support the data (I can't believe how many times
I've run into this conundrum) then truncate table may be as
good a delete/archive strategy as
any. If you can find a consensus that there is no need to
maintain a history of the logged data once it reaches a certain
and relatively young age then all that remains is to set a
truncation frequency and create a SQL Agent job. If you do want
to have access to the historically logged data - again probably
"just in case" - then immediately after each regular full backup
clear the table and start growing it again. If the data is ever
needed you can go back to the tape archive for the desired day
and load up a snapshot of the database complete with the desired
log data onto a playground SQL Server. Of course, this strategy
is only as deep as your tape archive strategy so be sure to
understand and adjust the tape rotation policy as necessary to
support whatever perceived business requirement you are trying
to satisfy in the plan to save the ostensibly worthless
historical rows.
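To make that concrete, here is a minimal sketch of such a truncation step, assuming a hypothetical logging table named dbo.ActivityLog; the intent is simply to clear the table immediately after the regular full backup completes so the discarded rows remain recoverable from that backup.

-- Minimal sketch: clear a "just in case" logging table right after the
-- nightly full backup so the discarded rows live on in the backup file.
-- dbo.ActivityLog is a hypothetical table name; run this as the job step
-- that follows the BACKUP DATABASE step in the maintenance job.
SET NOCOUNT ON

DECLARE @rows int
SELECT @rows = COUNT(*) FROM dbo.ActivityLog

-- TRUNCATE TABLE deallocates pages and is minimally logged, so it is
-- far cheaper than an equivalent DELETE of every row
TRUNCATE TABLE dbo.ActivityLog

PRINT 'Cleared dbo.ActivityLog (' + CAST(@rows AS varchar(12)) + ' rows discarded)'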
If you are keeping relatively
large reserves of data "just in case" there ought to be a big
neon yellow flag with a bold red question mark in the center
waving over such a pile of bits. Be sure not to be the one who
stands on that mountain of data holding the flag, because it's
more likely you're perched on a pile of rotting garbage than on
a store of gold ore, and if there is gold it's likely to be in
such low concentration that it is not economically viable to
extract. Keeping data "just in case" seems to be a surprisingly
common phenomenon. I suggest it is usually an ill-conceived
attempt to anticipate future unspecified needs in software
development: far too common an element of built-in inefficiency
in otherwise quality application designs. In my
view the realm of software design and development is quickly and
surely evolving in the direction of rapid iteration cycles
consistent with "just in time" (much different than "just in
case") business process models,
Rapid Application Development (RAD) philosophies and
Extreme Programming
type software lifecycle models. This evolution is counter to
such well meaning but unproductive tactics as keeping "just in
case" data and other attempts to anticipate and deploy guesses
as to future needs in software development. Even
proponents of anticipation in software design recognize the
perils and heretofore wasted time involved in such development
practices. In general I think it is a more effective development
strategy to stick to the specifications. If there must be
guessing, it should at least be limited to the design phase.
From the logging examples
proposed in the first part of this discussion, consider the
logging of each person's access to a specific web page. Let's
suppose that the original requirement was to record the number
of times each user accessed a web site. Then, perhaps in an
attempt to be thorough, a well-meaning developer took it upon
himself to extend that requirement and record the access of each
web page on the site by user. As it happened, an interest in
this information did not materialize and the web master is
relying on page hit counts readily available from the web server
tool suite to determine usage patterns. We now have either a
large table with unusable heterogeneous data containing an
access log to every page on a web site or possibly multiple
tables each containing access logging information for specific
pages depending upon which blind alley our developer took us
down. In either case the data is steadily growing at whatever
rate people are looking at the web pages and the growth is
unbounded. Quite possibly the developer is long gone and the web
site is now oozing outside of its allocated database storage
limits. The right thing to do in this case is remove the
unnecessary logging. In the interim a safety net solution might
be to regularly truncate the table(s) via a SQL Agent job.
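As an illustration of that safety net, the script below wires a nightly truncation into SQL Agent using the standard msdb job procedures; the job name, database name, table name, and schedule are all hypothetical placeholders to adapt to your own shop.

-- Sketch of a nightly SQL Agent job that truncates a hypothetical
-- page access logging table. Names and schedule are placeholders.
USE msdb
GO
EXEC dbo.sp_add_job
     @job_name = N'Nightly truncate of PageAccessLog'
EXEC dbo.sp_add_jobstep
     @job_name = N'Nightly truncate of PageAccessLog',
     @step_name = N'Truncate logging table',
     @subsystem = N'TSQL',
     @database_name = N'WebSiteDB',                    -- hypothetical database
     @command = N'TRUNCATE TABLE dbo.PageAccessLog'    -- hypothetical table
EXEC dbo.sp_add_jobschedule
     @job_name = N'Nightly truncate of PageAccessLog',
     @name = N'Nightly at 2:00 AM',
     @freq_type = 4,              -- daily
     @freq_interval = 1,          -- every 1 day
     @active_start_time = 20000   -- 02:00:00
EXEC dbo.sp_add_jobserver
     @job_name = N'Nightly truncate of PageAccessLog',
     @server_name = N'(local)'
GO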
Hopefully such situations are the
exception but if that is completely true then I'd have to say I
seem to find myself in some truly exceptional shops.
History of the World
At the other end of the spectrum
is logged data that is heavily referenced by the application and
must be available for an extended period of time or even in
perpetuity in its purest form. Often the data container is well
known and must be quickly accessible. Well known in this case
means that the table is referenced by multiple stored procedures
and quickly accessible implies that multiple indexes may be
necessary. At its most extreme, the table is subject to
updates as well as the logging inserts and selects. It may be
difficult or impossible from a data consistency perspective to
safely remove even the oldest of rows from the table. If such
tables cannot be sent off to the data warehouse's
star schema you may have one of the most difficult of logged
data maintenance problems.
If the business decision comes
down that you can never delete from the table at least you won't
have to address the delete processing dilemma. You will still
need to create a means to monitor and manage the ever increasing
storage requirements and - depending upon the impact of update
processing and variety of access paths - table fragmentation
maintenance overhead. If the business requirements dictate that
a very large amount of data must remain yet the oldest records
must be removed from the collection you will indeed have your
hands full. The local table partitioning schemes that are my
focus in this discussion are not likely to be practical in this
scenario. For this you'll want to fully investigate your
scale-up and scale-out options to find a workable answer.
Almost certainly in this situation it will be necessary to make
significant application layer changes to support radical storage
changes such as moving the data container into a
distributed and partitioned storage design. A Books Online
keyword search on "distributed partitioned views" is one good
place to continue your research on managing a data design that
requires support in the unbounded growth mode.
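By way of a taste of where that research leads, the fragment below is a minimal sketch of a distributed partitioned view that spreads a hypothetical order log across two federated servers by date range; every server, database, table, and column name here is illustrative only.

-- Sketch of a distributed partitioned view. Each member table carries a
-- CHECK constraint on the partitioning column so the optimizer can skip
-- remote members that cannot hold qualifying rows. SERVER1/SERVER2 are
-- linked servers; all names are hypothetical.
--
-- On SERVER1 (and similarly on SERVER2 for the 2004 range):
--   CREATE TABLE dbo.OrderLog2003
--       (OrderId int NOT NULL,
--        OrderDate datetime NOT NULL
--            CHECK (OrderDate >= '20030101' AND OrderDate < '20040101'),
--        CONSTRAINT pk_OrderLog2003 PRIMARY KEY (OrderDate, OrderId))

CREATE VIEW dbo.OrderLogAll
AS
SELECT * FROM SERVER1.SalesDB.dbo.OrderLog2003
UNION ALL
SELECT * FROM SERVER2.SalesDB.dbo.OrderLog2004
GO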
The DNA sample logging would
probably be kept around indefinitely. There may be tens of
thousands of DNA samples, but fortunately not tens of millions
of samples in this case. Most likely the logged row recording
the receipt of each sample should never be updated, to assure an
accurate and unchanged history of when DNA samples entered the
system, yet access to even the oldest samples is certain to be
needed. In all likelihood the range and domain of this data
could effectively be housed in a single logging table and
unbounded table growth could be supported by a relatively modest
storage requirement. Most likely unbounded table growth was not
considered at design time and by pure luck can probably be
supported unchanged for the life of the application. It's when
the circumstances necessitate keeping much larger row
collections around and intact in perpetuity that you will find
the need to explore the possibilities listed above.
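For illustration only, such a table might look something like the sketch below (the names are my own assumptions): a narrow row clustered on the receipt timestamp, so mostly chronological inserts keep the table well organized on its clustering key and the unbounded growth stays cheap.

-- Hypothetical sketch of a modest, never-updated sample receipt log.
-- The clustered index on ReceivedAt keeps chronological inserts
-- appending at the logical end of the table.
CREATE TABLE dbo.DnaSampleReceiptLog
    (SampleId      int         NOT NULL,
     ReceivedAt    datetime    NOT NULL DEFAULT (GETDATE()),
     ReceivedBy    varchar(30) NOT NULL,
     SourceLabCode char(10)    NOT NULL)
GO
CREATE UNIQUE CLUSTERED INDEX cdx_DnaSampleReceiptLog
    ON dbo.DnaSampleReceiptLog (ReceivedAt, SampleId)
GO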
Divide and Conquer
Hopefully you don't have to deal
with the Wholly Unnecessary use case and as stated the
History of the World case likely has no delete/archive
requirement appropriate for our discussion of effective local
partitioning strategies. If you do find that you must attack
such architectural issues with the strategies presented here
chances are that there are some forces at work in your shop that
are well outside the realms of normal and reasonable software
development philosophy (you have my sympathy). That's not to say
the challenges are easy when a design does fall within the range
of normal and reasonable development philosophies, but it does
provide some hope that we can respond to the delete/archive
requirements without the need to go back to the drawing board
to support the History of the World or to implement
complex and/or esoteric processes to support obtuse business
requirements. Even though the unbounded growth model is usually
the design default it is almost never the correct or appropriate
model. In most situations I suggest there is a practical way to
locally partition a table to achieve optimal delete/archive
efficiencies with minimal application impact and not too much
brain pain. What I don't want to do is limit your possibilities
to only the partitioning schemes presented here. The first step
toward solving any delete/archive problem ought to involve some
thought about what makes the most sense in that particular
situation. In many - perhaps most - circumstances a data
partitioning strategy will provide a satisfactory and relatively
easy growth management resolution.
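To make the local partitioning idea concrete before the implementation details arrive in the final installment, below is a minimal sketch of a local partitioned view over monthly member tables, using the pump bearing temperature example from part one; the object names and column list are assumptions of my own for illustration.

-- Sketch of a local partitioned view over monthly member tables. The
-- CHECK constraint on the partitioning column in each member lets the
-- optimizer touch only the relevant month, and retiring the oldest
-- month becomes a cheap DROP TABLE rather than a massive DELETE.
CREATE TABLE dbo.PumpTempLog200401
    (SampleTime  datetime     NOT NULL
         CHECK (SampleTime >= '20040101' AND SampleTime < '20040201'),
     PumpId      int          NOT NULL,
     BearingTemp decimal(5,1) NOT NULL,
     CONSTRAINT pk_PumpTempLog200401 PRIMARY KEY (SampleTime, PumpId))

CREATE TABLE dbo.PumpTempLog200402
    (SampleTime  datetime     NOT NULL
         CHECK (SampleTime >= '20040201' AND SampleTime < '20040301'),
     PumpId      int          NOT NULL,
     BearingTemp decimal(5,1) NOT NULL,
     CONSTRAINT pk_PumpTempLog200402 PRIMARY KEY (SampleTime, PumpId))
GO

CREATE VIEW dbo.PumpTempLog
AS
SELECT SampleTime, PumpId, BearingTemp FROM dbo.PumpTempLog200401
UNION ALL
SELECT SampleTime, PumpId, BearingTemp FROM dbo.PumpTempLog200402
GO

Queries and inserts go against the view; a query that constrains on SampleTime is resolved against only the member tables whose constraints could hold the requested range.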
Between those situations where we
don't really need any of the data and those where we need to
keep and use it all is where I suggest we will find most real
world situations. The actual designs and therefore the actual
growth management requirements tend to fall at different spots
along a continuum between those two extremes. Your exact situation
will probably not fit neatly into any one classification. For
the purposes of this discussion I'm going to proceed with the
understanding that you will need to adequately evaluate your
unbounded growth problem and that upon completion of that
evaluation you feel some confidence that a partitioning strategy
will serve to manage that growth problem. You may need to keep
only a small window of logged data for short term purposes that
we can think of like an episode of Saturday Night Live:
make the most of it in the moment, but there's not much need to
remember it tomorrow. Perhaps you need to keep enough rows to
satisfy the smallest degree of aggregation you want to derive
from the data, such as reports requested for a day, or perhaps
because the data loses its validity in a full business lifecycle
OLTP system, such as unfulfilled orders that have subsequently
spawned back-orders. At the next classification you may have a
somewhat more heterogeneous application requirement. Most of the
time data models will fall into this Middle of the Road
classification. A typical example would be the need to provide a
month's worth of drill down capability for the daily report
usage aggregations, or a processing requirement that needs data
older than the primary application usage of the data, such as a
15 day moving average stock level computation for the aged
unfulfilled orders (a query along those lines is sketched after
this paragraph). And finally, getting close to the History
of the World end of the spectrum, you may have data that
needs to stay around for a fair amount of time and is
interesting to multiple application components for the Life
of the Party such as the pump bearing temperature samples in
the collection of locally partitioned candidates laid out
earlier. In that case tracking the bearing temperature
trends along with a variety of other control measurements in the
treatment plant can be useful to identify holistic patterns
within the bearing, the pump, and even the plant in a more
sophisticated factory floor monitoring application. There are
plenty of other real world examples of a Life of the Party
situation and as you might expect, the
details
can get pretty hairy as the Life of the Party problem
approaches the History of the World extreme.
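As promised above, here is a rough sketch of the kind of query that drives a Middle of the Road retention requirement like the 15 day moving average: it shows why rows older than the primary application usage still have to stay within reach. The table and column names are hypothetical, and the correlated subquery is just one way to express the calculation.

-- Hypothetical sketch: 15 day moving average of stock level per item,
-- written as a correlated subquery. Rows up to 15 days old must remain
-- available even after the orders themselves have aged out of primary use.
SELECT  s.ItemId,
        s.SampleDate,
        (SELECT AVG(h.StockLevel)
           FROM dbo.StockLevelLog h
          WHERE h.ItemId = s.ItemId
            AND h.SampleDate >  DATEADD(day, -15, s.SampleDate)
            AND h.SampleDate <= s.SampleDate) AS MovingAvg15Day
FROM dbo.StockLevelLog s
ORDER BY s.ItemId, s.SampleDate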
In the next and final installment
of this series we will go into the local partitioning
implementation coding details and differences useful to convert
an ever growing problem table that falls somewhere along the
Divide and Conquer continuum into a locally partitioned
view. We'll consider tactics for converting to the partitioned
arrangement as well as rollover strategies necessary to keep the
growth under control. You'll find some T-SQL examples that I
hope you can use in concert with any analytical
insights enabled by the discussion up to this point to help
things run smoothly in your shop.
Bill