				 
Retrofitting Table Level Delete/Archive Strategies - Classifications of Logging Tables
				
By Bill Wunder
Welcome back for the second part of a discussion concerning remedial delete/archive methods for logging type tables that feature untenable and unbounded growth due to poor or inadequate application design. If you didn't catch part one, I encourage you to give it a read before you continue. In that piece I defined a logging table as a genre of data container with a high volume of inserts that is well organized on its clustering key. Then I explored the problems and pitfalls brought on by otherwise successful application designs that hit the wall due to unbounded logging table growth, the possibility of using data partitioning to address those problems, and the advantages of better design efforts over the need to implement a delete/archive strategy after an application is deeply entrenched in the production environment. At this point it seems apropos to continue that discussion with a breakdown of some of the general types of logging tables. The value in classifying logging tables is that the general type can then guide us to a suitable partitioning design and rollover process.
				Wholly Unnecessary 
Sometimes the application does not directly use the logged data. The rows are thrown into a table "just in case": "just in case" an ad hoc audit becomes necessary, or "just in case" a need to debug application behaviors arises, or "just in case" some other unknown, undefined, or speculative purpose suddenly becomes known, defined, or meaningful. If there truly is no defined purpose for the data and the business and/or development staff is unable to see its way clear to eliminate the unnecessary processing cycles and storage requirements needed to support the data (I can't believe how many times I've run into this conundrum), then truncate table may be as good a delete/archive strategy as any. If you can find a consensus that there is no need to maintain a history of the logged data once it reaches a certain and relatively young age, then all that remains is to set a truncation frequency and create a SQL Agent job. If you do want to have access to the historically logged data - again probably "just in case" - then immediately after each regular full backup clear the table and let it start growing again. If the data is ever needed you can go back to the tape archive for the desired day and load a snapshot of the database, complete with the desired log data, onto a playground SQL Server. Of course, this strategy is only as deep as your tape archive, so be sure to understand and adjust the tape rotation policy as necessary to support whatever perceived business requirement you are trying to satisfy in the plan to save the ostensibly worthless historical rows.
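For the "just in case" table with no real retention requirement, the scheduled cleanup can be as simple as a single job step. The following is a minimal sketch of the T-SQL you might put in that SQL Agent job step; the table name dbo.PageAccessLog is a hypothetical stand-in, and you would schedule the step to run right after the regular full backup if you want the discarded rows to remain recoverable from tape.

-- Hypothetical job step: clear a "just in case" logging table.
-- dbo.PageAccessLog is an assumed name for illustration only.
SET NOCOUNT ON

-- Record how many rows are being discarded so the job history shows
-- how much "just in case" data accumulates between truncations.
DECLARE @RowsDiscarded int
SELECT @RowsDiscarded = COUNT(*) FROM dbo.PageAccessLog

PRINT 'Truncating dbo.PageAccessLog: discarding '
    + CAST(@RowsDiscarded AS varchar(20)) + ' rows'

-- TRUNCATE TABLE deallocates the data pages rather than logging each
-- row delete, so it stays fast no matter how large the table has grown.
TRUNCATE TABLE dbo.PageAccessLog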
If you are keeping relatively large reserves of data "just in case" there ought to be a big neon yellow flag with a bold red question mark in the center waving over that pile of bits. Be sure not to be the one standing on such a mountain of data holding that flag, because it's more likely that you're perched on a pile of rotting garbage than on a store of gold ore, or, if there is gold, it's likely to be in such low concentration as to not be economically viable to extract. Keeping data "just in case" seems to be a surprisingly common phenomenon. I suggest that this is usually an ill-conceived attempt to anticipate future unspecified needs in software development: far too common an element of built-in inefficiency in otherwise quality application designs. In my view the realm of software design and development is quickly and surely evolving in the direction of rapid iteration cycles consistent with "just in time" (much different than "just in case") business process models, Rapid Application Development (RAD) philosophies, and Extreme Programming type software lifecycle models. This evolution is counter to such well-meaning but unproductive tactics as keeping "just in case" data and other attempts to anticipate and deploy guesses as to future needs in software development. Even proponents of anticipation in software design recognize the perils and heretofore wasted time involved in such development practices. In general I think it is a more effective development strategy to stick to the specifications. If there must be guessing, it should at least be limited to the design phase.
				 
From the logging examples proposed in the first part of this discussion, consider the logging of each person's access to a specific web page. Let's suppose that the original requirement was to record the number of times each user accessed a web site. Then, perhaps in an attempt to be thorough, a well-meaning developer took it upon himself to extend that requirement and record each user's access to each web page on the site. As it happened, an interest in this information never materialized, and the web master is relying on page hit counts readily available from the web server tool suite to determine usage patterns. We now have either a large table of unusable heterogeneous data containing an access log for every page on the web site, or possibly multiple tables each containing access logging information for specific pages, depending upon which blind alley our developer took us down. In either case the data is steadily growing at whatever rate people are looking at the web pages, and the growth is unbounded. Quite possibly the developer is long gone and the web site is now oozing outside of its allocated database storage limits. The right thing to do in this case is remove the unnecessary logging. In the interim, a safety net solution might be to regularly truncate the table(s) via a SQL Agent job.
Hopefully such situations are the exception, but if that is completely true then I'd have to say I seem to find myself in some truly exceptional shops.
				History of the World 
At the other end of the spectrum is logged data that is heavily referenced by the application and must be available in its purest form for an extended period of time or even in perpetuity. Often the data container is well known and must be quickly accessible. Well known in this case means that the table is referenced by multiple stored procedures, and quickly accessible implies that multiple indexes may be necessary. At its most extreme, the table is subject to updates as well as the logging inserts and selects. It may be difficult or impossible, from a data consistency perspective, to safely remove even the oldest of rows from the table. If such tables cannot be sent off to the data warehouse's star schema you may have one of the most difficult of logged data maintenance problems.
If the business decision comes down that you can never delete from the table, at least you won't have to address the delete processing dilemma. You will still need to create a means to monitor and manage the ever increasing storage requirements and - depending upon the impact of update processing and the variety of access paths - the table fragmentation maintenance overhead. If the business requirements dictate that a very large amount of data must remain yet the oldest records must be removed from the collection, you will indeed have your hands full. The local table partitioning schemes that are my focus in this discussion are not likely to be practical in this scenario. For this you'll want to fully investigate your scale-up and scale-out options to find a workable answer. Almost certainly in this situation it will be necessary to make significant application layer changes to support radical storage changes such as moving the data container into a distributed and partitioned storage design. A Books Online keyword search on "distributed partitioned views" is one good place to continue your research on managing a data design that requires support in the unbounded growth mode.
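To give a flavor of what that research will turn up, here is a minimal sketch of a distributed partitioned view. The server, database, table, and column names are purely hypothetical; each member table would live on a different SQL Server reachable through a linked server definition, and each carries a CHECK constraint on the partitioning column so the optimizer can skip the remote members a query doesn't need.

-- Hypothetical member tables, one per server, each constrained to one
-- year of history on the partitioning column LoggedDate:
--   SERVER1.LogDB.dbo.EventLog_2002  CHECK (LoggedDate >= '20020101' AND LoggedDate < '20030101')
--   SERVER2.LogDB.dbo.EventLog_2003  CHECK (LoggedDate >= '20030101' AND LoggedDate < '20040101')

CREATE VIEW dbo.EventLog
AS
SELECT * FROM SERVER1.LogDB.dbo.EventLog_2002
UNION ALL
SELECT * FROM SERVER2.LogDB.dbo.EventLog_2003
GO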
The DNA sample logging would probably be kept around indefinitely. There may be tens of thousands of DNA samples, but fortunately not tens of millions of samples in this case. Most likely the logged row recording the receipt of each sample should never be updated, to assure an accurate and unchanged history of when DNA samples entered the system, yet access to even the oldest samples must remain certain. In all likelihood the range and domain of this data could effectively be housed in a single logging table, and unbounded table growth could be supported by a relatively modest storage requirement. Most likely unbounded table growth was not considered at design time and by pure luck can probably be supported unchanged for the life of the application. It's when the circumstances necessitate keeping much larger row collections around, and intact in perpetuity, that you will find the need to explore the possibilities listed above.
				 
				Divide and Conquer 
Hopefully you don't have to deal with the Wholly Unnecessary use case, and as stated the History of the World case likely has no delete/archive requirement appropriate for our discussion of effective local partitioning strategies. If you do find that you must attack such architectural issues with the strategies presented here, chances are that there are some forces at work in your shop that are well outside the realms of normal and reasonable software development philosophy (you have my sympathy). That's not to say the challenges that do fall into the range of normal and reasonable development philosophies are easy, but it does provide some hope that we can respond to the delete/archive requirements without the need to go back to the drawing board to support the History of the World or to implement complex and/or esoteric processes to support obtuse business requirements. Even though the unbounded growth model is usually the design default, it is almost never the correct or appropriate model. In most situations I suggest there is a practical way to locally partition a table to achieve optimal delete/archive efficiencies with minimized application effects and not too much brain pain. What I don't want to do is limit your possibilities to only the partitioning schemes presented here. The first step toward solving any delete/archive problem ought to involve some thought about what makes the most sense in that particular situation. In many - perhaps most - circumstances a data partitioning strategy will provide a satisfactory and relatively easy growth management resolution.
Between those situations where we don't really need any of the data and those where we need to keep and use it all, I suggest we will find most of the real world situations. The actual designs, and therefore the actual growth management requirements, tend to fall at different spots along a continuum between those two extremes. Your exact situation will probably not fit neatly into any one classification. For the purposes of this discussion I'm going to proceed with the understanding that you will need to adequately evaluate your unbounded growth problem and that, upon completion of that evaluation, you feel some confidence that a partitioning strategy will serve to manage that growth problem. You may need to keep only a small window of logged data for short term purposes that we can think of like an episode of Saturday Night Live: make the most of it in the moment, but there's not much need to remember much about it tomorrow. Perhaps you need to keep only enough rows to satisfy the smallest degree of aggregation you want to derive from the data, such as reports requested for a day, or perhaps because the data loses its validity in a full business lifecycle OLTP system, such as unfulfilled orders that have subsequently spawned back-orders.
At the next classification you may have a somewhat more heterogeneous application requirement. Most of the time data models will fall into this Middle of the Road classification. A typical example would be the need to provide a month's worth of drill-down capability for the daily report usage aggregations, or a processing requirement that needs data older than the primary application usage of the data, such as a 15 day moving average stock level computation for the aged unfulfilled orders. And finally, getting close to the History of the World end of the spectrum, you may have data that needs to stay around for a fair amount of time and is interesting to multiple application components for the Life of the Party, such as the pump bearing temperature samples in our earlier laid out collection of locally partitioned candidates. In that case tracking the bearing temperature trends along with a variety of other control measurements in the treatment plant can be useful to identify holistic patterns within the bearing, the pump, and even the plant in a more sophisticated factory floor monitoring application. There are plenty of other real world examples of a Life of the Party situation and, as you might expect, the details can get pretty hairy as the Life of the Party problem approaches the History of the World extreme.
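To make the continuum a bit more concrete before we get to the implementation details, here is a bare-bones sketch of the kind of local partitioned view those classifications lead toward. The table and column names are hypothetical, and the conversion tactics and rollover mechanics are the subject of the next installment; the essential idea is that each member table carries a CHECK constraint on the partitioning column, so the oldest member can be dropped or archived without a massive logged delete.

-- Hypothetical monthly member tables for a Middle of the Road logging table.
CREATE TABLE dbo.OrderLog_200401
   (OrderLogId int NOT NULL,
    LoggedDate datetime NOT NULL
        CHECK (LoggedDate >= '20040101' AND LoggedDate < '20040201'),
    OrderDetail varchar(100) NULL,
    CONSTRAINT PK_OrderLog_200401 PRIMARY KEY (OrderLogId, LoggedDate))

CREATE TABLE dbo.OrderLog_200402
   (OrderLogId int NOT NULL,
    LoggedDate datetime NOT NULL
        CHECK (LoggedDate >= '20040201' AND LoggedDate < '20040301'),
    OrderDetail varchar(100) NULL,
    CONSTRAINT PK_OrderLog_200402 PRIMARY KEY (OrderLogId, LoggedDate))
GO

-- The application sees one view; the CHECK constraints let the optimizer
-- touch only the member table(s) a query actually needs.
CREATE VIEW dbo.OrderLog
AS
SELECT OrderLogId, LoggedDate, OrderDetail FROM dbo.OrderLog_200401
UNION ALL
SELECT OrderLogId, LoggedDate, OrderDetail FROM dbo.OrderLog_200402
GO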
In the next and final installment of this series we will go into the local partitioning implementation coding details and the differences useful to convert an ever-growing problem table that falls somewhere along the Divide and Conquer continuum into a locally partitioned view. We'll consider tactics for converting to the partitioned arrangement as well as rollover strategies necessary to keep the growth under control. You'll find some T-SQL examples that I hope you can use, in concert with any analytical insights enabled by the discussion up to this point, to help things run smoothly in your shop.
Bill