The SQL Thing that Wouldn’t Grow

By Bill Wunder

 

There is always a new and interesting problem in the dynamically changing software shop. Hardware upgrades, service packs, new releases, hot fixes, performance issues, new requirements, and on and on. Occasionally something doesn’t work as expected or something stops working like it had been working. Usually we just have to take the challenges in stride, work through the problems when possible else work around problems when we must. Truth is my favorite part of being a DBA in the trenches is just this sort of problem solving. I like it because it is necessary to gain some understanding of the true nature of a thing – and software is nothing if not a Thing! – in order to understand how to change it’s behavior. I have to admit that sometimes I feel like the a player in a B-grade horror film from the 1950s when I’m in the throws of a particular problem. Often the investigation necessary to understand what is going on is 99% of the task and the solution is a simple Thing and in the end I frequently move on to the next task once the current problem is solved silently muttering something like , “I should have been doing that all along.”

 

One such recent problem that still has me muttering has led me to create a nifty little utility that you may be interested in adding to your tool kit? Let me tell you a bit about how I attacked this problem and then I’ll show you the utility.

 

I have a database that is heavily (excessively?) involved in logging everything it does and it tends to do quite a lot. All of the sudden this database that had been running 200 tidy transactions a second and had been providing sub 30 ms query response times was stopped dead in the water. Queries were timing out after the 20 second or 1 minute or 2 minute connection time out period had elapsed and SQL Agent jobs that had been running in a few seconds were not able to complete after many minutes. CPU on the server remained at 15% or less. Disk queuing and I/O were well below normal. There were no messages in the SQL ErrorLog or the server Event Logs that even suggested a problem. The first time it happened was predictably the busiest time of the day. We struggled with it for only a couple of minutes, noticed the logging activity seemed to be where the processes were silently spinning, and in the effort to try anything to get things running tried truncating one of the logging tables whose data we could live without only because it seemed to be involved in much of whatever the problem was. Immediately the system was back to normal. We were baffled but relieved... until it happened again three weeks later, again during the busiest time of the day and with no obvious symptoms. We were even quicker to truncate a table this time knowing that it would keep the system running, but still without a clue what was happening. We had a work around but we now knew for sure we needed a better long term solution. At this point I opened a support incident with Microssoft Product Support Services (PSS).  This next time the problem returned was a couple of days later, PSS wanted a trace of what was happening when the problem was rearing its ugly face. This time it was early in the morning on the weekend so while collecting a few minutes of Profiler output for PSS I had the luxury of taking another minute or two to look around before we truncated a logging table. One of the things I did was take a quick peak at the “General” tab of the task pad in Enterprise Manager (EM) for that database. Low and behold I saw that the free space of the data device was at 0%. I quickly expanded the device and the problem immediately cleared.

 

In later talking with PSS it seems there is a known problem where auto grow fails due to the timeout of the connection that needs more space to finish the DML task (see the gory details in the Microsoft KnowledgeBase document Q305635). What’s worse, when this problem occurs there is no explicit message to tell the user a failure has occurred. Obviously when we had thousands of inserts a minute all needing more space to add a tuple and a some of those connections time out in 20 seconds it doesn’t take long for the SQL Server to become completely consumed by the repeated attempts to try to grow and fail, try to grow and fail,… Before too long at all, the Thing has a SQL Server frozen in it's tracks.

 

Unfortunately, as we learned from PSS, there is not a current fix for this problem. The currently recommended work around is to not depend on Auto Grow. (I even agreed to work with PSS to try to help come up with more information so they can better document and then resolve their bug. As it unfolded, they didn't need my help at all. They were fully aware of the problem and what needed to be done to correct it but the fix was "to high of a business risk". Hmmmm...)

 

I don’t know of anything in the provided SQL Client tool set that will tell me how much free space a database device has other than the EM task pad. If I had time to load the task pad view for all my databases every day that would be what I would do, but I’d rather find a more efficient method. Loading a task pad takes several seconds for each database and I have lots of database. I wanted to find a better way to keep track of when a device is getting low on free space and I wanted to automate the process of being informed of the need to expand a database. In the back of my mind I even wanted to be able to expand a device before I hit the point of critical failure programmatically. I knew that EM must be using Database Management Objects (DMO) to get the info to build the task pad. I went looking in Books on Line and sure enough found just what I needed to build my own free space monitor and to eventually build my own Auto Grow engine once I gain confidence in the monitoring tool’s accuracy and reliability.

 

I decided to use a Data Transformation Services (DTS) ActiveX task to loop through all the data devices in all database to extract the DMO info I needed and, using ADO and a stored procedure, insert that info to a table. Once I had all the data in a table I could easily query the table, again using that same ADO connection and a stored procedure that will send me and email if I had a device that needed to be expanded soon. I would also be building a history of the growth trends to aid in making more accurate decisions in the future. Later I will enhance the ActiveX script to use the DMO connection to actually expand a device when necessary. For now I just want it to tell me when free space is getting low.

 

In the next article, Free Space Monitoring Tool I will describe the tool and provide all the code necessary to build a monitoring tool that can used to get you out of the business of depending on SQL Server’s Auto Grow. I know it will be a long time before I have enough confidence to depend on the Thing to take care of the space needs of my dynamically changing databases.

 

Bill