The SQL Thing that Wouldn’t Grow
By Bill Wunder
There is always a new and
interesting problem in the dynamically changing software shop.
Hardware upgrades, service packs, new releases, hot fixes,
performance issues, new requirements, and on and on.
Occasionally something doesn’t work as expected, or something
stops working the way it always had. Usually we just have to
take the challenges in stride, working through problems when
possible and working around them when we must. Truth is, my
favorite part of being a DBA in the trenches is just this sort
of problem solving. I like it because it is necessary to gain
some understanding of the true nature of a thing (and software
is nothing if not a Thing!) in order to understand how
to change its behavior. I have to admit that sometimes I feel
like a player in a B-grade horror film from the 1950s when
I’m in the throes of a particular problem. Often the
investigation necessary to understand what is going on is 99% of
the task and the solution is a simple Thing. In the
end I frequently move on to the next task once the current
problem is solved, silently muttering something like, “I should
have been doing that all along.”
One such recent problem that still
has me muttering has led me to create a nifty little utility
that you may be interested in adding to your toolkit. Let me
tell you a bit about how I attacked this problem and then I’ll
show you the utility.
I have a database that is heavily
(excessively?) involved in logging everything it does and it
tends to do quite a lot. All of a sudden this database that
had been running 200 tidy transactions a second and had been
providing sub 30 ms query response times was stopped dead in the
water. Queries were timing out after the 20-second, 1-minute,
or 2-minute connection timeout period had elapsed, and SQL Agent
jobs that had been running in a few seconds were not able to
complete after many minutes. CPU on the server remained at 15%
or less. Disk queuing and I/O were well below normal. There were
no messages in the SQL ErrorLog or the server Event Logs that
even suggested a problem. The first time it happened was
predictably the busiest time of the day. We struggled with it
for only a couple of minutes, noticed that the logging activity
seemed to be where the processes were silently spinning, and, in
an effort to try anything to get things running, truncated one
of the logging tables whose data we could live without, simply
because it seemed to be involved in much of whatever the problem
was. Immediately the system was back to
normal. We were baffled but relieved... until it happened again
three weeks later, again during the busiest time of the day and
with no obvious symptoms. We were even quicker to truncate a
table this time knowing that it would keep the system running,
but still without a clue what was happening. We had a
workaround, but we now knew for sure we needed a better
long-term solution. At this point I opened a support incident
with Microsoft Product Support Services (PSS).
The next time the problem returned was a couple of days later.
PSS wanted a trace of what was happening while the problem was
rearing its ugly head. This time it was early in the morning on
the weekend, so while collecting a few minutes of Profiler output
for PSS I had the luxury of taking another minute or two to look
around before we truncated a logging table. One of the things I
did was take a quick peek at the “General” tab of the task pad
in Enterprise Manager (EM) for that database. Lo and behold, I saw that the free
space of the data device was at 0%. I quickly expanded the
device and the problem immediately cleared.
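By way of illustration, here is roughly what that check and manual expansion look like in plain T-SQL rather than through the EM task pad. This is only a sketch: the database name MyAppDb, the logical file name MyAppDb_Data, and the new size are hypothetical stand-ins for your own values.

    -- Check the size and free space of each file in a database.
    -- sysfiles reports sizes in 8 KB pages, so divide by 128 for MB.
    USE MyAppDb
    SELECT  name,
            size / 128                                     AS size_mb,
            FILEPROPERTY(name, 'SpaceUsed') / 128          AS used_mb,
            (size - FILEPROPERTY(name, 'SpaceUsed')) / 128 AS free_mb
    FROM    sysfiles
    GO

    -- Expand the data file by hand instead of waiting on auto grow.
    -- The new SIZE must be larger than the file's current size.
    ALTER DATABASE MyAppDb
    MODIFY FILE (NAME = 'MyAppDb_Data', SIZE = 2048MB)
    GO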
In later talking with PSS, it seems
there is a known problem where auto grow fails due to the
timeout of the connection that needs more space to finish the
DML task (see the gory details in Microsoft Knowledge Base
article Q305635). What’s worse, when
this problem occurs there is no explicit message to tell the
user a failure has occurred. Obviously, when we had thousands of
inserts a minute all needing more space to add a tuple, and
some of those connections timing out in 20 seconds, it doesn’t take
long for the SQL Server to become completely consumed by the
repeated attempts to try to grow and fail, try to grow and
fail… Before too long at all, the Thing has a SQL
Server frozen in its tracks.
Unfortunately, as we learned from
PSS, there is no current fix for this problem. The currently
recommended workaround is to not depend on Auto Grow. (I even
agreed to work with PSS to try to help come up with more
information so they could better document and then resolve the
bug. As it unfolded, they didn't need my help at all. They were
fully aware of the problem and what needed to be done to correct
it, but the fix was "too high of a business risk". Hmmmm...)
I don’t know of anything in the
provided SQL client tool set that will tell me how much free
space a database device has other than the EM task pad. If I
had time to load the task pad view for all my databases every
day, that is what I would do, but I’d rather find a more
efficient method. Loading a task pad takes several seconds for
each database, and I have lots of databases. I wanted to find a
better way to keep track of when a device is getting low on free
space, and I wanted to automate the process of being informed of
the need to expand a database. In the back of my mind I even
wanted to be able to programmatically expand a device before I
hit the point of critical failure. I knew that EM must be using
SQL Distributed Management Objects (SQL-DMO) to get the info to build the task pad. I went looking in
Books Online and sure enough found just what I needed to build
my own free space monitor and to eventually build my own Auto
Grow engine once I gain confidence in the monitoring tool’s
accuracy and reliability.
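For the impatient, a rough approximation of what the task pad computes can be had today in plain T-SQL with the undocumented sp_MSforeachdb procedure. This is just a sketch of the idea, not the monitoring tool itself:

    -- Report size and free space for the data files of every database.
    -- Status bit 64 marks log files in sysfiles, so those are skipped.
    EXEC sp_MSforeachdb '
    USE [?]
    SELECT  DB_NAME()                                        AS database_name,
            name                                             AS file_name,
            size / 128                                       AS size_mb,
            (size - FILEPROPERTY(name, ''SpaceUsed'')) / 128 AS free_mb
    FROM    sysfiles
    WHERE   status & 64 = 0'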
I decided to use a Data
Transformation Services (DTS) ActiveX task
to loop through all the data devices in all databases to extract
the DMO info I needed and, using ADO and a stored procedure,
insert that info into a table. Once I had all the data in a table
I could easily query the table, again using that same ADO
connection and a stored procedure, and send myself an email if
I had a device that needed to be expanded soon. I would also be
building a history of the growth trends to aid in making more
accurate decisions in the future. Later I will enhance the
ActiveX script to use the DMO connection to actually expand a
device when necessary. For now I just want it to tell me when
free space is getting low.
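To make the design concrete, here is a minimal sketch of what the history table and the alerting stored procedure might look like. All of the names and the threshold are hypothetical placeholders (the real code comes in the next article), and the mail step assumes SQL Mail is configured so xp_sendmail works on the server.

    -- History table the ActiveX task would insert a row into per file
    CREATE TABLE dbo.DeviceFreeSpaceHistory
       (DatabaseName sysname  NOT NULL,
        FileName     sysname  NOT NULL,
        SizeMB       int      NOT NULL,
        FreeMB       int      NOT NULL,
        SampleTime   datetime NOT NULL DEFAULT GETDATE())
    GO

    -- Send a mail if any device sampled in the last hour is low on space
    CREATE PROCEDURE dbo.AlertLowFreeSpace
        @FreeMBFloor int = 100   -- alert threshold: pick to suit your shop
    AS
    IF EXISTS (SELECT *
               FROM   dbo.DeviceFreeSpaceHistory
               WHERE  SampleTime > DATEADD(hh, -1, GETDATE())
                 AND  FreeMB < @FreeMBFloor)
        EXEC master.dbo.xp_sendmail
             @recipients = 'dba@example.com',   -- hypothetical address
             @subject    = 'SQL Server device low on free space',
             @message    = 'A database device is nearly full. Check DeviceFreeSpaceHistory and expand the file before auto grow is needed.'
    GO

The DTS ActiveX task would feed the table through an insert procedure, and a scheduled SQL Agent job could run the alert procedure right after each collection pass.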
In the next article,
Free Space Monitoring Tool, I will describe
the tool and provide all the code necessary to build a
monitoring tool that can be used to get you out of the business of
depending on SQL Server’s Auto Grow. I know it will be a long
time before I have enough confidence to depend on the Thing to
take care of the space needs of my dynamically changing
databases.
Bill