Within a few hours of my announcement of coming back to business yesterday, my cellphone started ringing. Call from Office on a Saturday evening could not mean nothing but an escalation call. Indeed it was one.
The customer had some important month-end jobs running but the progress of the job was too slow. There are a couple of genii in the Application DBA Team who are part of the Application Development team. For any performance issue on the SQL Server instance they have only two resolutions. Rebuild the Index. If it does not work Update the Statistics. If neither does not work, disagree with whatever the Production DBA Team suggests! For this issue as well, they suggested that the indexes on a table be rebuilt by the Production DBA Team. Halfheartedly, the DBA started the rebuild index job for that table.
This is one of the SQL Server 2000 instances which has the worst possible disk subsystem. In the last one year, the disk subsystem, firmware, faulty cables attributed to atleast 5 high severity incidents. The reindex job had already ran for 12 hours and it was still trying to rebuild the 15 indexes of a table 50 GB in size. No other process was blocking this job, but it was too slow. Customer was curious to know what was going on. Since the job had already run for 12 hours, killing the job would mean that the rollback would take same amount time or possibly more. DBA kept on telling that since this instance is SQL Server 2000, there is no way to check the percentage of completion of this reindex job. Had the DBA written the script to rebuild the indices index by index, at least we could get which index was currently being rebuilt. But customer was in no mood to buy that theory because business critical jobs were on hold because of this.
The physical_io value in sysprocesses was increasing and the waittype, waittime values were also constantly changing. The kpid column had also some value in it. Hence this process was not dead. The waitresource column was the one which helped us to find out what exactly the process was doing.
The waitresource value had a Page address in it. All I had to do was to check the object and index id from the page. Fired up this query in the query analyzer to get that information.
dbcc traceon(3604) go dbcc page(7,4,2044928) go
The output had the object and index details.
The object id was that of the table whose indexes were rebuilt and the index id 13 was the 13th index in sysindexes out of the 15 in that table. Huh. Just informed the customer that the reindex job had already completed rebuilding 12 indexes and it was working on index 13. “That is what we were waiting to hear from Mr. DBA” replied the customer. Now that they found someone to keep asking for updates, all night I was Geoffrey Boycott, giving running commentary of the status of the reindex job except for Geoff’s Yorkshire accent.
I also noticed that once the reindex of this index completed and before it started the reindex of the next index, the process would wait on a page which had Object ID 99 on it. select object_name(99) would return the name Allocation. Allocation is not a table and it cannot be found in sysobjects as well. Allocation refers to the GAM, SGAM and the related system tables. I had to explain this as well to the customer.
After completing the activities on Allocation, the process would move on to the next index and the waiting page address would have this index details in it. In my case, it moved on to Index 14. After an hour or so, all the indexes were rebuilt (it took 23 hours to complete!) and as usual got a “Good job team” from the customer and dropped off from the call.
The clock had struck 8 AM already and started my Sunday by dozing off!