I am having really tough time on two of the production servers in our environment. The storage related issues on these servers are causing problems for the databases. There have been many changes to the environment from storage perspective like firmware upgrade, storage migration to name a few. But the issues seems to be going from bad to worse.
Last week, on the instance which hosts the biggest data warehouse database of 10 TB in size we started getting messages related to disk subsystem. The worst part was that the storage demon had caught the 10 TB database itself! The error message read
Error: 823, Severity: 24, State: 2 I/O error (bad page ID) detected during read at offset 0x00005a0b020000 in file 'M:Data************_Data.NDF'
This error message would appear in the SQL Server Error Log at least 10 times a day. No corresponding error message in the System Event Log from the Disk Subsystems. As usual the storage team informed that “all looks good from Storage perspective”. Even a single occurrence of this message is enough send shivers down the spines of a DBA. Recurring messages meant that the DBA had to act quick and fast.
The first that a DBA a would do is to run DBCC CHECKDB against the database. As mentioned earlier, this database is 10 TB in size and running CHECKDB against this database would take at least 48 hours (from the recent experience on this database). Since the customer does not want his “load” jobs to be held for such a long duration, the other option was to run DBCC CHECKFILEGROUP. The file group on which this datafile was residing constituted 90% of the total database size! Hard luck.
The last option was to get to the table which was affected by this issue. But the error message did not have the Page ID, so that we could run the life saving DBCC PAGE command to get the table name. All it had was the hexadecimal offset id 0x00005a0b020000 for that file. Is there an option to get the Page ID from the offset?
The official documentation gets the due respect only at the time of crisis. In this case this KB article on 823 error was more than useful. With the help of this article, I was able to get to the affected Page ID. Here is how.
Launched Calc.exe and divied the hexadecimal value 0x00005a0b020000 by 8192
Now that I got a value of B1E749 all I need to do was to convert it to Decimal by clicking on the Dec radio button. The decimal equivalent of this was 11659081.
Now the Page ID is ready to be fired using the below query. The database ID is 8 and the File ID of the file in question is 37.
DBCC TRACEON (3604) go DBCC PAGE (8,37,11659081,3) go
This query gave the m_objid, i.e. the ID of the object which resides on this page.
The object id is also ready now and the table name is just a F5 away!
Now the task is simpler and quicker. A Change Control to run DBCC CHECKTABLE against this table is awaiting customer approval now.
The hole has already been dug up in this database, the output of CHECKTABLE will confirm how deep it is. Keeping my fingers crossed!