Starting this Monday some of my team members are working on a Disaster Recovery activity. This involves restoring some of the biggest Data Warehouse databases onto the DR server.
Today they were supposed to restore a database which is ~3 TB in size from disk backups. This instance is running SQL Server 2000. As usual they started the restore script using a SQL Agent job. The job output file showed the progress steadily and finally after 12 hours the output showed 100 percent completed. The output file looked something like this.
100 percent restored. [SQLSTATE 01000] Processed 53369168 pages for database 'dbname', file 'dbname_Data' on file 1. [SQLSTATE 01000] Processed 2945917 pages for database 'dbname', file 'dbname_Log' on file 1. [SQLSTATE 01000]
But the database restore is not complete until it prints RESTORE DATABASE successfully processed xxx pages in xxx seconds in the logs. When the DBA announced that the log had 100 percent completed in it, the Application Team were keen to do their part of testing and be done with the DR activity.
DBA told everyone on the bridge that it will take some “more” time for the database to be restored successfully. The obvious question was “How long?”. Since this instance is running SQL Server 2000, there is no way to estimate the time it takes to restore the database. If the instance was running SQL Server 2005 or above, it can be easily be tracked as I posted here. Three hours past, the database was still restoring and the nagging from the Application Team started getting more serious. The DBA on call was at the mercy of sysprocesses to check the status. All he could see was a restore session in which the waitresource column was changing. This confirmed that the Restore process was not hung but it was actually doing something.
What exactly the RESTORE session is doing? The RESTORE process has three phases.
- Data Copy Phase
- Redo phase
- Undo phase
In the Data Copy phase, all the data from the backup set is moved to the database files. Once this phase is complete, SQL Server reports that the restore is 100 percent complete and Processed xxx pages for database ‘dbname’, file ‘datafiles’ on file x.
In the Redo phase, all the committed transactions present in the transaction log when the database was being backed up are rolled forward.
In the Undo phase, all the uncommitted transactions in the transaction log while the database was being backed up are rolled back. If the database is being restored with NORECOVERY, the Undo phase is skipped. Unfortunately SQL Server does not print the progress during the Redo and Undo phases as it does in the Data Copy phase. This article on MSDN has good information regarding the Restore process.
With this information it is easy to understand why the restore was not showing any progress. Restore had progressed beyond the Data Copy phase and it was in the Redo/Undo phase. The database being restored was in simple recovery mode and the transaction log of the database inside the backup file was showing a size of 30 GB. The database backup had taken 12 hours to complete. Understandably the Redo/Undo phase had to roll forward/roll back the transactions that got logged in the transaction log during those 12 hours.
The DBA knew this and tried his best to convince the guys on the call. But there was no evidence to prove DBA’s argument. Finally the database restored completely after ~16 hours (12 hours of data copy + 4 hours of redo/undo). Everyone lived happily ever after 😉