Yesterday one of our critical production instances froze for a few seconds. This runs on SQL Server 2000 (SP4+). Even a little downtime would cause lot of turbulence on this instance. The error log did not have any glaring errors except for this one.
DESCRIPTION: Error: 17883, Severity: 1, State: 0 Process 282:0 (1bb8) UMS Context 0x0A37C010 appears to be non-yielding on Scheduler 6
Around the same time frame all applications including the SQL Server Agent timed out while trying to connect to the instance. The services did not restart and the instance was operational in a few seconds.
Some information about User Mode Scheduler (UMS) and SQL Server Operating System Scheduler (SOS) in SQL Server 2005 and above.
- UMS Schedulers are the logical schedulers used by SQL Server against which each worker thread of a SPID are mapped and all the I/O related operations are processed through these schedulers.
- The number of schedulers are usually equivalent to the number of CPUs reported by the Operating System. This will not be true if the affinity mask value is changed from the default in SQL Server 2000.
- An SPID is generated only after the scheduler receives a request. After the SPID has finished the work it releases the ownership of the scheduler and the worker threads associated with that SPID are treated as “yielded”.
- If the owner worker thread of the shceduler does not respond (yield) in 60 seconds, 17883 error (UMS cotext…non-yielding…) is rasied. Only the first occurance of this error will also generate the stack dump.
This article on TechNet has very useful information on 17883 and related errors.
Now coming back to the issue on our server, I had quickly I checked the following.
- CPU and Memory utilization on the instance were normal
- There were no blocking sessions found
- The number of active sessions was ~200 which was less than the average for this instance
- There were no Backup or Maintenance jobs running at that time
- No disk related errors in the System Event Log
- Network team also confirmed that there was no network glitch
Still the schedulers were treated as non-yielding. The output of DBCC SQLPERF(umsstats) also confirmed that the scheduler had exhausted all the worker thread allocation.
Statistic Value ------------------------ Scheduler ID 6.0 num users 15.0 num runnable 0.0 num workers 15.0 idle workers 15.0 work queued 0.0 cntxt switches 1.3390778E+7 cntxt switches(idle) 2.4731256E+7
Since all the symptoms mentioned in this article were ruled out, this article seemed to be more relevant to the issue that we were facing. However, there were no ADD DATAFILE or CREATE DATABASE commands were running on the instance. The only probable issue could have been the data files growing automatically.
In order to rule out this possibility, quickly scanned through the Auto Growth settings on all the databases on the instance. One of the database which was 25 GB in size had only one data file and the auto growth for data file was 100 MB. This database had a growth of ~500 MB per day in the last one month. This meant that this data file would grow 5 times a day and since this instance is SQL Server 2000 the database would be locked during the auto growth. This probably could have been one of the contributed factors for the error in the log.
Now the auto growth setting for this database has been changed to 1 GB which means that the data file would auto grow once in two days. With this change, all the possible contributing factors for 17883 error have been ruled out from Database perspective. After this I have not seen that error on the instance yet. Hopefully this will turn out to be a permanent fix!