Tag Archives: hardware

Server is running in PAE mode without /PAE switch?

As I was reading one article related to configuring memory on SQL Server instances, I recollected a question asked by one of my team members sometime back. His doubt was a genuine one.

We have 4 Data warehouse servers which have quite a lot of CPUs and huge amount of memory. Those servers run on 32 bit Windows 2003 server operating system and have ~32 GB of RAM. For SQL Server to recognize memory greater than 4 GB, the /PAE switch needs to be enabled in boot.ini file in 32 bit systems. After which AWE will be enabled in SQL Server. But on these servers SQL Server was recognizing all the RAM available but there was no /PAE switch in boot.ini. He was curious to know how SQL Server was recognizing 32 GB of RAM without the /PAE switch.

Even though I had told him that I would check the settings on the server, that had slipped out of my mind. Today I had logged on to one of those servers and I was able to get an answer for his question.

There are two Windows/Hardware level settings even if one of them is enabled /PAE switch need not be enabled in boot.ini. Operating system will automatically enable PAE mode. They are Data Execution Prevention (DEP) and Hot Add Memory devices.

There are two modes of Data Execution Prevention. Hardware enforced DEP and Software enforced DEP. If the Processor supports DEP the Hardware enforced DEP is enabled and /PAE switch need not be enabled. In Software enabled DEP the switch /noexecute=[policy_level] needs to be enabled in boot.ini. The details of DEP and the policy_level switch parameters are described in this article.

If the server supports Hot Add Memory feature, PAE mode is automatically enabled. Hence there is no need of /PAE switch in boot.ini.

Now coming back to the server in question. As mentioned earlier there was no /PAE switch in boot.ini. Here is how the boot.ini looks on that server.

As evident from the screenshot, Software DEP is enabled on that server using the /noexecute=optout switch. This is the reason why the Server is running in PAE mode even without the /PAE switch. Once PAE mode is enabled, AWE needs to be configured in SQL Server so that SQL Server recognzies all the RAM installed on the server.

Backup failed due to Time-out error while waiting for buffer latch

Today morning a differential backup job failed on one of the instances running SQL Server 2005. The error message with which it failed is as below.

Msg 845, Level 17, State 1, Server [ServerName], Line 2
Time-out occurred while waiting for buffer latch type 4 for page
(9:8677305), database ID 7.
Msg 3013, Level 16, State 1, Server [ServerName], Line 2
BACKUP DATABASE is terminating abnormally.

The first thing I did was to check which processes are currently active on that database.

SELECT * FROM sys.dm_exec_requests WHERE database_id = 7

The query gave this output.

Look at what session id 109 is doing. DBCC SHRINKFILE! There was a request from the Application team sometime back, that we keep shrinking the primary file group until there is no free space left. We had informed them of the performance issue that this process will have on the instance. But they wanted to reclaim the space on that filegroup, since they had moved the bigger tables to a new filegroup. After all, every MB of SAN storage is $. We had setup a job which runs DBCC SHRINKFILE on each of the data files in the Primary File group.

How is SHRINKFILE affecting the backup? As per this KB article when the instance is under heavy stress load, users will experience Error 844 and Error 845. It also mentions that Frequent database AutoGrow or AutoShrink operations can also be one of the causes. In our case AutoGrow property was properly set. But the “shrink job” was causing more damage even though the AutoShrink option was not enabled.  In order to check the number of locks held on the database, I executed the below query.

select request_mode, count(*) as Total
from sys.dm_tran_locks
where resource_database_id= 7
group by request_mode
order by count(*) desc

The output was

The number of X (exclusive) locks held on the database was 107. Since we were sure of the reason for the backup failure, we went ahead and stopped the “Shrink Job”. I executed the same query after stopping the job and the number of X locks dropped to just 1!

As expected the differential backup job completed without any issues after stopping the shrink job. This is one of the many negative impacts of shrinking a database.

Since this database is shrinking at a rate of 20 GB/day, my team will have to deal with these issues at least for the next one year!

Find number of Physical CPUs using a query

Few days ago I had explained how to find the number of Physical CPUs installed using third party tools. Just a while ago, I figured out that using sys.dm_os_sys_info DMV that information can be obtained far more easily. I executed the following query on a server having 3 dual-core CPUs on it.

SELECT
cpu_count,
hyperthread_ratio,
cpu_count /hyperthread_ratio as [Physical CPUs]
FROM sys.dm_os_sys_info

As expected the DMV reported the information accurately and here is the output. Very useful.

However even this DMV does not detect quad-core CPUs. I tried it one of the servers that I support. As of now, CPU-Z holds the upper hand in detecting the Physical CPUs.

Find the number of Physical CPUs

While troubleshooting high CPU usage on one of the instances, someone pointed out that the “Max Degree of Parallelism” option was set correctly or not. It was set correctly to 4 and it did not contribute to the high CPU usage. Then the discussion went on to find out How many Physical Processors does the server have? Even though this question did not have much importance in this context, it was a good one. Because this information is not easy to get without using 3rd party tools.

The number of CPUs the Task Manager, Windows Device Manager, Perfmon display are the number of Logical CPUs. The Processors now a days are Dual Core/Hyper-threaded. The number of CPUs that SQL Server detects are all Logical Processors.

Here is how I find out the number of Physical CPUs installed on a given server. The server in the screenshot has 6 Logical Processors.

One of the options to find out the number of Physical CPUs installed is to use Process Explorer. In Process Explorer clicking on View –> System Information option would display the System Information in General. Clicking on Show one graph per CPU option would display the information as in this screenshot.

As highlighted in this screenshot, the Logical CPUs are grouped together according to the number of Physical CPUs. We have 3 such groups which in turn means that the Server has 3 Dual Core CPUs, hence the SQL Server is recognizing 6 Logical CPUs.

But on some servers I have seen that Process Explorer does not display the information correctly. Like in the below example where Windows Recognizes 8 Logical CPUs and Process Explorer also treats each one of those CPUs to be different Physical CPUs.

For sure I know that this server does not have 8 Physical CPUs installed on it. Welcome CPU-Z, the free tool which reports CPU related information far more accurately.

As in this screenshot, the number of processors listed is 2 and those processors are hyper-threaded to run in 4 threads. Thus the number of logical CPUs is arrived as detected physical CPU * number of threads. In this case the number of Logical CPUs would be 2 * 4 = 8.

The information related to the Physical CPUs can be retrieved in many ways from BIOS, through a custom application (like the one of Slava Oks) and so on. But I like to use freely available tools like CPU-Z which do the job for me with ease.

Get Table name from data file offset address

I am having really tough time on two of the production servers in our environment. The storage related issues on these servers are causing problems for the databases. There have been many changes to the environment from storage perspective like firmware upgrade, storage migration to name a few. But the issues seems to be going from bad to worse.

Last week, on the instance which hosts the biggest data warehouse database of 10 TB in size we started getting messages related to disk subsystem. The worst part was that the storage demon had caught the 10 TB database itself! The error message read

Error: 823, Severity: 24, State: 2
I/O error (bad page ID) detected during read
at offset 0x00005a0b020000 in file 'M:Data************_Data.NDF'

This error message would appear in the SQL Server Error Log at least 10 times a day. No corresponding error message in the System Event Log from the Disk Subsystems. As usual the storage team informed that “all looks good from Storage perspective”.  Even a single occurrence of this message is enough send shivers down the spines of a DBA. Recurring messages meant that the DBA had to act quick and fast.

The first that a DBA a would do is to run DBCC CHECKDB against the database. As mentioned earlier, this database is 10 TB in size and running CHECKDB against this database would take at least 48 hours (from the recent experience on this database). Since the customer does not want his “load” jobs to be held for such a long duration, the other option was to run DBCC CHECKFILEGROUP. The file group on which this datafile was residing constituted 90% of the total database size! Hard luck.

The last option was to get to the table which was affected by this issue. But the error message did not have the Page ID, so that we could run the life saving DBCC PAGE command to get the table name. All it had was the hexadecimal offset id 0x00005a0b020000 for that file. Is there an option to get the Page ID from the offset?

The official documentation gets the due respect only at the time of crisis. In this case this KB article on 823 error was more than useful. With the help of this article, I was able to get to the affected Page ID. Here is how.

Launched Calc.exe and divied the hexadecimal value 0x00005a0b020000 by 8192

Now that I got a value of B1E749 all I need to do was to convert it to Decimal by clicking on the Dec radio button. The decimal equivalent of this was 11659081.

Now the Page ID is ready to be fired using the below query. The database ID is 8 and the File ID of the file in question is 37.

DBCC TRACEON (3604)
go
DBCC PAGE (8,37,11659081,3)
go

This query gave the m_objid, i.e. the ID of the object which resides on this page.

The object id is also ready now and the table name is just a F5 away!

SELECT OBJECT_NAME(1711345161)

Now the task is simpler and quicker. A Change Control to run DBCC CHECKTABLE against this table is awaiting customer approval now.

The hole has already been dug up in this database, the output of CHECKTABLE will confirm how deep it is. Keeping my fingers crossed!

Implicit Transactions and the performance issue

Last week, users of one of the application started complaining of performance issue. As always, the application team will reach out to the DBA Team to check the Database Server’s performance.

The first thing that we observed was that there was lot of blocked sessions. All the sessions from that application were getting blocked and there was a lead blocker. The worst part was that the lead blocker would change every minute !

Apart from this, the Server performance was normal. The CPU, Memory usage was minimal and there were not noticeable errors in any of the logs.

It was very clear that the blocking sessions are degrading the performance of the applications. As a first step, all the tables on the database were re-indexed, followed by updating the statistics. Now the DBA on call proudly announced “Please check the application now” assuming the magic wand had worked. It was not to be. Same issue and the blocking sessions continued to pile up. As mentioned earlier, the lead blocker would change every minute.

However, there was a  query on a particular table which would be the lead blocker many of the times. When the DBA was about relax that he had found out the root cause, it was found out that this table had only 23 rows ! Moreover, all the tables’ statistics were up to date and there was hardly any fragmentation.

Now we assumed something from the application side had changed. No, nothing was changed on the application. It was the same code that they had been using for years.

Now the ball was again the DBA’s court. While checking for blocked sessions, one thing was noticable. The status for all the blocked sessions was Awaiting Command. A sessions with Awaiting Command status is a client connection with no query to the SQL Server. Strange! No query to execute still getting blocked on the instance?

This was a kind of confirmation that nothing was wrong from SQL Server perspective. However, a Profiler trace was run for the duration when the sessions are getting blocked. This was the key step that helped us to rule out any SQL Server issue. All the sessions that were getting blocked and the sessions which blocked other had one thing in common. All sessions were Implicit Transactions. The query had the below line as the first thing.

set implicit_transactions on

The definition of Implicit Transaction on MSDN is as follows.

When a connection is operating in implicit transaction mode, the instance of the SQL Server Database Engine automatically starts a new transaction after the current transaction is committed or rolled back. You do nothing to delineate the start of a transaction; you only commit or roll back each transaction. Implicit transaction mode generates a continuous chain of transactions.

If there is an INSERT statement running as an Implicit Transaction, then a separate transaction is created for that. If it was an explicit transaction, several INSERTs would be wrapped together in one transaction. Also for an Implicit Transaction, SQL Server needs to write the transaction log to disk every time . The session running the Implicit Transaction will remain open until a COMMIT/ROLLBACK transaction command is issued.

Since our case, each of the sessions getting blocked were Implicit Transactions. For example one session with Implicit Transactions on, selects data from a table and holds the lock. Until this session receives the commit/rollback transaction command from the client, the session will remain open on the instance. Since this transaction is holding the locks on that table, the subsequent sessions requesting data from that table will be blocked.

By now, it was confirmed that there is delay between the client and database server communication. Hence, the network team was called in to check the performance. They finally figured out that one of the switches had gone bad. The switch was replaced and the blocking sessions disappeared !

For a change, the DBA team did not have to the Root Cause Analysis.

Storage Fundamentals – RAID

Starting today, I am attending a two day training on Storage Networking. With the word storage, the first word that comes to mind is RAID. Here is some of the basic information about Redundant Array of Independent Disks (RAID) and various types of RAID.

RAID 0 – Non redundant striping

  • RAID 0 implements a striped disk array. If the host passes the information “ABC” to a RAID 0 which consists of 3 physical disks A, B and C are written to each of the disks in a round robin fashion.
  • Important point to be noted here is that the disk size should be of equal, since the information is striped across each of the disks. If we have 3 disks of 100 GB, 100 GB, 50 GB size, then the total size of the logical disk (RAID 0) is 50+50+50 = 150 GB.
  • RAID 0 is not a true RAID since it is not fault tolerant. If one of the disks fails, then the logical disk will also fail. But RAID 0 provides the best performance.

RAID 1 – Mirroring

  • RAID 1 also known as Mirroring, provides a shadow copy of one disk on the other disk
  • This provides the best fault tolerance, since the mirror copy of the original disk is available, in case one of the disk fails.
  • Downside to RAID 1 is that the other disk involved in the RAID is wasted (considering cost)

RAID 3 – Disk Array with bit-interleaved data

  • In RAID 3, data is divided into octets and striped on the disks
  • For example, if we have 4 disks one of the disks contains the parity information. i.e. the result of XOR operation of the 3 disks is saved in this disk
  • If one of the disks fails, fault tolerance is provided by the parity information. It is like the 4th disk has the information 1+2+3=4. Imagine disk 3 fails, so the formula now looks like 1+2+x=4. The value of “x” in this expression can easily be evaluated.
  • If more than one disk fails, all data in the entire array is lost

RAID 4 – Disk array with block interleaved data

  • This is very similar to RAID 3, works on BLOCK basis instead of octet basis. Block is bigger than octet

RAID 5 – Disk array with block level distributed parity

  • This is the most commonly used RAID level
  • This also works on parity logic but the parity information is spread across all disks, unlike RAID 3 where the parity information is stored on only one dedicated disk
  • At least 3 disks are needed to constitute a RAID 5 array
  • Data reads are very fast which is similar to RAID 0
  • But data Writes are comparatively slow, since the parity information and disk on which parity needs to be stored is calculated

RAID 10 – RAID 1 + RAID 0

  • RAID 10 uses both Striping (RAID 0) and Mirroring (RAID 1)
  • Mirrored volume of Striped Disks is created
  • At least 4 disks are required to implement RAID 10
  • Multiple RAID Groups (RG) are created. If there are 4 disks, 2 RGs are created. Hence RAID 10 can survive multiple drive failures (one per RAID 1 RG)

More detailed information on RAID with supporting diagrams can be found at Wikipedia.