No transaction is active message when accessing Linked Server

Last week I had worked on an issue related to Linked Server. The customer had migrated the SQL Server Instances to Virtual Servers. They had quite a few Linked Servers setup. After migration any Distributed Transaction like the one below across the linked servers would fail immediately.

begin distributed tran
select * from RemoteServer.DBName.dbo.TableName
commit tran

Error message

OLE DB provider "SQLNCLI11" for linked server "linkedservername" returned message "No transaction is active.".
Msg 7391, Level 16, State 2, Line 2
The operation could not be performed because OLE DB provider "SQLNCLI11" for linked server "linkedservername" was unable to begin a distributed transaction.

The first place I checked for problems was the Component Services (run –> dcomcnfg).

LocalDTCProperties

The options in the Local DTC Properties were correctly set as seen in this screenshot.

DTCProperties

Restarting the “Distributed Transaction Coordinator (MSDTC)” service didn’t help either.

The next step was to look for possible error messages in the Event Log. In the Application Event Log, the following error message was logged.

The local MS DTC detected that the MS DTC on ServerName has the same unique identity as the local MS DTC.
This means that the two MS DTC will not be able to communicate with each other.
This problem typically occurs if one of the systems were cloned using unsupported cloning tools.
MS DTC requires that the systems be cloned using supported cloning tools such as SYSPREP. Running 'msdtc -uninstall' and then 'msdtc -install' from the command prompt will fix the problem.
Note: Running 'msdtc -uninstall' will result in the system losing all MS DTC configuration information.

The error message in bold indicates that the Unique Identity for the MS DTC (SID) was same on both the local and the destination servers. How is this possible? While the new servers were being built they had syspreped servers. Hence the configuration of MS DTC was also propagated to all the servers where the same image was used.

Now that we knew the root cause, the resolution was pretty straight forward. Executed the following steps as explained in this article.

    • Opened the Command Prompt as an Administrator and executed “msdtc -uninstall”
    • Deleted the following registry keys (after exporting them as a precautionary measure)

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\MSDTC
HKEY_CLASSES_ROOT\CID
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\MSDTC

  • Executed “msdtc -install” command in the Command Prompt
  • Rebooted the server

After reboot, the linked server queries returned the expected results.

Application timing out due to excessive blocking on tempdb PFS

Problem

Applications Connecting to SQL Server timing out

Symptoms

  1. Excessive blocking with the wait resource being tempdb PFS page
  2. High CPU utilization on the server

Troubleshooting

sys.dm_exec_requests reported blocking and the head blocker looked like this

session_id ecid wait_type resource
---------- ----- --------------- --------------------
58 4 PAGELATCH_UP 2:7:32352
58 1 CXPACKET
58 5 CXPACKET
58 6 CXPACKET
58 3 CXPACKET
58 2 CXPACKET
58 0 CXPACKET exchangeEvent id=Pipe7dd014b80

The wait resource was always reported at tempdb PFS page (2:7:(8088*x)).

The head blocker and the blocking session were executing a SELECT against a XML document.

As explained here parsing a XML document would create work table in tempdb. In this situation the application was parsing a lot of XML documents. Hence the contention on tempdb was justified.

  • tempdb had 5 data files and the data files had initial size of 100 MB (with autogrowth of 10%). As per the recommendations here, increased the tempdb files to 8 and increased the initial size to a higher value to reduce the auto growth.
  • Also enabled trace flag 1118. This forces uniform extent allocations instead of mixed page allocations.
  • Enabled trace flag 1117. This will auto grow all the files in the filegroup whenever one of the files tries to auto grow.
  • Since this instance was running on SQL Server 2012, applied the latest Service Pack + CU as recommended in http://support.microsoft.com/kb/2964518

The tempdb contention got reduced to some extent still the CPU utilization was very high(above 85%). Because of this the application was still timing out.

  • Changed the Power Plan to “High Performance” as explained in http://support.microsoft.com/kb/2207548. This didn’t help much.
  • As you notice in the sys.dm_exec_requests output, all threads in the session are waiting on CXPACKET wait type for one thread to complete its work. Hence set the appropriate value for max degree of parallelism (3 in this case) as explained here.

CPU utilization came down from 85% + to 20-30%

Status

Mission accomplished!

SQL Server Cluster resource fails with “Data source name not found and no default driver specified” error

Last week I had worked on SQL Server 2008 R2 installation on a Windows Cluster. The installation would go fine until the installer tries to bring the SQL Server resource online. At that time it would fail with the following error message,

The cluster resource 'SQL Server' could not be brought online. Error: The group or resource is not in the correct state to perform the requested operation. (Exception from HRESULT: 0x8007139F)

On clicking OK, the installation would not fail but the SQL Server resource would continue to remain in failed state. Looking at System/Application Event Logs wasn’t of much help. From the SQL Server error logs it was evident that the Service was getting started successfully but it was getting stopped after the Cluster Manager failed to bring the resource online after 10 attempts.

The cluster log revealed that the resource was failing to come online with the following error

ERROR [IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified

Since SQL Server service is working fine, the problem is for sure outside SQL Server. The resolutions mentioned here and here, did not help.

As we already know, the Windows Failover Cluster manager uses SQSRVRES.dll to connect and manage the SQL Server Cluster resource. Hence looked at the properties of SQSRVRES.dll to check if there was any problems with it. This DLL was just fine but its version was 11.0.2100. This is strange since the version of SQL Server that we were trying to install was SQL Server 2008 R2 (10.50.xxxx). On checking customer informed that he had tried installing SQL Server 2012 earlier on this server but it was uninstalled for some reasons.

This clearly seemed to be the root cause. Hence copied the SQSRVRES.dll from a server which already had SQL Server 2008 R2 installed and overwrote it on the server in question. After this the installation succeeded and the SQL Sever cluster resource also came online just fine.

CDC job failing with “Unable to add entries to the Change Data Capture LSN time mapping table to reflect dml changes applied to the tracked tables” error

I got a chance work on CDC job failure issue recently. The customer had configured CDC on the database but it wasn’t working as expected. Hence he had disabled/enabled CDC multiple times on this database.

The “cdc.[DBName]_capture” job was failing with the following error

Message: 22858, Level 16, State 1
Unable to add entries to the Change Data Capture LSN time mapping table to reflect dml changes applied to the tracked tables. Refer to previous errors in the current session to identify the cause and correct any associated problems. [SQLSTATE 42000] (Error 22858) The statement has been terminated. [SQLSTATE 01000] (Error 3621). NOTE: The step was retried the requested number of times (10) without succeeding. The step failed.

Since the error message had reference to “Change Data Capture LSN time mapping table”, I looked into the cdc.lsn_time_mapping table. On a normal CDC configuration this table would have an entry for each transaction that was captured. But in this case, there was only one entry but the “tran_begin_time” and “tran_end_time” columns had a value which was a couple of days old. Also the tran_begin_lsn was printed as 0x00000000000000000000 and tran_id was 0x00. This isn’t normal.

To get more details about this error, I added a verbose log to the CDC capture job.

The verbose log printed the following message when the job was running.

session_id error_message
----------- ----------------------------------------------
5 Violation of PRIMARY KEY constraint 'lsn_time_mapping_clustered_idx'. Cannot insert duplicate key in object 'cdc.lsn_time_mapping'. The duplicate key value is (0x0008236700032c170001). CF8:0005
5 Unable to add entries to the Change Data Capture LSN time mapping table to reflect dml changes applied to the tracked tables.

From the verbose log it was evident that the CDC capture job was trying to insert a duplicate row the object cdc.lsn_time_mapping and failing.

The sys.databases DMV reported that the log_reuse_wait_desc was REPLICATION.

name database_id log_reuse_wait_desc
---------------- ----------- ---------------------
master 1 NOTHING
tempdb 2 NOTHING
model 3 LOG_BACKUP
msdb 4 NOTHING
CDC_DB_NAME 5 REPLICATION

This indicated that another CDC or replication was active on this database. Since there was no replication configured on this instance, it had to be a CDC job.

DBCC OPENTRAN reported that there was another CDC transaction was active on this database

Transaction information for database 'cdc_db_name'.
Replicated Transaction Information:
Oldest distributed LSN : (0:0:0)
Oldest non-distributed LSN : (533351:207666:1)
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Since customer had attempted enabling/disabling CDC multiple there was a possibility that a stale CDC transaction was still active on the database. To clear the article cache of CDC, executed sp_replflush against the database in question.

After restarting the job, it again failed but with a different error message now.

Another connection is already running 'sp_replcmds' for Change Data Capture in the current database.

DBCC OPENTRAN reported that there was a open transaction on the database but it was a user transaction which wasn’t running sp_replcmds.

Executed sp_replflush one more time and restarted the job again. This time it didn’t fail and also we started seeing rows getting inserted into all the CDC related tables.

Since the issue at hand was resolved, didn’t delve into the root cause. Most likely it was a stale CDC transaction which wasn’t cleaned up when CDC was disabled on the database.

Export the Event Logs without opening MMC

I recently worked on a database corruption issue. In order to troubleshoot this I had to collect all the Event Logs from the server. But when I tried to open Event Viewer, MMC would fail with the following error.

MMC could not create the snap-in.
Name: Event Viewer
CLSID: FX:{b05566ad-fe9c-7a4cbb7cb510}

For that matter any MMC snap-in on the server would fail with the same error. But getting the event logs was critical to get troubleshooting going. Thats when Wevtutil.exe which ships with the Windows Operating System came to the rescue.

Wevtutil.exe is by default located in the C:\windows\system32 folder.

I had to just execute the following command to export the System Event Log to C:\SystemLogBackup.evtx

wevtutil.exe epl System C:\SystemLogBackup.evtx

The “epl” parameter exports the event log specified (System in this case) to the destination file.
All I had to do is copy the exported file to my local desktop and double click it to open in the Event Viewer snap-in.

SavedLogs

This is indeed a good tool to have in a DBA’s armory.