Sunday afternoon. Heading back home after spending good time with myself. Cell phone starts ringing. Its from Office!
<Flashback> I remembered reading a text message early in the morning that there was a High Severity incident involving the DBA team, but all was well with the databases.</Flashback>
The DBA on the other side of the call said that the incident was still not resolved and only one application was having connectivity issues to the database. The instance hosts multiple instances and close to 270 databases. Only one application was not able to connect to the database server. Over the phone I asked my team to check the following and everything was fine.
- Database status – ONLINE
- Did the application id had privileges? – Yes, DBO on the database
- Did SQL Server port change? – No, it was running on the listening on the correct port
- Did Error Log had any suspicious entries – No
- Does telnet from App Server to DB Server work? – Yes
- Is App Team able to connect to instance using Client Tools? – They cannot test it because the App server does not have Client Tools installed.
- Was the application service restarted after the server rebooted? – Yes. Despite that the application is unable to connect.
By the time the DBA did these checks, I had reached home. Again got a call that all the stakeholders are waiting for a fix from the DBA Team. I am now used to this behavior of the other support team in our environment. The moment someone whispers that it is a Database issue, every other team stops troubleshooting the issue from their end. Since the ball was in our court now, I logged on to the server to have a feel of the issue.
Everything was absolutely fine from the instance/database perspective. The Application connects to the instance using a particular windows id. This time it was not able to connect. When asked for the error messages with which the application was failing to connect, the support team gave a “custom” error message that the application generates. One more interesting thing was when the application could not connect, there was no message recorded in SQL Server Error Log. Usually when a login fails to connect, there will be a corresponding error logged in the Error Log. The absence of it clearly means that the Application ID is not even touching the instance.
Since telnet was working fine from App server to Database server, the connectivity issue was ruled out.
Now I had to look outside SQL Server to figure out the root cause. I opened Active Directory Users and Computers console (dsa.msc) and searched for the Application ID. Opened the properties of it and I could find the monster hidden behing the gray check box.
Wow! The Application ID was disabled in the Active Directory and you want it to connect to SQL Server? The DBA did not waste any time in announcing this on the Bridge and all the supporting teams woke up from sleep and started running around to enable the Windows Login. Most likely the Application Team had approved a Change Control to disable a Domain User and something got messed up somewhere. The post-mortem will be done later. For now they enabled the Domain Login and there it was connecting to the database!
I wish the System Administrators had spent one minute in checking the Domain Login, it would have saved hours for everyone involved in this incident.