T-SQL Tuesday #98 – Technical challenge

This month’s TSQL2sday is hosted by BlobEater (Arun Sirpal b|t) and the topic is about solving a technical challenge you faced in the past.

It’s another great subject for writing a blog post. Troubleshooting, issue solving, and root cause analysis are very exciting aspects of every DBA’s work. It will be a great experience to read all these posts describing various kinds of issues that DBAs from around the world have to deal with.

However, it’s not an easy task to write such post. I had plenty of ideas what I could describe but after a moment of thought, everything seemed to be obvious and not interesting at all. Finally, after few days of intensive thinking, I decided what to write about.

“Application is freezing”

Three years ago I was working with a customer who was reporting an application freeze issue. A few weeks before that they upgraded the application to the newer version. Since then, everything was fine for some time, but recently they started to observe an application freeze issue. Each time it occurred, the application was unresponsive for most of the users. It was very non-deterministic behavior. Freeze appeared once every few days at different times of day. Usually, they took few minutes and always were resolved without any manual intervention.

At the very beginning, the major part of investigation activities was focused mainly on the network and a Citrix farm. Unfortunately, few days of checks gave nothing. After eliminating every subsystem from application infrastructure one by one only the database server was left.

“It has to be the database!”

The database server was quite powerful with 160 logical processors and 512 GB of Memory. Attached SAN storage also was very good with 0-2 ms of latency. According to logs, the server was never utilized in 100% so hardware was not our main concern…

The biggest problem with the investigation of the database were irregular occurrences of the freeze. Every time it occurred and we were called by the customer, the issue was already gone before we connected. That was irritating. We were never able to check what is happening on the system during the impact.

At that time this particular database server was not monitored by any SQL Server monitoring tool, so we didn’t have many logs describing server activity. We reviewed all available logs, scheduled jobs, and other regular activities to verify if anything could be involved. Unfortunately, it didn’t give us any new findings.

Finally, we decided to deploy some monitoring mechanism which would collect data and give as good situation overview for our after-the-facts analysis.

Diagnostic data collection

We decided to use PSSdiag. It’s a free data collection utility developed as a side project by Microsoft. It’s mainly used by Microsoft Customer Support Service and Product Support Service teams. It’s not the simplest tool to configure, but it collects all required information, like PerfMon counters, SQL Trace, error logs, and diagnostic data about SQL Server state, health, and its performance.

It took us a while to create a proper configuration because the volume of data being collected was huge. We also had to test it very well, because we couldn’t affect customer’s SQL Server performance. Finally, we ended up with the configuration which in this particular environment was collecting 0,5GB of data per minute (on average). Data retention was driven by fixed maximum data size because we didn’t want to consume all free space on one of SQL Server drives. Because of it, the collected time frame was varying from 1 hour up to max 2 hours. That gave us about 30 to 90 minutes to collect this data after the customer called us.

When the customer called us a few days later informing about another freeze occurrence, we had bad luck. Data from issue time period was already gone – overwritten by newer ones. “Fortunately”, two days later there was one more occurrence, which this time was captured properly. I copied this data out from the server and was ready for analysis.

Captured Data Analysis

To analyze data captured by PSSdiag I used a tool called SQLNexus. It’s another tool created as a side project at Microsoft. It processes the output from PSSdiag into a database and runs reports on top of it.

The result was worth all the configuration and testing efforts. At first glance, I knew that we had captured what we needed.

As you can see in above screenshot, something wrong happened around 16:05 and it lasted until 16:20. During this time period, the number of Started and Completed Batches dropped significantly and the average duration of execution increased. A large amount of Attentions was also noticeable. When I opened another report with blocking information everything became clear.

There was blocking situation that lasted for almost 15 minutes and blocked a total 15314 other SQL Server sessions! This is madness…

Blocking Chain Detail report revealed additional important information. One of our small, utility applications used to manage user data (let’s call it App02) was identified as a head blocker (SPID 137). The application was not used very often by the customer, which explained why freezes occurred in so non-deterministic way. However, most interesting information was hidden somewhere else… It’s the Transaction Name: implicit_transaction. What…?

Let’s dig a little bit deeper and check what code App02 application executed.

Insert statement to the [dbo].[User] table was identified as the last query that our application executed. But why would an insert statement cause such behavior?

That bothered me a lot. I needed to know what happened in session 137 before this insert statement. Because I wasn’t able to find this kind of data in SQLNexus reports I decided to get it directly from the database. I had to spend some time to make myself familiar with the table structure created by ReadTrace tool from RML Utilities (used by SQLNexus during data import process). Finally, I developed bellow code snippet:

DECLARE @startTime datetime
DECLARE @endTime datetime
DECLARE @blockThresholdInMs int

SET @startTime = '2014-07-08 15:00'
SET @endTime = '2014-07-08 17:00'
SET @blockThresholdInMs = 20000 --20sec

-- Get blocking chain
SELECT * FROM dbo.tbl_BLOCKING_CHAINS 
WHERE blocking_start BETWEEN @startTime AND @endTime
AND max_wait_duration_ms > @blockThresholdInMs

DECLARE @blockingStartTime datetime
DECLARE @blockingEndTime datetime
DECLARE @blockingSession int

SELECT @blockingStartTime = blocking_start,
@blockingEndTime = blocking_end,
@blockingSession = head_blocker_session_id
FROM dbo.tbl_BLOCKING_CHAINS 
WHERE blocking_start BETWEEN @startTime AND @endTime
AND max_wait_duration_ms > @blockThresholdInMs

-- Get connection info for blocking chain header
SELECT * FROM ReadTrace.tblConnections
WHERE Session = @blockingSession
AND @blockingStartTime BETWEEN StartTime AND EndTime

DECLARE @connectionStartTime datetime
DECLARE @connectionEndTime datetime

SELECT @connectionStartTime = StartTime,
@connectionEndTime = EndTime 
FROM ReadTrace.tblConnections
WHERE Session = @blockingSession
AND @blockingStartTime BETWEEN StartTime AND EndTime

-- Get statements executed by head blocker
SELECT ub.NormText, ub.OrigText, * FROM ReadTrace.tblBatches b
left join ReadTrace.tblUniqueBatches ub on b.HashID = ub.HashID
WHERE session = @blockingSession
AND StartTime BETWEEN @connectionStartTime AND @connectionEndTime
order by b.BatchSeq

-- Get additional events (like Exception) for this session in this time frame
SELECT * FROM ReadTrace.tblInterestingEvents
WHERE Session = @blockingSession
AND StartTime BETWEEN @connectionStartTime AND @connectionEndTime
AND Error IS NOT NULL

Maybe it’s not the most beautiful piece of code I wrote, but it did its job. It displays data about head blocker session, executed queries, and other interesting events (like exceptions) that could have taken place. Here are the results (click to enlarge):

You can see that after connecting to the database App02 application executed SET IMPLICIT_TRANSACTIONS ON, then executed one more SELECT statement and that’s it. Nevertheless, there is one more thing. At 16:03:52 an error 2601 was reported. It is the time when INSERT statement to the [dbo].[User] table was executed. So what is the error 2601? Some of you probably already know it. I didn’t, so I had to look for it in the documentation. It turned out that it is unique index violation. OK, let’s summarise what we already know:

App02 connects to the database
It executes some statements
It SET IMPLICIT_TRANSACTIONS to ON
Then it tries to insert data to table, but fail because of unique index violations

So far so good, but why the transaction was not rolled back?

“The problem is the code”

I had to look for answers to this question elsewhere. Namely in the application source code. It took me a while, but I was able to find c++ code that was responsible for this. Its simplified version looks that way:

GetDatabase()->BeginTrans();

TRY
{
      result=Data::Put();
}
CATCH (CDBException, e)
{
      ShowMessageBox(e->m_strError);
      result=FALSE;
}                
END_CATCH

if (result)
{
      TRY
      {
          //Do sth else
      }
      CATCH (CDBException, e)
      {
          ShowMessageBox(e->m_strError);
          result=FALSE;
      }
      END_CATCH
}
if (result)
      GetDatabase()->CommitTrans();
else
{
      GetDatabase()->Rollback();
}

Do you already see it? I hope you do. In case whenever something goes wrong an error message in a dialog window is displayed to the application user. The opened transaction is not rolled back until the user closes this window.

And this is exactly what happened. Application user wanted to create a new user in the system. Unfortunately, due to some constraints, this process failed because new entry he or she created was not unique. An error message was displayed, but the user didn’t close it immediately. Instead of this, he or she started to browse application user manual for additional information and problem resolution.

After this discovery, new application patch was ready the next day and immediately deployed to the customer environment.

Summary

Guys, remember to educate application developers and teach them to do not implement any unnecessary steps in the database transaction scope. Especially if they require user action or input. The better idea is to not open database transaction from the application code, but that’s another story. Next advice is to make yourself familiar with tools I mention in this post. All of them are free and half of them is open source and available on Github. If you cannot afford professional SQL Server monitoring tool that may be your salvation.

PS: That whole story reminds me of one tweet I saw recently

"What if I change this index"
The problem is the code
"If I update these statistics"
The problem is the code
"How about this index"
The problem is the code
"Can I use an INCLUDE"
The problem is the code
"I can't change the code"
The problem is the code

— Grant Fritchey (@GFritchey) January 5, 2018