Wednesday, December 21, 2011

DB2 : Data inconsistency caused by SQL code -911

Explanation for SQL code -911 : -

THE CURRENT UNIT OF WORK HAS BEEN ROLLED BACK DUE TO DEADLOCK OR TIMEOUT. REASON reason-code, TYPE OF RESOURCE resource-type, AND RESOURCE NAME resource-name

I worked in a mainframe application wherein they were converting all the VSAM files to DB2 tables. They coded the insert logic in CICS programs as shown below. Obviously they coded much more effective exception handling logic and the below code ONLY shows you the crux of what mistake they did in handling the SQL code -911.

MOVE 0 TO RETRY-COUNT.
INSERT-A-ROW-IN-THE-TABLE.
    ADD 1 TO RETRY-COUNT
    ISSUE SQL INSERT STATEMENT
    IF SQL CODE = -911
       IF RETRY-COUNT < 10
          GO TO INSERT-A-ROW-IN-THE-TABLE
       ELSE
          DISPLAY APPROPRIATE ERROR MESSAGES 
          AND ABORT THE PROGRAM
       END-IF
    END-IF.

As per the above code, when the program receives SQL code -911 it tries the insert opertion 10 times before it finally aborts. Obviously they did not realize that when db2 issues SQL code -911 for the first time, it backs out all the db2 updates including the VSAM file updates. So the other transaction (which caused the deadlock for the current transaction) completes successfully as DB2 release all the locks held by our “innocent current transaction”.

So, when the current program retries the insert operation for the 2nd time, it completes the insert operation successfully and rest of the transaction goes fine. Now the problem is, all the updates done prior to -911 are lost and all the updates done after the -911 are retained.

My application experts tried to find out the root cause of the data inconsistency and they are not able to find out the root cause as the transaction executed several programs and there was no repeating pattern in the data inconsistency as they coded this retry logic throughout the system for all the table inserts.

Finally I found this issue and informed application experts and then they changed the logic for -911 to abort program when -911 is received for the first time.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.