Sunday, June 28, 2026

Uncovering the Hidden Cause: A Db2 -911 Error That Silently Broke Data Consistency

Back in 2009, I was working as a Mainframe Consultant for a leading healthcare insurance company. The client’s system architecture was heavily based on CICS, with around 18 production regions handling healthcare claim adjudication. Each region was responsible for processing claims from specific parts of the United States and ran multiple background tasks simultaneously.

At that time, claim data was primarily stored in VSAM KSDS files, with each CICS region maintaining its own dataset. These records were quite large—around 18 KB each—and contained both the original claim data and the derived or calculated information generated during processing.

Around 2006–2007, the client has introduced Db2 into the claim processing flow. The monolithic VSAM structure was normalized into approximately 12 Db2 tables. These tables were split similarly into two logical groups—one for original claim details and another for derived data. Unlike VSAM, the Db2 tables were shared across all CICS regions, rather than being region-specific.

However, to minimize risk, the organization continued to treat VSAM as the primary data source, with Db2 acting as a shadow repository.

The claim processing flow remained largely unchanged: the system would read the VSAM record at the start, cache it in memory, process it across multiple programs, and finally write the updated data back to VSAM. With Db2 integration, additional SELECT, INSERT, and UPDATE statements were introduced throughout the processing steps to keep Db2 in sync with the cached data.

After deploying these changes to production, application SMEs observed inconsistencies—certain claims had missing or incomplete data in Db2 tables. Interestingly, the issue was not consistent; different tables were affected for different claims. Instead of investigating the root cause, a reconciliation batch job was implemented. This job compared VSAM records against Db2 and corrected discrepancies by inserting or updating missing data.

In 2009, I was assigned to debug a production issue related to claim processing. I set up an Xpediter session and carefully traced the execution of a claim through the system.

During debugging, I encountered a recurring Db2 SQL error: SQLCODE -911, which indicates a rollback due to a deadlock or timeout. I continued the session and noticed that, after processing, certain Db2 tables were missing data for the same claim.

This was a critical observation. The -911 error triggers a rollback, which means any previous INSERT or UPDATE operations in that logical unit of work are undone.

Further analysis revealed that the application had retry logic for handling SQLCODE -911. Whenever this error occurred, the program would retry the same SQL statement up to five times. If any retry succeeded, processing would continue as if nothing had happened.

The problem?

This retry logic was implemented for SELECT statements.

As a result:

A -911 error would roll back prior updates toDb2 tables.

A subsequent successful retry (on a SELECT) allowed processing to continue.

This led to partial or missing data in Db2 tables.

This flaw had gone unnoticed for nearly two years after Db2 integration.

When I presented my findings to the application SMEs, there was initial skepticism, partly because I was relatively new to the team. However, I substantiated the findings using official Db2 documentation explaining SQLCODE -911.

Eventually, the team acknowledged the root cause. The resolution involved significant changes to remove inappropriate retry logic and ensure proper handling of transactional failures.

This experience reinforced an important lesson:

Retry mechanisms must be carefully designed—especially in transactional systems—otherwise, they can silently introduce data inconsistencies.

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.