Sunday, July 5, 2026

Understanding "Normal Disposition" and "Abnormal Disposition" in the JCL DISP Parameter

 When learning the JCL DISP parameter, beginners often get confused about the terms Normal Disposition and Abnormal Disposition.
 
Every job step that executes produces a return code. Common return codes are 0, 4, 8, 12, and 16.
 
If the job step does NOT abend, the normal disposition takes effect. Even if the step ends with a return code such as 8, 12, or 16 (or any other return code, for that matter), that is still a normal termination, so the normal disposition applies.
 
If the job step abends with a system abend code Sxxx (such as S0C4, S0C7, SB37, SE37, or S322) or a user abend code Uxxxx, the abnormal disposition takes effect.
 
 
This can be illustrated with the following example:
 
//STEP1   EXEC PGM=IDCAMS                                       
//SYSPRINT DD SYSOUT=*                                          
//DD1      DD DSN=USERID.STEP1.DSN1,                           
//            DISP=(NEW,CATLG,DELETE),                          
//            LRECL=80,RECFM=FB,                                
//            SPACE=(TRK,(1,1),RLSE)                            
//DD2      DD DSN=USERID.STEP1.DSN2,                           
//            DISP=(NEW,CATLG,DELETE),                          
//            LRECL=80,RECFM=FB,                                
//            SPACE=(TRK,(1,1),RLSE)                            
//SYSIN    DD *                                                 
 SET MAXCC=16                                                   
//*                                                             
//STEP2   EXEC PGM=SORT                                         
//SORTIN   DD DUMMY,LRECL=80,RECFM=FB                           
//SORTOUT  DD DUMMY,LRECL=80,RECFM=FB                           
//SYSOUT   DD SYSOUT=*                                          
//DD1      DD DSN=USERID.STEP1.DSN2,                           
//          DISP=(OLD,CATLG,DELETE)                             
//SYSIN    DD *                                                 
 INTENTIONALLY FORCING ABEND BY NOT GIVING PROPER CONTROL CARD  
//*                                                             
 
The job log messages below show that even though STEP1 ended with return code 16, the datasets USERID.STEP1.DSN1 and USERID.STEP1.DSN2 were still kept and cataloged, as coded in the normal disposition.
 
IEF142I USERIDS STEP1 - STEP WAS EXECUTED - COND CODE 0016                  
IEF285I   USERID.USERIDS.JOB04476.D0000103.?         SYSOUT                
IGD104I USERID.STEP1.DSN1                           RETAINED,  DDNAME=DD1   
IGD104I USERID.STEP1.DSN2                           RETAINED,  DDNAME=DD2   
 
The job log messages below show that because STEP2 abended with user abend code U0016, USERID.STEP1.DSN2 was deleted, as coded in the abnormal disposition of the DISP parameter.
 
IEF472I USERIDS STEP2 - COMPLETION CODE - SYSTEM=000 USER=0016 REASON=00000000    
IEF285I   USERID.USERIDS.JOB04476.D0000104.?         SYSOUT                      
IGD105I USERID.STEP1.DSN2                           DELETED,   DDNAME=DD1   
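
The rule demonstrated by the two steps above can be summarised in a few lines of Python. This is only a minimal sketch of the decision, not of how the system implements it; the function name `step_end_disposition` and its arguments are hypothetical and chosen for illustration.

```python
def step_end_disposition(disp, abend_code=None):
    """Pick the disposition that applies when a step ends.

    disp       -- (status, normal, abnormal), mirroring DISP=(status,normal,abnormal)
    abend_code -- e.g. "S0C4" or "U0016" if the step abended, else None.
                  The return code plays no part in the choice.
    """
    _status, normal, abnormal = disp
    return abnormal if abend_code is not None else normal


disp = ("NEW", "CATLG", "DELETE")

# STEP1 ended with return code 16 but did not abend: normal disposition.
print(step_end_disposition(disp))                      # CATLG

# STEP2 abended with U0016: abnormal disposition.
print(step_end_disposition(disp, abend_code="U0016"))  # DELETE
```

Note that the return code is deliberately not a parameter: only the presence of an abend switches the result from the second DISP subparameter to the third.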

What happens when you use the DUMMY parameter for a dataset

When the DUMMY parameter is specified for a dataset, no disk or tape resources are allocated to that dataset, and no I/O operations are performed against it.

During batch job testing, I frequently use the DUMMY parameter for output datasets that would otherwise contain millions of records and are not required for validation. Because these unnecessary output files are never written, the job avoids the associated I/O overhead and finishes sooner.

The following example illustrates the behaviour of the DUMMY parameter.

Sample job with the DUMMY parameter specified for the output file

//STEP1 EXEC PGM=SMF30ASM    
//STEPLIB  DD DSN=USERID.LOAD,DISP=SHR        
//SYSPRINT DD SYSOUT=*                                          
//SMF30IN DD DISP=SHR,DSN=USERID.WEEKLY.SMF30 
//OUT     DD DUMMY,                   
//        DISP=(NEW,CATLG,DELETE),                          
//        SPACE=(CYL,(50,50,)),
//        DCB=(LRECL=140,RECFM=FB)  
                               
The "EXCP Statistics" section of the JESYSMSG in the job log does not contain an entry for DDNAME OUT. This indicates that no I/O operations were performed against DDNAME OUT.

EXCP Statistics
===============
DDNAME   CC# Unit EXCP Count
STEPLIB      914F         15
SMF30IN   +2 9087      48001
SMF30IN      9186      48001
SMF30IN   +3 9284      23996
SMF30IN   +1 9080      26856
SMF30IN   +1 9187      48001

The above job with the output file defined without the DUMMY parameter
======================================================================
//STEP1 EXEC PGM=SMF30ASM                                
//STEPLIB  DD DSN=USERID.LOAD,DISP=SHR       
//SYSPRINT DD SYSOUT=*                                          
//SMF30IN DD DISP=SHR,DSN=USERID.WEEKLY.SMF30 
//OUT     DD DSN=USERID.SMF30.ASM.OUT,           
//        DISP=(NEW,CATLG,DELETE),                          
//        SPACE=(CYL,(50,50,)),
//        DCB=(LRECL=140,RECFM=FB)  
                           
The "EXCP Statistics" section of the JESYSMSG in the job log shows the number of I/O operations performed against DDNAME OUT.    

EXCP Statistics
===============
DDNAME   CC# Unit EXCP Count
STEPLIB      914F         15
SMF30IN   +1 9187      48001
SMF30IN   +2 9087      48001
SMF30IN      9186      48001
SMF30IN   +3 9284      23996
SMF30IN   +1 9080      26856
OUT          9187      12898
 

Friday, July 3, 2026

Mainframe Batch Window Optimization: Lessons from a 40,000-Job Batch Cycle

 Several years ago, I worked with a colleague on a Mainframe Batch Window Reduction initiative for a customer. The batch processing window ran from 7:00 PM to 10:30 PM, during which approximately 40,000 batch jobs were executed. The batch cycle contained around 10 critical processing paths.

Interestingly, nearly 99% of the 40,000 jobs completed within one minute, while only a handful of jobs had execution times ranging from 5 to 10 minutes. Most of the jobs ran every minute.

We applied the following techniques to reduce the batch window. As a result, we introduced about 45 minutes of slack time in 5 critical paths.

->Removed unwanted job dependencies

->Worked with upstream applications to get input files early

->Moved non-critical jobs out of the critical paths where feasible.

->Moved time-triggered jobs to earlier start times

->Converted sequential DB2 unload steps to parallel jobs

->Replaced IDCAMS steps with SORT where feasible.

->Replaced DSNTIAUL unload steps with BMC UNLOAD where feasible.

->Replaced BMC UNLOAD steps with image copies where feasible.

Thursday, July 2, 2026

Understanding COND=(0,NE) in JCL

 The COND parameter is often one of the most confusing JCL concepts for beginners because its logic works in a somewhat counterintuitive way.

When you code COND=(0,NE) on a job step, the condition is evaluated against the return codes of all previous steps.

If any previous step returns a code other than 0, the condition (0,NE) evaluates to true, and the current step is bypassed (FLUSHED).

If all previous steps return 0, the condition evaluates to false, and the current step executes normally.

 In simple terms, COND=(0,NE) means "Execute this step only if all preceding steps completed successfully with RC=0."

Example

//STEP1    EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  SET MAXCC=4
//*
//STEP2    EXEC PGM=IEFBR14
//*
//STEP3    EXEC PGM=IEFBR14,COND=(0,NE)
//*

Execution Results : 

 STEP3 is Flushed.

STEP1 ended with RC=04.

Since 04 is not equal to 0, the condition COND=(0,NE) becomes true.

As a result, STEP3 is skipped (FLUSHED).

 Even though STEP2 completed with RC=00, JCL evaluates the condition against all preceding steps, not just the immediately prior step. Because STEP1 returned a non-zero code, STEP3 does not execute.

Key takeaway: COND=(0,NE) is commonly used to ensure a step runs only when all earlier steps have completed successfully with a return code of zero.
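
The evaluation described above can be modelled in Python. This is a minimal sketch, assuming only the simple COND=(code,operator) form with no step name; the helper name `step_is_bypassed` is hypothetical. The test is read as "code operator return-code", and the step is bypassed if the test is true for any earlier step that ran.

```python
import operator

# JCL comparison operators mapped to Python comparisons.
_OPS = {
    "GT": operator.gt, "GE": operator.ge, "EQ": operator.eq,
    "NE": operator.ne, "LT": operator.lt, "LE": operator.le,
}


def step_is_bypassed(code, op, previous_rcs):
    """Return True if COND=(code,op) causes the step to be bypassed.

    previous_rcs holds the return codes of the earlier steps that executed.
    The step is bypassed when 'code op rc' is true for ANY of them.
    """
    return any(_OPS[op](code, rc) for rc in previous_rcs)


# The example job: STEP1 ended with RC=04, STEP2 with RC=00.
print(step_is_bypassed(0, "NE", [4, 0]))  # True  -> STEP3 is flushed
# Had both steps ended with RC=00, STEP3 would run.
print(step_is_bypassed(0, "NE", [0, 0]))  # False -> STEP3 executes
```

Reading the test with the code on the left is what makes COND feel backwards: a true condition means "do not run".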


 

Monday, June 29, 2026

How a Mainframe COBOL program misused MQ queue

I came across a mainframe COBOL program that reads messages one by one from an MQ queue, does some processing, and writes the output data to another MQ queue. It is a very simple program.

The read was a destructive MQGET, and the MQPUT was issued with MQPMO_NO_SYNCPOINT, so messages were written to the output MQ queue immediately.

I asked the application SMEs why this program was designed this way instead of using plain sequential files.

The answer: restartability. If the program abends, you simply restart it; no cleanup is needed.

This kind of approach can significantly increase Mainframe CPU utilization and place additional load on the MQ subsystem.

Sunday, June 28, 2026

Uncovering the Hidden Cause: A Db2 -911 Error That Silently Broke Data Consistency

Back in 2009, I was working as a Mainframe Consultant for a leading healthcare insurance company. The client’s system architecture was heavily based on CICS, with around 18 production regions handling healthcare claim adjudication. Each region was responsible for processing claims from specific parts of the United States and ran multiple background tasks simultaneously.

At that time, claim data was primarily stored in VSAM KSDS files, with each CICS region maintaining its own dataset. These records were quite large—around 18 KB each—and contained both the original claim data and the derived or calculated information generated during processing.

Around 2006-2007, the client introduced Db2 into the claim processing flow. The monolithic VSAM structure was normalized into approximately 12 Db2 tables. These tables were split similarly into two logical groups: one for original claim details and another for derived data. Unlike VSAM, the Db2 tables were shared across all CICS regions, rather than being region-specific.

However, to minimize risk, the organization continued to treat VSAM as the primary data source, with Db2 acting as a shadow repository.

The claim processing flow remained largely unchanged: the system would read the VSAM record at the start, cache it in memory, process it across multiple programs, and finally write the updated data back to VSAM. With Db2 integration, additional SELECT, INSERT, and UPDATE statements were introduced throughout the processing steps to keep Db2 in sync with the cached data.

After deploying these changes to production, application SMEs observed inconsistencies—certain claims had missing or incomplete data in Db2 tables. Interestingly, the issue was not consistent; different tables were affected for different claims. Instead of investigating the root cause, a reconciliation batch job was implemented. This job compared VSAM records against Db2 and corrected discrepancies by inserting or updating missing data.

In 2009, I was assigned to debug a production issue related to claim processing. I set up an Xpediter session and carefully traced the execution of a claim through the system.

During debugging, I encountered a recurring Db2 SQL error: SQLCODE -911, which indicates a rollback due to a deadlock or timeout. I continued the session and noticed that, after processing, certain Db2 tables were missing data for the same claim.

This was a critical observation. The -911 error triggers a rollback, which means any previous INSERT or UPDATE operations in that logical unit of work are undone.

Further analysis revealed that the application had retry logic for handling SQLCODE -911. Whenever this error occurred, the program would retry the same SQL statement up to five times. If any retry succeeded, processing would continue as if nothing had happened.

The problem?

This retry logic was implemented for SELECT statements.

As a result:

A -911 error would roll back prior updates to Db2 tables.

A subsequent successful retry (on a SELECT) allowed processing to continue.

This led to partial or missing data in Db2 tables.

This flaw had gone unnoticed for nearly two years after Db2 integration.

When I presented my findings to the application SMEs, there was initial skepticism, partly because I was relatively new to the team. However, I substantiated the findings using official Db2 documentation explaining SQLCODE -911.

Eventually, the team acknowledged the root cause. The resolution involved significant changes to remove inappropriate retry logic and ensure proper handling of transactional failures.

This experience reinforced an important lesson:

Retry mechanisms must be carefully designed—especially in transactional systems—otherwise, they can silently introduce data inconsistencies.
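
The failure mode can be reproduced with a toy model. The sketch below is not the client's code and does not use a real Db2 driver: `FakeDb`, `Rollback911` and both retry functions are hypothetical stand-ins. It contrasts retrying only the failing statement (the flawed pattern described above) with one common alternative, re-running the whole unit of work after a rollback; the actual fix the team chose is not described in this post.

```python
class Rollback911(Exception):
    """Stands in for SQLCODE -911: the current unit of work was rolled back."""


class FakeDb:
    """Toy in-memory connection; the first `fail_selects` SELECTs hit a -911."""

    def __init__(self, fail_selects=1):
        self.committed = []
        self.pending = []
        self._fail_selects = fail_selects

    def insert(self, row):
        self.pending.append(row)

    def select(self):
        if self._fail_selects > 0:
            self._fail_selects -= 1
            self.pending.clear()  # -911 undoes everything since the last commit
            raise Rollback911()
        return self.committed + self.pending

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()


def statement_level_retry(db, attempts=5):
    """Flawed: retries only the SELECT, ignoring that earlier inserts are gone."""
    db.insert("original claim row")
    for _ in range(attempts):
        try:
            db.select()
            break
        except Rollback911:
            continue
    db.insert("derived claim row")
    db.commit()


def unit_of_work_retry(db, attempts=5):
    """Re-runs the whole unit of work after a rollback."""
    for _ in range(attempts):
        try:
            db.insert("original claim row")
            db.select()
            db.insert("derived claim row")
            db.commit()
            return True
        except Rollback911:
            continue
    return False


flawed = FakeDb()
statement_level_retry(flawed)
print(flawed.committed)  # ['derived claim row']  (the original row was lost)

safe = FakeDb()
unit_of_work_retry(safe)
print(safe.committed)    # ['original claim row', 'derived claim row']
```

In the flawed version nothing signals an error: the SELECT eventually succeeds and processing continues, which matches how the inconsistency stayed hidden.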

Tuesday, June 2, 2026

Db2 for z/OS 101: Db2 and WLM


The below notes are extracted from https://www.youtube.com/watch?v=Z8WdZ52NCbA

Terminology

Service definition

Consists of one or more service policies

Service policy

Contains several workloads
One service policy is active at a time in an LPAR or Parallel Sysplex

Workload
Arbitrary collection of one or more service classes

Service class

Each service class has at least one period
- Each period has one goal
- If > 1 period, all but last have a duration

Goal (5 types)
System
Average response time
% response time
Execution velocity
Discretionary

Classification rules

WLM assigns address spaces and transactions to service classes by classification rules 

 Concepts: service class and classification


Classification

  • Assignment of incoming work to a service class, and optional report class
  • Based on a wide variety of filters, or qualifiers

 

Service class

Set or group of related work

  • Production CICS, IMS, and Db2 address spaces might be in the same service class: STCHI or PRODHI
  • Separate report classes can report on CICS, IMS, Db2 separately

Service class can combine goals of different types in multiple periods

  • Period is a combination of importance (IMP), goal, and duration
  • A service class period is the target of WLM measurement and management actions

Subsystem types and classification

 

Transaction Type         Allowable Goal Types                                Allowable # of Periods
Address space oriented   Response Time, Execution Velocity, Discretionary    Multiple
Enclave                  Response Time, Execution Velocity, Discretionary    Multiple
CICS/IMS                 Response Time                                       One

Goal Types

System goals
SYSTEM, SYSSTC
- These have fixed dispatch priorities above IMP 1

Response time goals
Average response time
- Includes queue time and execution time
- Better for homogeneous type transactions

Percentile response time
- Reduce effect of outliers 

- Better for heterogeneous type transactions

 

Execution velocity goals (velocity goals)
Intended for work where response time goals are not appropriate
- Address spaces, long running jobs

Discretionary

For low priority, long-running work
- Probably not appropriate for Db2 work

 

Goal types: more detail


Execution velocity goals (velocity goals)

  • How fast work should run relative to other work requests when ready, without being delayed for CPU, storage or I/O
  • Expressed as a number, e.g. 60 or 40
    Value of 60 means 'ready' work runs 60% of the time
  • Differentiate velocity goals within an importance level by 10
  • Appropriate velocity goals depend on the number of engines (CPs)
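
The statement that a velocity of 60 means ready work runs 60% of the time follows from how execution velocity is commonly defined: using samples divided by using plus delay samples, times 100. That formula is not spelled out in these notes, so treat the sketch below as an illustration under that assumption; the function name and the sample counts are hypothetical.

```python
def execution_velocity(using_samples, delay_samples):
    """Execution velocity = 100 * using / (using + delay).

    using_samples -- samples where the work was using a managed resource
    delay_samples -- samples where the work was ready but delayed
    """
    total = using_samples + delay_samples
    if total == 0:
        raise ValueError("no samples collected")
    return 100 * using_samples / total


# Work found running in 60 samples and delayed in 40 samples.
print(execution_velocity(60, 40))  # 60.0
```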

 Response time goals

  • Average response time goal of 1 second - if one in 10 transactions takes 10 seconds, you will miss this goal
  • Percentile response time goal of 90% complete in 1 second, one transaction in 10 taking 10 seconds will not miss this goal
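
The difference between the two goal types is easy to check with numbers. This is a minimal sketch; it assumes the other nine transactions each take 0.5 seconds (the notes do not say), and the function names are hypothetical.

```python
def average_goal_met(times, goal_seconds):
    """True if the mean response time is within the goal."""
    return sum(times) / len(times) <= goal_seconds


def percentile_goal_met(times, goal_seconds, percent):
    """True if at least `percent`% of transactions finish within the goal."""
    within = sum(1 for t in times if t <= goal_seconds)
    return within * 100 >= percent * len(times)


# Nine transactions at 0.5 s (assumed) and one outlier at 10 s.
times = [0.5] * 9 + [10.0]

print(sum(times) / len(times))              # 1.45
print(average_goal_met(times, 1.0))         # False: the outlier drags the average up
print(percentile_goal_met(times, 1.0, 90))  # True: 9 of 10 finished within 1 s
```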

Importance

For most work, importance 1 (IMP 1) is highest and importance 5 (IMP 5) is lowest

WLM applies resources to IMP 1 first

If IMP 1 work meets its goals, WLM will apply resources to IMP 2, then IMP 3, then IMP 4, etc.

Some service trickles down to DISCRETIONARY

SYSTEM and SYSSTC are internal service classes for system tasks and have the highest dispatching priorities

SYSOTHER is the default service class for unclassified work and runs at a DISCRETIONARY goal

Note: not all work is "most important"

Importance levels and Db2, example

SYSTEM                      z/OS
SYSSTC                      IRLMs
IMP 1            Highest    DB2PMSTR, DB2PDBM1, DB2PDIST, DB2PWLMx
IMP 2            High       Production DDF txns
IMP 3            Medium
IMP 4            Low        Low priority work
IMP 5            Lowest     Lowest priority work
DISCRETIONARY
SYSOTHER                    Default service class

Importance 1 is highest priority after SYSSTC

Db2 address spaces should have velocity goals and a single period defined

Non-production Db2s could be IMP 2 or IMP 3 or IMP 4 if in same LPAR (or Parallel Sysplex) with production Db2

Discretionary work gets service after all other importance levels

  • Not appropriate for Db2 address spaces
  • Not recommended for Db2 work
  • Very little service if CPU 100% busy

 WLM concepts and Db2 (notes)


Importance

  • Production Db2 address spaces (MSTR, DBM1, DIST, WLM) should be defined with Importance 1 (IMP 1)
  • Non-production Db2 address spaces in a production LPAR should be defined with lower importance: IMP > 1
    • Consider relative to other production work
  • Production DDF transactions should generally be defined with IMP below that of production Db2 address spaces
  • IRLMs should be defined in SYSSTC

 

Goals for Db2 work

  • System - IRLM in SYSSTC
  • Velocity goals are appropriate for started tasks or long-running work
    • Db2 address spaces should have velocity goals and only a single period in the service class (MSTR, DBM1, DIST, WLMx)
  • Response time goals are appropriate for transactions, including most DDF work
    • Percentile response time-e.g. 90% complete in 0.5 seconds
    • Average response time - e.g. average response time is 0.5 seconds
  • Discretionary: below IMP 5. Not appropriate for Db2 work, in general

 

Service class: assigning goal types (example only)


CICS, IMS or TSO transactions
E.g. average response time goal: transactions complete in < 0.7 seconds

Db2 Address Spaces
Velocity goal; IMP 1
Exec Vel = 70 , Single period

Production DDF Transactions
Percentile response time goal, single period
IMP 2; 90% complete < 0.5 seconds

Non-production DDF: response time goals in first period, response time or velocity in second period

Period 1: IMP 3, 90% complete <0.5 seconds
Period 2: IMP 4, 90% complete < 4 seconds
Period 3: IMP 5, Vel = 40


Service class: period switch - example


PERIOD 1
DUR = 300
IMP = 3
R/T = 90% in 0.5 sec

PERIOD 2
DUR = 600
IMP = 4
R/T = 90% in 4 sec

PERIOD 3
IMP = 5
VEL = 40

All transactions assigned to this service class start in Period 1

  • WLM manages the transactions in period 1 to the percentile response time goal of 90% completing in half a second, with importance of 3

 Transactions that accumulate 300 service units (DUR = 300) before completing migrate to Period 2

  • New service class period; WLM manages the transactions in period 2 to the goal of 90% completing in 4 seconds, with importance of 4
  • This means, 90% of those that did not complete in period 1

Transactions that accumulate 900 service units (DUR 300 + DUR 600) before completing migrate to Period 3

  • A new service class period; WLM manages the transactions in period 3 to a velocity goal of 40, with importance of 5
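
The period switch arithmetic can be sketched as a lookup on accumulated service units. This is only a model of the example above, not of WLM internals; `PERIODS` and `period_for` are hypothetical names, and the durations are the ones from the example (300, then 600, then no duration for the last period).

```python
# (name, duration in service units, importance, goal); the last period has no duration.
PERIODS = [
    ("PERIOD 1", 300, 3, "90% in 0.5 sec"),
    ("PERIOD 2", 600, 4, "90% in 4 sec"),
    ("PERIOD 3", None, 5, "velocity 40"),
]


def period_for(consumed_service_units, periods=PERIODS):
    """Return the period tuple a transaction is in after consuming that many service units."""
    boundary = 0
    for period in periods:
        duration = period[1]
        if duration is None:
            return period  # last period: no further switch
        boundary += duration  # durations accumulate: 300, then 300 + 600 = 900
        if consumed_service_units < boundary:
            return period
    return periods[-1]


for units in (0, 299, 300, 899, 900, 5000):
    name, _dur, imp, goal = period_for(units)
    print(units, name, f"IMP {imp}", goal)
```

A transaction that has used 299 service units is still managed to the Period 1 goal; at 300 it is managed as Period 2, and at 900 as Period 3.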

"Service units" is a hardware independent measure of CPU consumption. If your transaction consumes 1000 service units on a z13, it should consume 1000 service units on a z14, z15 or z16 (and so on)

WLM managed delays

Processor
Dispatching priority

Non-paging DASD I/O
IOSQ, subchannel pending, control unit queue

Storage
Paging, swapping

Tasks
Multi-programming level, server address space creation
- e.g. WLM address spaces for external stored procedures

WLM cannot manage
- User delay
- Network delay
Any resources or delays not listed above are shown as UNK (unknown)

Service class periods


WLM heuristic behavior is applied to service class periods

WLM can effectively manage 25-30 active service class periods

  • If you have more than 30 active service class periods, WLM may not be able to adjust resources for all of them when the system is busy
  • It is precisely when the system is busy that you want WLM to adjust resources to meet your business goals

"Loose" goals are performance goals that are too easily achieved

  • Service class periods with loose goals are likely to have a PI < 1, so WLM will always perceive they are meeting their goals.
  • Service class periods with loose goals may have a PI < 0.7, in which case they may become a donor
    • WLM may reduce resources for donor service class to apply to recipient service class (one with PI > 1)
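
The performance index (PI) thresholds above can be illustrated with a small calculation. The notes do not define PI; the sketch assumes the usual definitions (actual divided by goal for an average response time goal, goal divided by actual for a velocity goal) and uses the 0.7 and 1 thresholds from the bullets. Function names and sample values are hypothetical, and percentile goals are left out.

```python
def pi_response_time(goal_seconds, actual_seconds):
    """PI for an average response time goal: actual / goal (assumed definition)."""
    return actual_seconds / goal_seconds


def pi_velocity(goal_velocity, actual_velocity):
    """PI for a velocity goal: goal / actual (assumed definition)."""
    return goal_velocity / actual_velocity


def classify(pi):
    """Apply the thresholds from the notes: PI > 1 misses, PI < 0.7 may donate."""
    if pi > 1:
        return "missing goal (possible recipient)"
    if pi < 0.7:
        return "loose goal (possible donor)"
    return "meeting goal"


# A velocity goal of 40 that is easily beaten with an actual velocity of 80.
print(classify(pi_velocity(40, 80)))          # loose goal (possible donor)
# A 0.5 s response time goal with an actual average of 0.75 s.
print(classify(pi_response_time(0.5, 0.75)))  # missing goal (possible recipient)
```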

 Defining Db2 address spaces to WLM


Db2 address spaces are started tasks

  • To WLM, the Db2 address spaces have a subsystem type of "STC"

 IRLMs should be defined in service class SYSSTC

Remaining Db2 address spaces should be assigned to a service class with a single period, a velocity goal and appropriate importance. For example,

  • Production: IMP 1
  • QA, Development and/or Test in same LPAR/Sysplex:
    • IMP > 1 (i.e. lower importance)
    • Adjust based on other production work, such as production batch
  • Db2 address spaces include ssnmMSTR, ssnmDBM1, ssnmDIST and ssnmWLMx for stored procedures

Three types of Db2 work

Local subsystem

Allied threads
-CICS
-IMS
-TSO
-WebSphere on zSystems
-MQ on zSystems

DDF threads

DDF requests use enclave SRBs
- zIIP-eligible 

Stored procedures

External stored procedures run in WLM application environments
Native SQL procedures run in DBM1 address space

First type: local attach

Db2 SQL activity runs under dispatchable unit of invoker

  • IMS, CICS, TSO, Batch, etc.
  • Inherited classification class of invoker
  • Priority and management of home unit
  • Service attributed back to invoker

 Second type: DDF and enclave SRBs

Why do we need enclaves?

-Manage DDF work separately from ssnmDIST

-Differentiate between high priority and low priority DDF work

Classifying DDF work


Define service classes and appropriate goals for DDF work

DDF Classification Defaults

  • Defaults apply if you do not provide any classification rules for DDF work
  • Enclaves default to the SYSOTHER service class (i.e. discretionary goal) unless they can be assigned to a service class

Managing DDF Work (Enclaves)

  • All goal types are permitted
  • Transactions may be subject to period switch
  • WLM manages an enclave with its own dispatching priority, etc.

For high-performance DBAT DDF transactions, we must use a velocity goal