GT 5.0.0 Release Notes: GRAM5


1. Component Overview

The Grid Resource Allocation and Management (GRAM5) component is used to locate, submit, monitor, and cancel jobs on Grid computing resources. GRAM5 is not a job scheduler, but rather a set of services and clients for communicating with a range of different batch/cluster job schedulers using a common protocol. GRAM5 is meant to address a range of jobs where reliable operation, stateful monitoring, credential management, and file staging are important.

2. Feature summary

Features new since 4.2.x

  • Server-side architectural changes to improve scalability, performance, and reliability
  • Improved error notification protocol compared to GRAM2
  • Teragrid Gateway Identity support for job auditing
  • Usage stats messages
  • Added support for Sun Grid Engine (SGE)

Other Standard Supported Features

  • Remote job execution and management
  • Uniform and flexible interface to local resource managers
  • File staging before and after job execution
  • File and directory clean up after job termination
  • Service auditing for each submitted job

Removed Features

  • The GRAM5 client tools have dropped support for the DUROC API for task coallocation
  • The GRAM5 service no longer streams output and error during job execution; instead this data is sent after the job terminates
  • The GRAM5 service no longer provides intra-job communication via the DUCT API
  • GRAM5 does not rely on XML schemas or WSDL service definitions
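
Because DUROC coallocation is gone, a client may want to reject such requests before submitting them. The following is a minimal sketch of a client-side pre-check, assuming that coallocation requests are written as RSL multirequests (an RSL string whose first non-blank character is "+"). The function names and host names are hypothetical and are not part of any GRAM5 client API.

```python
def is_multirequest(rsl: str) -> bool:
    """Return True if the RSL text looks like a multirequest.

    Assumption: a multirequest is recognised only by its leading '+';
    this sketch does not parse or validate the rest of the RSL.
    """
    return rsl.lstrip().startswith("+")


def check_for_gram5(rsl: str) -> None:
    """Raise ValueError for requests this sketch assumes GRAM5 will reject."""
    if is_multirequest(rsl):
        raise ValueError(
            "multirequest RSL (DUROC coallocation) is not supported by GRAM5"
        )


# Hypothetical example requests; the host names are placeholders.
single_job = "&(executable=/bin/date)(count=1)"
coallocated_job = (
    '+(&(resourceManagerContact="host1.example.org")(executable=/bin/date))'
    '(&(resourceManagerContact="host2.example.org")(executable=/bin/date))'
)

check_for_gram5(single_job)  # passes silently
try:
    check_for_gram5(coallocated_job)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Checking on the client side only saves a round trip; the service remains the authority on which requests it accepts.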

3. Summary of Changes in GRAM5

GRAM5 is a significant improvement over the GRAM2 and GRAM4 service implementations. GRAM2's main limitation is scalability; GRAM4's is reliability. GRAM5 is both reliable and scalable, and it is compatible with GRAM2. Other improvements include completely rewritten service logging based on the CEDPS logging best practices, Teragrid Gateway Identity support for job auditing, support for job exit codes, and usage statistics support.

We have been very encouraged by our performance results, which show more than 10x the scalability of GRAM2 and a roughly 10x reduction in resource consumption on the service host. We welcome your feedback as you integrate GRAM5 into your production grids.

4. Bug Fixes

  • All Fixes: a list of all 67 improvements and bug fixes made for the GT 5.0.0 release

5. Known Problems

The following problems and limitations are known to exist for GRAM5 at the time of the 5.0.0 release:

5.1. Limitations

  • [list limitations]

5.2. Outstanding bugs

  • Bug 108: Fork perl zombies
  • Bug 106: Fix test failures with SGE LRM adapter
  • Bug 105: Held Condor jobs should be reported as SUSPENDED
  • Bug 104: globus-job-manager-event-generator loads all historical events the first time run
  • Bug 103: Ease two phase end commit timeout
  • Bug 102: Fix Two Phase Commit Semantics for Failed Jobs
  • Bug 100: one of the RSL parameters is not supported error doesn't indicate which it is
  • Bug 99: Add a high-level diagram for the approach doc
  • Bug 98: Add Condor-G doc for using GRAM 2 and 5
  • Bug 96: GRAM-106 SGE LRM mishandles invalid environment definition
  • Bug 95: GRAM-106 SGE LRM doesn't check for executable permissions
  • Bug 94: GRAM-106 SGE LRM doesn't check for executable existence
  • Bug 93: GRAM-106 SGE LRM script doesn't handle environment vars with whitespace
  • Bug 92: stdout to local file doesn't work if count >1
  • Bug 88: Missed two phase commit causes job to not be destroyed
  • Bug 86: Prioritize script invocations to improve throughput
  • Bug 80: GRAM 5 beta2 release
  • Bug 79: Add support for OSG's "NFS Lite" concept
  • Bug 77: GRAM zombie
  • Bug 71: GRAM protocol test package contains expired test certificate
  • Bug 70: globus-job-status acts strange for completed jobs in GRAM5
  • Bug 69: globus-job-get-output -f doesn't work in GRAM5
  • Bug 68: Bad error when proxy is too short-lived
  • Bug 54: make globus-job-manager-event-generator not require configuration by default
  • Bug 53: Generalize log path configuration
  • Bug 51: configurable control of number of perl scripts that can run simultaneously
  • Bug 47: simplify the throughput tester program and use improved version as doc
  • Bug 24: Debug/verbose flags for globusrun, globus-job-run
  • Bug 23: Improved error codes and error reporting for users
  • Bug 22: client connections can't be timed out
  • Bug 15: transition from httpg to https
  • Bug 14: increase availability of GRAM in linux distributions
  • Bug 12: Gatekeeper's syslog output cannot be controlled
  • Bug 5: Add gram-level prologue and epilogue script execution for mpi jobs
  • Bug 4: Add support for a "managed fork" service
  • Bug 2: Investigate how to setup GRAM5 services in a HA setup

6. Technology dependencies

GRAM depends on the following GT components:

  • Globus Common
  • GSI C
  • GridFTP server

7. Tested platforms

Tested platforms for GRAM5:

  • Linux

    • CentOS 5.3 x86_64
    • Debian 4.0 x86_64

  • Mac OS X

    • Mac OS X 10.5.8

8. Backward compatibility summary

Protocol changes in GRAM since the GT 4.2.x series:

  • The GRAM5 service uses a superset of the GRAM2 protocol for communication between the client and service. The extensions supported in GRAM5 are implemented in such a way that they are ignored by GRAM2 services or clients. These extensions provide improved error messages and version detection.
  • GRAM5 does not support task coallocation using DUROC and its related protocols. Jobs submitted using DUROC directives will fail.
  • GRAM5 does not support file streaming. The standard output and standard error streams are sent after the job completes instead of during execution.
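
The first item above relies on a common compatibility technique: a reader skips any attribute it does not recognise, so a newer peer can add fields without breaking an older one. The sketch below illustrates that idea only. The "name: value" line format and every attribute name in it (including the "x-" extension names) are assumptions made for the example, not the actual GRAM wire format.

```python
# Attribute sets are hypothetical and exist only to illustrate the technique.
OLDER_READER = frozenset({"protocol-version", "status", "failure-code"})
NEWER_READER = OLDER_READER | {"x-error-detail", "x-service-version"}


def parse_message(body: str, known: frozenset) -> dict:
    """Parse 'name: value' lines, silently skipping unknown attributes."""
    fields = {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank or malformed line: nothing to record
        name = name.strip().lower()
        if name in known:
            fields[name] = value.strip()
        # Unknown attributes are ignored rather than treated as errors,
        # which is what keeps an older reader working.
    return fields


message = (
    "protocol-version: 2\n"
    "status: 4\n"
    "failure-code: 17\n"
    "x-error-detail: executable not found\n"
)

old_view = parse_message(message, OLDER_READER)
new_view = parse_message(message, NEWER_READER)

print(sorted(old_view))            # extension field is absent
print(new_view["x-error-detail"])  # extension field is available
```

The older reader sees exactly the fields it has always understood, while the newer reader also gets the extra error detail from the same message.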

9. Associated Standards

None

10. For More Information

See GRAM5 for more information about this component.

Glossary

J

job scheduler

See the term scheduler.