The Dynamically-Updated Request Online Coallocator (DUROC) v0.8: Function Reference

This document specifies the existing DUROC v0.8 implementation and interfaces as provided in the Globus v1.0 release. It serves as a reference; more introductory text and examples can be found in the companion DUROC tutorial.

This document is maintained and best-viewed in HTML. The ASCII version suppresses typefaces and the hyperlinks which ease navigation.

Coallocator requirements and motivation  

 The Globus environment includes resource managers to provide access to a range of system-dependent schedulers. Each resource manager (RM) provides an interface to submit jobs on a particular set of physical resources.

In order to execute jobs which need to be distributed over resources accessed through independent RMs, a coallocator is used to coordinate transactions with each of the RMs and bring up the distributed pieces of the job. The coallocator must provide a convenient interface to obtain resources and execute jobs across multiple management pools.

Reflective management architecture

 The task an intelligent coallocation agent performs has two abstractly distinct parts. First, the agent must process resource specifications to determine how a job might be distributed across the resources of which it is aware--the agent lowers an abstract specification such that portions of the specification are allocated to the individual RMs that control access to those required resources. Second, the agent must process the lowered resource specification as part of a job request to actually attempt resource allocation--the agent issues job requests to each of the pertinent RMs to schedule the job. The process of lowering a resource specification in a job request in essence refines the request based on information available to the lowering agent. By separating the tasks of refinement and allocation in the architecture, we can allow user intervention to adjust the refinement based on information or constraints beyond the heuristics used internally by a particular automated agent. A GUI specification-editor has been suggested as a meaningful mode of user (job requester) intervention.

spec1 : resource specification
spec2 : resource specification
lower (spec1)   -->   spec2

spec : resource specification
job : job contact information (or error status)
request (spec)   -->   job

lowering example:
lower ( (count=5) )   -->  
 (+ (& (count=3) (resourceManagerContact=RM1 ))
    (& (count=2) (resourceManagerContact=RM2 )))

DUROC implements the allocation operation across multiple RMs in the Globus test-bed and leaves lowering decisions to higher-level tools.

Atomic requests

 Once a resource specification has been refined, the agent must attempt to allocate resources. In general the resources might be managed by different RMs, and the coallocator must atomically schedule the user's single abstract job or fail to schedule the job. Because the GRAM interface does not provide support for inter-manager atomicity, the user code must be augmented to implement a job-start barrier; as distributed components of the job become active, they must rendezvous with the allocating agent to be sure all components were successfully started prior to performing any non-restartable user operations.

main :
   job_start_barrier ( )
   . . .
   user_operations ( )

There are three important points regarding the job-start barrier in the user's code. First, atomicity of job creation can only be guaranteed after the barrier, so the user should not perform operations which cannot be reversed, e.g. certain persistent effects or input/output operations, until after the barrier. Second, the barrier call is used to implement guaranteed job cancellation within each RM; if the agent's job scheduling fails but some of the components have been scheduled through a manager that cannot cancel jobs it schedules, the agent will have to rendezvous with those components when they become active and signal them to abort. Third, the barrier call initializes the job-aggregation communication functions needed to make use of the coallocated resources.
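
As a concrete illustration, the skeleton above might look like the following C sketch using the runtime API documented later in this reference (a minimal sketch; error-code checking and the user's own code are elided):

#include "globus_duroc_runtime.h"

int
main (int argc, char **argv)
{
    /* activate the runtime library before any other DUROC runtime call */
    globus_module_activate (GLOBUS_DUROC_RUNTIME_MODULE);

    /* job-start barrier: returns only when all components of the
       distributed job have started successfully */
    globus_duroc_runtime_barrier ();

    /* only from this point on is it safe to perform non-restartable
       user operations */
    /* ... user_operations () ... */

    globus_module_deactivate (GLOBUS_DUROC_RUNTIME_MODULE);
    return 0;
}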

Coallocated resource specification language

 DUROC shares its Resource Specification Language (RSL) with GRAM. DUROC can perform allocations described by a 'lowered' resource specification. The task of the lowering agent is to take a resource request of some form, be it a generalized GRAM request or user inputs to a GUI interface, and produce a lowered request so that DUROC can directly acquire the resources for the user. The allocation semantics for DUROC requests are that each component of the top-level multi-request represents one GRAM request that DUROC should make as part of the distributed job DUROC is allocating. In order to make the request, DUROC must be able to determine what RM to contact. Typically there will be additional terms in the conjunctions of the lowered request, and those terms will be passed on verbatim in each GRAM request. DUROC will extract each component of the lowered multi-request, remove the DUROC-specific components of the subrequest, and then forward that subrequest to the specified GRAM. Therefore any other attributes supported by GRAM are implicitly supported by DUROC. For example:

+(&(resourceManagerContact=RM1)(count=3)(executable=myprog.sparc))
(&(resourceManagerContact=RM2)(count=2)(executable=myprog.rs6000))

In this request the executables and node counts are specified for each resource pool. While GRAM may in fact require fields such as these, DUROC treats them as it would any other fields not needed to do its job--it forwards them in the subrequests and it is up to the RMs to either successfully handle the request or return a failure code back to DUROC (which will then return an appropriate code to the user).

DUROC request processing (coallocation)

 Requests submitted to the DUROC API are decomposed into the individual GRAM requests and each request is submitted through the GRAM API. A DUROC request proceeds with those GRAM requests in the job that succeed.

Runtime features available to the job processes include a start barrier and inter-process communications to help coordinate the job processes. The start barrier allows the processes to synchronize before performing any non-restartable operations; in the absence of a start barrier, there is no way to guarantee that all job components are successfully created prior to executing user code. The communications library provides two simple mechanisms to send start-up and bootstrapping information between processes: an inter-subjob mechanism to communicate between ``node 0'' of each subjob, and an intra-subjob mechanism to communicate between all the nodes of a single subjob. A library of common bootstrapping operations is provided, using the public inter-subjob and intra-subjob communication interfaces.

For each GRAM subjob in the DUROC job, there are two optional RSL fields which affect the subjob behavior. The `subjobStartType' field allows the user to configure each subjob to either participate in the start barrier with strict subjob-state monitoring (value `strict-barrier'), participate in the start barrier without strict subjob-state monitoring (value `loose-barrier'), or not participate in the barrier at all (value `no-barrier'). Subjobs that don't perform the barrier run forward independently of the other subjobs. Strict state monitoring means that the job will be automatically killed if the subjob terminates prior to completing the barrier.

The `subjobCommsType' field allows the user to configure each subjob to either join the inter-subjob communications group as a blocking operation (value `blocking-join') or not join the inter-subjob communications group at all (value `independent'). When joining the group as a blocking operation, all participating subjobs will join together, i.e. the communications startup function will function as a group barrier.
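
For example, a lowered request using these fields might look like the following (the resource manager contacts and executables are placeholders); the first subjob participates in the barrier and the communications group while the second runs forward independently:

+(&(resourceManagerContact=RM1)(count=3)(executable=myprog)
   (subjobStartType=strict-barrier)(subjobCommsType=blocking-join))
 (&(resourceManagerContact=RM2)(count=1)(executable=mymonitor)
   (subjobStartType=no-barrier)(subjobCommsType=independent))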

Generic resource coallocation API

 The resource coallocation API provides functions for submitting a job request to a broker, for editing a submitted request, for cancelling a request, and for requesting job state information. The Dynamically-Updated Request Online Coallocator (DUROC) API is similar to the Resource Management (GRAM) API, with the addition of the subjob-add, subjob-delete, and barrier-release operations for managing resources, the runtime-barrier operation which must be performed during the startup of each node, and the job-structure and inter-subjob communication operations which at runtime provide a mechanism for job self-organization.

The following sections document the DUROC v0.8 API, including the runtime operations necessary to use DUROC v0.8.

DUROC control-library API

int
globus_module_activate
(
          GLOBUS_DUROC_CONTROL_MODULE)

Activate the DUROC control-library API implementation prior to using any of the API functions.

  • Returns GLOBUS_SUCCESS if successful, otherwise one of: [no errors currently defined]

int
globus_module_deactivate
(
          GLOBUS_DUROC_CONTROL_MODULE)

Deactivate the DUROC control-library API implementation when finished using any of the API functions.

  • Returns GLOBUS_SUCCESS if successful, otherwise one of: [no errors currently defined]

int
globus_duroc_control_init
(
          globus_duroc_control_t      *     controlp)

Initialize a globus_duroc_control_t object for subsequent coallocated-job submission and control.

  • controlp is the globus_duroc_control_t object to initialize.
  • Returns GLOBUS_DUROC_SUCCESS if successful, otherwise one of: [no errors currently defined]

A single globus_duroc_control_t object can be used to concurrently submit and control multiple DUROC jobs.


int
globus_duroc_control_job_request
(
          globus_duroc_control_t      *     controlp,
          const char                       *     description,
          int                                         job_state_mask,
          const char                      *      callback_contact,
          char                             **      job_contactp,
          int                                 *      subreq_countp,
          int                               **      subreq_resultsp)

Request coallocation of interactive resources at the current time.

  • controlp points to a globus_duroc_control_t object previously initialized with globus_duroc_control_init(). description is a description of the requested job. job_state_mask is 0 or a bitwise OR of the GLOBUS_DUROC_JOB_STATE_* states or GLOBUS_DUROC_JOB_STATE_ALL [currently ignored]. callback_contact is the URL to which events about the job should be reported [currently ignored]. job_contactp is a pointer to a character string storage pointer. subreq_countp is a pointer to integer storage. subreq_resultsp is a pointer to an integer storage array pointer.
  • If successful, *job_contactp is set to a unique identifier for the job and can be used as a handle in other DUROC API functions on this globus_duroc_control_t object, *subreq_countp is set to the number of subrequests found in description, and *subreq_resultsp is set to point at a freshly-allocated array of integers holding the gram_job_request() result codes for each subrequest.
  • Returns GLOBUS_DUROC_SUCCESS on success, or one of the following error codes: [no errors currently defined]
  • The array returned in *subreq_resultsp and the string returned in *job_contactp should be freed with globus_free() when the values are no longer needed.
  • A job submitted through this interface can subsequently be controlled with the other DUROC API functions by providing the submitted job's contact string to those calls.
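
The calling sequence might look like the following C sketch, combined with the barrier-release operation documented below (a minimal sketch: the RSL string is a placeholder, error handling is abbreviated, and passing a null callback contact is an assumption permitted only because the field is currently ignored):

#include "globus_duroc_control.h"

int
main (int argc, char **argv)
{
    globus_duroc_control_t   control;
    char                    *job_contact;
    int                      subreq_count;
    int                     *subreq_results;
    int                      i;
    int                      rc;

    globus_module_activate (GLOBUS_DUROC_CONTROL_MODULE);
    globus_duroc_control_init (&control);

    rc = globus_duroc_control_job_request (
             &control,
             "+(&(resourceManagerContact=RM1)(count=2)(executable=myprog))",
             0,                  /* job_state_mask [currently ignored] */
             GLOBUS_NULL,        /* callback_contact [currently ignored] */
             &job_contact,
             &subreq_count,
             &subreq_results);
    if (rc != GLOBUS_DUROC_SUCCESS)
        return 1;

    for (i = 0; i < subreq_count; i++)
    {
        /* subreq_results[i] holds the gram_job_request() result code
           for the ith subrequest */
    }

    /* let all checked-in subjobs run forward past the start barrier */
    globus_duroc_control_barrier_release (&control, job_contact,
                                          GLOBUS_TRUE);

    /* a real agent would monitor the job to completion here, e.g.
       with globus_duroc_control_subjob_states(), before exiting */

    globus_free (subreq_results);
    globus_free (job_contact);
    globus_module_deactivate (GLOBUS_DUROC_CONTROL_MODULE);
    return 0;
}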

int
globus_duroc_control_subjob_add
(
          globus_duroc_control_t      *     controlp,
          const char                       *     job_contact,
          const char                       *     subjob_description)

Augment a coallocation with an additional interactive resource at the current time.

  • controlp points to a globus_duroc_control_t object previously initialized with globus_duroc_control_init(). job_contact is as returned by globus_duroc_control_job_request(). subjob_description is a description of the subjob to be added.
  • Returns GLOBUS_DUROC_SUCCESS on success, or one of the following error codes: [no errors currently defined]

A job modified through this interface can subsequently be controlled with the other DUROC API functions by providing the job's contact string to those calls.
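
For instance, a subjob might be attached to a previously submitted job as in the fragment below (a sketch reusing the variables from the earlier submission example; the single-conjunction form of the subjob description and the `label' attribute name are assumptions, not specified in this section):

/* hypothetical fragment: attach one more subjob to the job
   submitted in the earlier example */
rc = globus_duroc_control_subjob_add (
         &control,
         job_contact,
         "&(resourceManagerContact=RM3)(count=1)(executable=myprog)(label=extra)");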


int
globus_duroc_control_subjob_delete
(
           globus_duroc_control_t      *     controlp,
           const char                       *     job_contact,
           const char                       *     subjob_label)

Modify a coallocation by removing an interactive resource at the current time.

  • controlp points to a globus_duroc_control_t object previously initialized with globus_duroc_control_init(). job_contact is as returned by globus_duroc_control_job_request(). subjob_label is the label of a subjob previously created via globus_duroc_control_job_request or globus_duroc_control_subjob_add.
  • Returns GLOBUS_DUROC_SUCCESS on success, or one of the following error codes: [no errors currently defined]

A job modified through this interface can subsequently be controlled with the other DUROC API functions by providing the job's contact string to those calls.


int
globus_duroc_control_barrier_release
(
          globus_duroc_control_t      *     controlp,
          const char                       *     job_contact,
          globus_bool_t                          wait_for_subjobs)

Release the job-start barrier for a submitted job, allowing checked-in subjobs to run forward past the barrier.

  • controlp is the same globus_duroc_control_t object to which the job was submitted. job_contact is as returned by globus_duroc_control_job_request(). wait_for_subjobs is one of the values:
    • GLOBUS_TRUE: release subjobs once they all enter the barrier
    • GLOBUS_FALSE: release subjobs if they have already entered the barrier, cancel the job if any subjob has not yet entered the barrier
  • Returns GLOBUS_DUROC_SUCCESS if successful, otherwise one of: [no errors currently defined]

This routine allows subjobs to run forward past the runtime barrier, and currently delimits a point after which subjobs cannot be added or deleted.


int
globus_duroc_control_job_cancel
(
          globus_duroc_control_t       *     controlp,
          const char                        *     job_contact)

Remove a Pending job request or kill processes associated with an Active request, releasing any associated resources, if such action is supported by the associated resource managers.

  • controlp is the same globus_duroc_control_t object to which the job was submitted. job_contact is as returned by globus_duroc_control_job_request().
  • Returns GLOBUS_DUROC_SUCCESS if successful, otherwise one of: [no errors currently defined]

This routine ``succeeds'' if the job is known. A successful return code does not guarantee that all job resources were successfully released.


int
globus_duroc_control_subjob_states
(
          globus_duroc_control_t       *     controlp,
          const char                        *     job_contact,
          int                                   *    subjob_countp,
          int                                 **    subjob_statesp,
          char                             ***    subjob_labelsp)

Obtain a snapshot of the status of each subjob in a submitted DUROC job.

  • controlp is the same globus_duroc_control_t object to which the job was submitted. job_contact is as returned by globus_duroc_control_job_request(). subjob_countp is a pointer to integer storage. subjob_statesp is a pointer to an integer storage array pointer. subjob_labelsp is a pointer to a string array pointer. If successful, *subjob_countp is set to the number of subjobs for which state information is known, *subjob_statesp is set to point to a freshly-allocated array of integer subjob states, and *subjob_labelsp is set to point to a freshly-allocated array of freshly-allocated string labels (or NULL values for subjobs which weren't given labels by the user). The individual subjob states are defined as:
    • GLOBUS_DUROC_SUBJOB_STATE_PENDING: the subjob's GRAM request succeeded
    • GLOBUS_DUROC_SUBJOB_STATE_ACTIVE: the subjob's GRAM job is active (but not checked in)
    • GLOBUS_DUROC_SUBJOB_STATE_CHECKED_IN: the subjob runtime system has checked in
    • GLOBUS_DUROC_SUBJOB_STATE_RELEASED: the subjob runtime system has checked in and been released
    • GLOBUS_DUROC_SUBJOB_STATE_DONE: the subjob's GRAM job is done and the subjob runtime system has been released
    • GLOBUS_DUROC_SUBJOB_STATE_FAILED: the subjob's GRAM job has terminated and the subjob runtime system was not released
  • Returns GLOBUS_DUROC_SUCCESS if successful, otherwise one of: [no errors currently defined]
  • The arrays and strings returned in *subjob_statesp and *subjob_labelsp should be freed with globus_free() when the values are no longer needed.

This routine can effectively be used in a polling loop to monitor the status of a job, for example in the display loop of a GUI agent.
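
Such a polling loop might look like the following sketch (the one-second pacing is an arbitrary choice):

#include <unistd.h>
#include "globus_duroc_control.h"

/* poll a submitted job until every subjob is done or failed;
   control and job_contact are as used in the earlier examples */
static void
monitor_job (globus_duroc_control_t *control, const char *job_contact)
{
    int    subjob_count;
    int   *subjob_states;
    char **subjob_labels;
    int    i;
    int    busy;

    do
    {
        globus_duroc_control_subjob_states (control, job_contact,
                                            &subjob_count,
                                            &subjob_states,
                                            &subjob_labels);
        busy = 0;
        for (i = 0; i < subjob_count; i++)
        {
            if (subjob_states[i] != GLOBUS_DUROC_SUBJOB_STATE_DONE
                && subjob_states[i] != GLOBUS_DUROC_SUBJOB_STATE_FAILED)
                busy++;

            if (subjob_labels[i] != NULL)
                globus_free (subjob_labels[i]);
        }
        globus_free (subjob_labels);
        globus_free (subjob_states);

        if (busy)
            sleep (1);     /* arbitrary pacing between polls */
    } while (busy);
}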

DUROC runtime-library API

int
globus_module_activate
(
          GLOBUS_DUROC_RUNTIME_MODULE)

Activate the DUROC runtime-library API implementation prior to using any of the API functions.

  • Returns GLOBUS_SUCCESS if successful, otherwise one of: [no errors currently defined]

int
globus_module_deactivate
(
          GLOBUS_DUROC_RUNTIME_MODULE)

Deactivate the DUROC runtime-library API implementation when finished using any of the API functions.

  • Returns GLOBUS_SUCCESS if successful, otherwise one of: [no errors currently defined]

void
globus_duroc_runtime_barrier
()

Rendezvous with the coallocator to implement job-start atomicity and coordinate the distributed processes.

  • Returns only when all job processes have successfully started.

This routine is called by the job processes at startup to implement job-start atomicity. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.


int
globus_duroc_runtime_inter_subjob_structure
(
          int        *     local_addressp,
          int        *     remote_countp,
          int       **    remote_addressesp)

Get the layout of the DUROC job. The DUROC inter-subjob communication routines can only be called on the subjob node where globus_duroc_runtime_intra_subjob_rank() reports the rank as zero (0)!

  • local_addressp is a pointer to integer storage. remote_countp is a pointer to integer storage. remote_addressesp is a pointer to an integer storage array pointer. Return GLOBUS_DUROC_SUCCESS and initialize *local_addressp with the local subjob's communication address, *remote_countp with the number of remote subjobs, and *remote_addressesp with a freshly-allocated array containing the remote subjobs' communication addresses, or return one of the following error codes: [no errors currently defined]
  • The array returned in *remote_addressesp should be freed with globus_free() when the values are no longer needed.

This routine is called by the job processes after the inter-subjob initialization operation to find the layout of the job. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.


int
globus_duroc_runtime_inter_subjob_send
(
          int                            dst_addr,
          const char           *    tag,
          int                            msg_size,
          globus_byte_t      *     msg)

Send a byte-vector to another subjob in the DUROC job. The DUROC inter-subjob communication routines can only be called on the subjob node where globus_duroc_runtime_intra_subjob_rank() reports the rank as zero (0)!

  • dst_addr is the address of the destination subjob. tag is a nul-terminated string which must match that provided to the receive call on the destination subjob. msg_size is the number of bytes of payload to send. msg is a pointer to the payload of msg_size values of type globus_byte_t.
  • Return GLOBUS_DUROC_SUCCESS or one of the following error codes: [no errors currently defined]

This routine is called by the job processes after the inter-subjob initialization operation to transmit messages between subjobs. The data is received by a corresponding call to globus_duroc_runtime_inter_subjob_receive at the destination subjob. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.


int
globus_duroc_runtime_inter_subjob_receive
(
          const char             *      tag,
          int                        *     msg_sizep,
          globus_byte_t       **     msgp)

Receive a byte-vector sent by another subjob in the DUROC job. The DUROC inter-subjob communication routines can only be called on the subjob node where globus_duroc_runtime_intra_subjob_rank() reports the rank as zero (0)!

  • tag is a nul-terminated string which must match that provided to the send call on the originating subjob. msg_sizep is a pointer to integer storage. msgp is a pointer to a globus_byte_t storage array pointer. Return GLOBUS_DUROC_SUCCESS and initialize *msg_sizep with the length of the incoming message payload and *msgp with a freshly-allocated array of globus_byte_t values containing the message payload, or return one of the following error codes: [no errors currently defined]
  • The array returned in *msgp should be freed with globus_free() when the values are no longer needed.

This routine is called by the job processes after the inter-subjob initialization operation to receive messages from other subjobs. The data is transmitted by a corresponding call to globus_duroc_runtime_inter_subjob_send at the originating subjob with a matching message tag, and messages are queued and reordered if the subjob receives messages with a different tag than the one requested by the receiving subjob process. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.
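
Taken together, the structure, send, and receive calls support exchanges like the following sketch, to be run only on the rank-zero node of each subjob (the tag string and payload are arbitrary; error-code checking is elided):

#include "globus_duroc_runtime.h"

/* to be called only on the rank-zero node of each subjob */
static void
exchange_addresses (void)
{
    int            local_addr;
    int            remote_count;
    int           *remote_addrs;
    int            i;
    int            msg_size;
    globus_byte_t *msg;

    globus_duroc_runtime_inter_subjob_structure (&local_addr,
                                                 &remote_count,
                                                 &remote_addrs);

    /* send our address to every remote subjob under a fixed tag */
    for (i = 0; i < remote_count; i++)
        globus_duroc_runtime_inter_subjob_send (
            remote_addrs[i], "addr-exchange",
            sizeof (local_addr), (globus_byte_t *) &local_addr);

    /* collect one message from each remote subjob */
    for (i = 0; i < remote_count; i++)
    {
        globus_duroc_runtime_inter_subjob_receive ("addr-exchange",
                                                   &msg_size, &msg);
        /* ... use the msg_size-byte payload in msg ... */
        globus_free (msg);
    }

    globus_free (remote_addrs);
}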


int
globus_duroc_runtime_intra_subjob_rank
(
           int       *     rankp)

Obtain the rank of the local subjob process.

  • rankp is a pointer to integer storage. If successful, *rankp is set to the rank of the local process within its subjob.

This routine is called by the job processes after the intra-subjob initialization operation to obtain the rank of the local subjob process. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.


int
globus_duroc_runtime_intra_subjob_size
(
          int     *      sizep)

Obtain the number of processes in the local subjob.

  • sizep is a pointer to integer storage. If successful, *sizep is set to the number of processes in the local subjob.

This routine is called by the job processes after the intra-subjob initialization operation to obtain the number of local subjob processes. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.


void
globus_duroc_runtime_intra_subjob_send
(
          int                             dst_rank,
          const char           *     tag,
          int                             msg_size,
          globus_byte_t       *     msg)

Send a byte-vector to another process in the DUROC subjob.

  • dst_rank is the rank of the destination process. tag is a nul-terminated string which must match that provided to the receive call on the destination process. msg_size is the number of bytes of payload to send. msg is a pointer to the payload of msg_size values of type globus_byte_t.

This routine is called by the job processes after the intra-subjob initialization operation to transmit messages between subjob processes. The data is received by a corresponding call to globus_duroc_runtime_intra_subjob_receive at the destination subjob. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.


void
globus_duroc_runtime_intra_subjob_receive
(
          const char          *      tag,
          int                     *      msg_sizep,
          globus_byte_t      *      msg)

Receive a byte-vector sent by another process in the DUROC subjob.

  • tag is a nul-terminated string which must match that provided to the send call on the originating process. msg_sizep is a pointer to integer storage. msg is a pointer to a globus_byte_t storage array of at least GRAM_MYJOB_MAX_BUFFER_LENGTH bytes.
  • Returns after initializing *msg_sizep with the length of the incoming message payload and msg[0] through msg[(*msg_sizep)-1] with the message payload.

This routine is called by the job processes after the intra-subjob initialization operation to receive messages from other subjob processes. The data is transmitted by a corresponding call to globus_duroc_runtime_intra_subjob_send at the originating process with a matching message tag, and messages are queued and reordered if the process receives messages with a different tag than the one requested by the receiving call. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.
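
For example, rank 0 of a subjob might broadcast a small token to its peer processes as in this sketch (the tag and payload are arbitrary; the header providing GRAM_MYJOB_MAX_BUFFER_LENGTH is assumed to be pulled in by the DUROC runtime header):

#include "globus_duroc_runtime.h"

/* broadcast a one-byte token from rank 0 to every other process
   in the local subjob */
static void
subjob_broadcast (void)
{
    int           rank;
    int           size;
    int           i;
    int           msg_size;
    globus_byte_t buf[GRAM_MYJOB_MAX_BUFFER_LENGTH];

    globus_duroc_runtime_intra_subjob_rank (&rank);
    globus_duroc_runtime_intra_subjob_size (&size);

    if (rank == 0)
    {
        globus_byte_t token = 42;    /* illustrative payload */

        for (i = 1; i < size; i++)
            globus_duroc_runtime_intra_subjob_send (i, "bcast",
                                                    sizeof (token),
                                                    &token);
    }
    else
    {
        /* the receive buffer must be at least
           GRAM_MYJOB_MAX_BUFFER_LENGTH bytes */
        globus_duroc_runtime_intra_subjob_receive ("bcast",
                                                   &msg_size, buf);
    }
}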

DUROC bootstrap-library

int
globus_module_activate
(
          GLOBUS_DUROC_BOOTSTRAP_MODULE)

Activate the DUROC bootstrap-library implementation prior to using any of the API functions.

  • Returns GLOBUS_SUCCESS if successful, otherwise one of: [no errors currently defined]

int
globus_module_deactivate
(
          GLOBUS_DUROC_BOOTSTRAP_MODULE)

Deactivate the DUROC bootstrap-library implementation when finished using any of the API functions.

  • Returns GLOBUS_SUCCESS if successful, otherwise one of: [no errors currently defined]

void
globus_duroc_bootstrap_subjob_exchange
(
          const char     *      local_info,
          int                *      subjob_countp,
          int                *      local_indexp,
          char          ***      subjob_info_arrayp)

Perform an exchange of information between subjobs.

  • local_info is a nul-terminated string of information to be broadcast to other subjobs. subjob_countp and local_indexp are pointers to integer storage. *subjob_info_arrayp is a pointer to string pointers. Returns after the exchange is complete, initializing *subjob_countp with the total number of subjobs, *local_indexp with the local subjob's index (0 <= *local_indexp < *subjob_countp), and *subjob_info_arrayp with an array of subjob information strings. The string in (*subjob_info_arrayp)[i] is the local information string broadcast by the ith subjob.
  • The array and strings returned in *subjob_info_arrayp should be freed with globus_free() when the values are no longer needed by the caller.

This routine is called by the job processes after the bootstrap activation operation to exchange string information between subjobs. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.
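
A typical use is to exchange per-subjob contact strings, as in the following sketch (my_contact is a caller-supplied string; error handling is elided):

#include "globus_duroc_bootstrap.h"

/* broadcast this subjob's contact string and collect the others' */
static void
exchange_contacts (const char *my_contact)
{
    int    subjob_count;
    int    local_index;
    int    i;
    char **subjob_info;

    globus_duroc_bootstrap_subjob_exchange (my_contact,
                                            &subjob_count,
                                            &local_index,
                                            &subjob_info);

    for (i = 0; i < subjob_count; i++)
    {
        /* subjob_info[i] is the string broadcast by the ith subjob;
           subjob_info[local_index] is our own */
        globus_free (subjob_info[i]);
    }
    globus_free (subjob_info);
}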


void
globus_duroc_bootstrap_master_sp_vector
(
          nexus_startpoint_t        *     local_sp,
          int                              *     job_sizep,
          nexus_startpoint_t      **     sp_vectorp)

Construct a vector of Nexus startpoints on the master node.

  • *local_sp is the startpoint to send to the master. job_sizep is a pointer to integer storage. *sp_vectorp is a pointer to Nexus startpoints. Returns after the construction is complete, initializing *sp_vectorp on one master node with an array of Nexus startpoints, and *job_sizep with the total number of processes; on all other nodes *sp_vectorp is set to NULL and *job_sizep is undefined. The startpoint (*sp_vectorp)[i] is the local startpoint provided by the ith node. The master is always node 0 (zero).
  • The array returned in *sp_vectorp should be freed with globus_free() after the values are no longer needed by the caller and after nexus_startpoint_destroy() has been called on each startpoint.

This routine is called by the job processes after the bootstrap activation operation to construct a startpoint vector on the master node. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.
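
The calling pattern might look like the following sketch (binding the local startpoint with the Nexus API is outside the scope of this reference and is only indicated in a comment):

#include "globus_duroc_bootstrap.h"

/* gather all node startpoints on the master (node 0); my_sp must
   already be bound by the caller using the Nexus API */
static void
gather_startpoints (nexus_startpoint_t *my_sp)
{
    int                 job_size;
    int                 i;
    nexus_startpoint_t *sp_vector;

    globus_duroc_bootstrap_master_sp_vector (my_sp, &job_size,
                                             &sp_vector);

    if (sp_vector != NULL)     /* non-NULL only on the master node */
    {
        /* ... use sp_vector[0] .. sp_vector[job_size-1] ... */

        for (i = 0; i < job_size; i++)
            nexus_startpoint_destroy (&sp_vector[i]);
        globus_free (sp_vector);
    }
}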


void
globus_duroc_bootstrap_ordered_master_sp_vector
(
          nexus_startpoint_t       *     local_sp,
          int                                   subjob_index,
          int                             *     job_sizep,
          nexus_startpoint_t      **    sp_vectorp)

Construct a vector of Nexus startpoints on the master node.

  • *local_sp is the startpoint to send to the master. subjob_index is the user-assigned index to position this subjob with respect to other subjobs (indices must be unique and in the range [0,N-1] for N subjobs). job_sizep is a pointer to integer storage. *sp_vectorp is a pointer to Nexus startpoints. Returns after the construction is complete, initializing *sp_vectorp on one master node with an array of Nexus startpoints, and *job_sizep with the total number of processes; on all other nodes *sp_vectorp is set to NULL and *job_sizep is undefined. The startpoint (*sp_vectorp)[i] is the local startpoint provided by the ith node. The master is always node 0 (zero) and belongs to the subjob with subjob_index of 0 (zero).
  • The array returned in *sp_vectorp should be freed with globus_free() after the values are no longer needed by the caller and after nexus_startpoint_destroy() has been called on each startpoint.

This routine is called by the job processes after the bootstrap activation operation to construct a startpoint vector on the master node. It differs from the simpler globus_duroc_bootstrap_master_sp_vector() routine in that it allows some extra control over the selection of a master node for expert users with special considerations. It is not really part of the coallocation API in that it is called by the job, rather than by the process requesting a job.

DUROC source manifest

 

The ResourceManagement/duroc directory in your Globus source tree should contain the following directories and files:

README  this document in plain ASCII format
doc/  all DUROC documentation
doc/duroc.html  this document in HTML format
src/bootstrap/  bootstrapping (communication utility) library sources
src/control/  control (coallocation API) library sources
src/misc/  miscellaneous shared sources
src/runtime/  runtime (barrier) library sources
src/test/*  test-app sources
src/tools/*  command-line tool sources
Makefile.in  build script
configure.in  build script
config.status-r.in  build script
configure  build script
aclocal.m4  build script
makefile.vars.in  build script
bootstrap/  build directory for bootstrap library
control/  build directory for control library
misc/  build directory for miscellaneous shared code
runtime/  build directory for runtime library
test-app/  build directory for test application code
tools/  build directory for command-line tools

 

For each build-directory listed above, the following files exist:

 

./Makefile.in  build script
./configure.in  build script
./configure  build script
./aclocal.m4  build script

 

Building DUROC

 Globus uses the GNU autoconf system to configure and build on any supported platform. To build DUROC you can run the following commands in your Globus build directory (see the Globus docs for more general information):

% ./configure --enable-duroc   (plus any other desired options)
% make

The optional DUROC configuration flags are:

--enable-duroc-debug
    This build option enables copious debugging messages from DUROC, and is primarily of interest when debugging the DUROC implementation. Once enabled, these messages can be conditionally suppressed at runtime.

--disable-duroc-warnings
    This build option disables messages that report when various fault-handling mechanisms are triggered in DUROC. These messages can be left enabled in the build and conditionally suppressed at runtime.

Installing DUROC

 After building DUROC as described above, you can install it on your system by running the following command in the directory where Globus was built (see the Globus docs for more general information):

% make install

The following libraries and header files are installed (globus_duroc_common.h is referenced from the other header files):

globus_duroc_common.h
libglobus_duroc_bootstrap  globus_duroc_bootstrap.h
libglobus_duroc_control  globus_duroc_control.h
libglobus_duroc_runtime  globus_duroc_runtime.h

The following executables are installed:

globus-duroc-request
globus-duroc-test-app

The test application is used for regression tests. It is not a minimal demo nor does it restrict itself to using only public library interfaces.

Using the DUROC libraries

The DUROC control and runtime libraries can be linked and used individually or in any combination within the same program. The control library provides the DUROC request API and the runtime library is used by every process initiated via DUROC. The programs duroc/src/tools/duroc-request.c and duroc/src/tools/duroc-stub-app.c serve as examples of how to use the DUROC libraries.

  • duroc-request.c uses the DUROC control library to initiate distributed jobs. duroc-stub-app.c uses the DUROC runtime library and performs a DUROC check-in and some other simple runtime operations before exiting.
  • duroc-test-app.c uses the DUROC bootstrap and runtime libraries and performs several diagnostic bootstrapping operations to test the system before exiting.

The suggested method of compiling DUROC-aware applications is to use a Makefile and insert the file "makefile_header" (by hand or via autoconf if you want to avoid non-portable makefile features) which is installed as part of Globus. This file can be found in the ${sysconfdir} of your installation (which defaults to ${prefix}/etc). This file contains environment variable definitions with values discovered during the Globus configuration process. Assuming you have done that, and your application includes "globus_duroc_runtime.h", your makefile will have the following two flavors of rules (one for compiling and one for linking):

myapp.$(OFILE): myapp.c
	$(CC) $(CFLAGS) $(GLOBUS_DUROC_RUNTIME_CFLAGS) \
	    -I$(includedir) -c myapp.c

myapp: myapp.$(OFILE)
	$(CC) $(CFLAGS) myapp.$(OFILE) -o myapp \
	    -L$(libdir) $(LDFLAGS) $(GLOBUS_DUROC_RUNTIME_LDFLAGS) \
	    $(GLOBUS_DUROC_RUNTIME_LIBS) $(LIBS)

If you are constructing a makefile to build your app as a globus component, simply replace "$(libdir)" and "$(includedir)" with "$(BUILD_DIR_LIB)" and "$(BUILD_DIR_INC)", respectively. In this case you should also use the standard Globus method of inserting makefile_header into your Makefile during the configuration process. The complete set of DUROC-related variables defined in the "makefile_header" are as follows:

  • For applications using the DUROC bootstrap library:
    GLOBUS_DUROC_BOOTSTRAP_CFLAGS, GLOBUS_DUROC_BOOTSTRAP_LDFLAGS, GLOBUS_DUROC_BOOTSTRAP_LIBS
    Applications using the bootstrap library must also link with the runtime library.
  • For applications using the DUROC control library:
    GLOBUS_DUROC_CONTROL_CFLAGS, GLOBUS_DUROC_CONTROL_LDFLAGS, GLOBUS_DUROC_CONTROL_LIBS
  • For applications using the DUROC runtime library:
    GLOBUS_DUROC_RUNTIME_CFLAGS, GLOBUS_DUROC_RUNTIME_LDFLAGS, GLOBUS_DUROC_RUNTIME_LIBS

Using the DUROC tools

Below is a summary of the tools provided with DUROC. Each tool is a minimalist wrapper around DUROC library functions.

globus-duroc-request a command-line utility to initiate jobs.
synopsis:  globus-duroc-request  [ -i ] [ -q ] [ -s ]  spec
globus-duroc-request  [ -i ] [ -q ] [ -s ]  -f spec-file

 
 spec is a DUROC resource specification string to be passed to the DUROC client API, or spec-file is the name of a file containing the specification string. The -i option enables interactive mode; otherwise automatic mode is used. The -q option enables quiet mode, suppressing advisory messages generated by the tool. The -s option enables synchronous mode (as for other schedulers), which means that stdout/stderr of the application will be directed to stdout/stderr of the request tool unless they were explicitly sent to a file; this mode implies quiet mode. Without -s, the outputs of the application default in GRAM to the destination /dev/null.
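
For example (myjob.rsl is a placeholder specification file):

% globus-duroc-request -s -f myjob.rsl
% globus-duroc-request -i "+(&(resourceManagerContact=RM1)(count=2)(executable=myprog))"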

In all modes, globus-duroc-request runs an instance of the control library and makes the DUROC request. Unless the tool is in quiet mode, it then prints the result of the request operation, including the result codes for each subjob if the overall request is successful. In automatic mode, the tool continues after submission by releasing the barrier. The tool waits for job termination while the control library performs its processing and then exits. In interactive mode, the tool continues after submission by issuing a subjob-state summary and a prompt for user commands. The command language is very simple, consisting of the following operations:

Dlabel   Delete the subjob with the given label.
K   Kill the entire job.
C   Commit the job, releasing runtime barriers.
Q   Quit the tool immediately.

All commands begin with a single command character and the label parameter begins with the next character after this command character and continues to the first new-line. All unrecognized command characters or other extraneous characters are discarded but cause the prompt to be reissued. While waiting for user input, the tool continues job processing in the background. The user must explicitly terminate the tool. In interactive mode this can be accomplished via a signal or the `Q' quit command.

You must have the appropriate ~/.globus* GSSAPI configuration files installed in your home directory, and depending on the GSSAPI library used with the GRAM client, the tool may prompt for passwords. See the GSSAPI documentation for more information.

Known bugs and limitations

 The error codes documented in the API section of this file are a subset of the actual codes returned. The globus_duroc_runtime_inter_subjob_* and globus_duroc_runtime_intra_subjob_* interfaces are not yet reentrant. The user must refrain from calling any of the routines concurrently.

The globus_duroc_control_destroy operation does not properly flush all pending communication, so it is possible for a control agent to destroy the control object and exit such that critical messages are lost; in such cases the coallocated job may hang indefinitely. In practice, this does not occur with existing agent codes that wait for job termination before exiting.