GT 4.0 WS GRAM: User's Guide

1. Introduction

GRAM services provide secure job submission to many types of job schedulers for users who have the right to access a job-hosting resource in a Grid environment. A valid proxy is required for job submission. All GRAM job submission options are supported through the embedded request document: a job is started by submitting a client-provided job description to the GRAM services. End users can make this submission with the GRAM command-line tools.

2. New Functionality in GT4

2.1. Submission ID

A submission ID may be used in the GRAM protocol for reliability in the face of message faults or other transient errors. It ensures that at most one instance of a job is executed, i.e., it prevents accidental duplication of jobs in the rare case where a client retries after a failure. By default, the globusrun-ws program generates a submission ID (a UUID); this behavior can be overridden by supplying a submission ID as a command-line argument.

If a user is unsure whether a job was submitted successfully, he should resubmit using the same ID as was used for the previous attempt.
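
For example, if an earlier submission attempt timed out and its outcome is unknown, it can be retried safely by reusing the same ID (a sketch assuming the -submission-id option of globusrun-ws; run globusrun-ws -help to confirm the exact option name on your installation):

   % globusrun-ws -submit -submission-id uuid:4a92c06c-b371-11d9-9601-0002a5ad41e5 -f myJob.xml

If the first attempt actually reached the service, the retry attaches to the existing job instead of starting a duplicate.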

2.2. Job hold and release

It is possible to specify in a job description that the job be put on hold when it reaches a chosen state (see GRAM Approach documentation for more information about the executable job state machine, and see the job description XML schema documentation for information about how to specify a held state). This is useful, for example, when a GRAM client wishes to directly access output files written by the job (as opposed to waiting for the stage-out step to transfer files from the job host). The client would request that the file cleanup process be held until released, giving the client an opportunity to fetch all remaining/buffered data after the job completes but before the output files are deleted.

globusrun-ws uses job hold and release to ensure client-side streaming of remote files in batch mode.
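
For example, a job description might request a hold in the file cleanup state as follows (a sketch assuming the holdState element; consult the job description XML schema documentation for the exact element name and the allowed state values):

<job>
    <executable>/bin/echo</executable>
    <argument>Hello World</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <holdState>CleanUp</holdState>
</job>

The job then stays held when it reaches the CleanUp state until the client releases it, giving the client time to fetch the remaining output before the files are deleted.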

2.3. MultiJobs

The new job description XML schema allows for specification of a multijob, i.e., a job that is itself composed of several executable jobs. This is useful in order to bundle a group of jobs together and submit them as a whole to a remote GRAM installation.

2.4. Job and process rendezvous

WS GRAM services implement a rendezvous mechanism to synchronize job processes in a multiprocess job and subjobs in a multijob. The job application can register binary information, for instance process or subjob information, and get notified when all the other processes or subjobs have registered their own information. This is useful for parallel jobs that need to rendezvous at a "barrier" before proceeding with computation, when no native application API is available to perform the rendezvous.

3. Changed Functionality in GT4

3.1. Independent resource keys (4.0.5+ only)

[Important]Important

This change in functionality is only available starting with GT 4.0.5.

WS GRAM enables the client to add a self-generated resource key to the input type when submitting a new job request to the ManagedJobFactoryService (MJFS). This enables the client to keep in contact with the job in case the server fails after the job was created but before the EndpointReference (EPR) of the newly created job was sent to the client.

The client is then able to create an EPR itself with the self-generated job UUID and the address of the ManagedExecutableJobService (MEJS) and then query for the state of the job.

In former versions of WS GRAM, the job UUID generated on the client side was used as the resource key of the created job resource. This has changed: starting with GT 4.0.5, WS GRAM creates its own job UUID, even if the client provides one in the input of its call to the MJFS, and returns this job key inside the EPR that is returned to the client. The MJFS keeps a mapping from the client-supplied key to its own job key, so the client can still contact the MJFS using the self-generated key (for example, to resubmit a request whose response was lost). But the client cannot contact the MEJS with that self-generated job key as part of an EPR in order to query for job state.

4. Usage scenarios

The following scenarios walk you through tasks typically performed by WS GRAM users.

4.1. Generating a valid proxy

In order to generate a valid proxy file, use the grid-proxy-init tool available under $GLOBUS_LOCATION/bin:

% bin/grid-proxy-init
Your identity: /O=Grid/OU=GlobusTest/OU=simpleCA.mymachine/OU=mymachine/CN=John Doe
Enter GRID pass phrase for this identity:
Creating proxy ................................. Done
Your proxy is valid until: Tue Oct 26 01:33:42 2004

4.2. Submitting a simple job

Use the globusrun-ws command to submit a simple job without writing a job description document. With the -c option, a job description is generated on the fly, treating the first argument as the executable and the remaining arguments as its arguments. For example:

   % globusrun-ws -submit -c /bin/touch touched_it
   Submitting job...Done.
   Job ID: uuid:4a92c06c-b371-11d9-9601-0002a5ad41e5
   Termination time: 04/23/2005 20:58 GMT
   Current job state: Active
   Current job state: CleanUp
   Current job state: Done
   Destroying job...Done.

Confirm that the job worked by verifying the file was touched:

   % ls -l ~/touched_it 
   -rw-r--r--  1 smartin globdev 0 Apr 22 15:59 /home/smartin/touched_it

   % date
   Fri Apr 22 15:59:20 CDT 2005
[Note]Note

You did not tell globusrun-ws where to run your job, so the default of localhost was used.

4.3. Submitting a job with the contact string

Use globusrun-ws to submit the same touch job, but this time specify the contact string.

   % globusrun-ws -submit -F https://lucky0.mcs.anl.gov:8443/wsrf/services/ManagedJobFactoryService -c /bin/touch touched_it
   Submitting job...Done.
   Job ID: uuid:3050ad64-b375-11d9-be11-0002a5ad41e5
   Termination time: 04/23/2005 21:26 GMT
   Current job state: Active
   Current job state: CleanUp
   Current job state: Done
   Destroying job...Done.

Try the same job to a remote host. Type

globusrun-ws -help

to learn the details about the contact string.
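
For example, the same job could be submitted to a hypothetical remote host (substitute the hostname and port of a real GT4 container):

   % globusrun-ws -submit -F https://remote.example.org:8443/wsrf/services/ManagedJobFactoryService -c /bin/touch touched_it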

4.4. Submitting a job with the job description

The user writes the specifications of a job submission to a job description XML file.

Here is an example of a simple job description:

<job>
    <executable>/bin/echo</executable>
    <argument>this is an example_string </argument>
    <argument>Globus was here</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
</job>

Tell globusrun-ws to read the job description from a file, using the -f option:

% bin/globusrun-ws -submit -f test_super_simple.xml
Submitting job...Done.
Job ID: uuid:c51fe35a-4fa3-11d9-9cfc-000874404099
Termination time: 12/17/2004 20:47 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.

Note the use of the substitution variable ${GLOBUS_USER_HOME}, which resolves to the user's home directory.

Here is an example with more job description parameters:

<?xml version="1.0" encoding="UTF-8"?>
<job>
    <executable>/bin/echo</executable>
    <directory>/tmp</directory>
    <argument>12</argument>
    <argument>abc</argument>
    <argument>34</argument>
    <argument>this is an example_string </argument>
    <argument>Globus was here</argument>
    <environment>
        <name>PI</name>
        <value>3.141</value>
    </environment>
    <stdin>/dev/null</stdin>
    <stdout>stdout</stdout>
    <stderr>stderr</stderr>
    <count>2</count>
</job>

Note that in this example,

  • A <directory> element specifies that the command will be executed in the /tmp directory on the execution machine.

  • An <stdout> element specifies the standard output as the relative path stdout.

The output is therefore written to /tmp/stdout:

% cat /tmp/stdout
12 abc 34 this is an example_string  Globus was here

4.5. Delegating credentials

There are three different uses of delegated credentials:

  1. for use by the MEJS to create a remote user proxy,

  2. for use by the MEJS to contact RFT, and

  3. for use by RFT to contact the GridFTP servers.

The EPRs to each of these are specified in three job description elements--they are:

  • jobCredentialEndpoint

  • stagingCredentialEndpoint

  • transferCredentialEndpoint

respectively. Please see the job description schema and RFT transfer request schema documentation for more details about these elements.

The globusrun-ws command can either delegate these credentials automatically for a particular job or reuse pre-delegated credentials (see next paragraph) through the use of command-line arguments for specifying the credentials' EPR files. Please see the globusrun-ws documentation for details on these command-line arguments.

It is possible to use delegation command-line clients to obtain and refresh delegated credentials in order to use them when submitting jobs to WS GRAM. This enables the submission of many jobs using a shared set of delegated credentials. This can significantly decrease the number of remote calls for a set of jobs, thus improving performance.
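
As a sketch of that workflow (the -Jf and -Sf options shown below for supplying job and staging credential EPR files are assumptions; consult the globusrun-ws and delegation client documentation for the exact option names), a credential could be delegated once and then reused across several submissions:

   % globus-credential-delegate [options] myCredential.epr
   % globusrun-ws -submit -Jf myCredential.epr -Sf myCredential.epr -f myJob.xml

Here myCredential.epr is a file holding the EPR of the delegated credential resource, reused as both the job credential and the staging credential.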

4.6. Finding which schedulers are interfaced by the WS GRAM installation

Unfortunately there is no option yet to print the list of local resource managers supported by a given WS-GRAM service installation. But there is a way to check whether or not WS-GRAM supports a certain local resource manager. The following command gives an example of how a client could find out if Condor is available at the remote site:

wsrf-query \
    -s https://<hostname>:<port>/wsrf/services/ManagedJobFactoryService \
    -key "{http://www.globus.org/namespaces/2004/10/gram/job}ResourceID" Condor \
    "//*[local-name()='version']"
   

Replace host and port settings with the values you need. If Condor is available on the server-side, the output should look something like the following:

  <ns1:version xmlns:ns1="http://mds.globus.org/metadata/2005/02">4.0.3</ns1:version>
   

In this example, the output indicates that a GT4 container is listening on the server side, that Condor is available, and that the GT version is 4.0.3. If no GT container is running at the specified host and/or port, or if the specified local resource manager is not available on the server side, the output will be an error message.

On the server-side, the GRAM name of local resource managers for which GRAM support has been installed can be obtained by looking at the GRAM configuration on the GRAM server-side machine, as explained here.

The GRAM name of the local resource manager can be used with the factory type option of the job submission command-line tool to specify which factory resource to use when submitting a job.
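
For example, assuming PBS support is installed on the target host, a job could be submitted to the PBS factory resource as follows (a sketch using the -Ft factory type option of globusrun-ws):

   % globusrun-ws -submit -F https://<hostname>:<port>/wsrf/services/ManagedJobFactoryService -Ft PBS -c /bin/hostname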

4.7. Specifying file staging in the job description

In order to do file staging, one must add specific elements to the job description and delegate credentials appropriately (see Delegating credentials). The file transfer directives follow the RFT syntax, which allows only for third-party transfers. Each file transfer must therefore specify a source URL and a destination URL. URLs are specified as GridFTP URLs (for remote files) or as file URLs (for files local to the service--these are converted internally to full GridFTP URLs by the service).

For instance, in the case of staging a file in, the source URL would be a GridFTP URL (for example, gsiftp://job.submitting.host:2811/tmp/mySourceFile) resolving to a source document accessible on the file system of the job submission machine (for instance /tmp/mySourceFile). At run-time, the Reliable File Transfer service used by the MEJS on the remote machine would reliably fetch the remote file using the GridFTP protocol and write it to the specified local file (for example, file:///${GLOBUS_USER_HOME}/my_transfered_file, which resolves to ~/my_transfered_file). Here is how the stage-in directive would look:

  <fileStageIn>
      <transfer>
          <sourceUrl>gsiftp://job.submitting.host:2811/tmp/mySourceFile</sourceUrl>
          <destinationUrl>file:///${GLOBUS_USER_HOME}/my_transfered_file</destinationUrl>
      </transfer>
  </fileStageIn>

[Note]Note

Additional RFT-defined quality of service requirements may be specified for each transfer. See the RFT documentation for more information.

Here is an example job description with file stage-in and stage-out:

<job>
    <executable>my_echo</executable>
    <directory>${GLOBUS_USER_HOME}</directory>
    <argument>Hello</argument>
    <argument>World!</argument>
    <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
    <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
    <fileStageIn>
        <transfer>
            <sourceUrl>gsiftp://job.submitting.host:2811/bin/echo</sourceUrl>
            <destinationUrl>file:///${GLOBUS_USER_HOME}/my_echo</destinationUrl>
        </transfer>
    </fileStageIn>
    <fileStageOut>
        <transfer>
            <sourceUrl>file:///${GLOBUS_USER_HOME}/stdout</sourceUrl>
            <destinationUrl>gsiftp://job.submitting.host:2811/tmp/stdout</destinationUrl>
        </transfer>
    </fileStageOut>
    <fileCleanUp>
        <deletion>
            <file>file:///${GLOBUS_USER_HOME}/my_echo</file>
        </deletion>
    </fileCleanUp>
</job>

Note that the job description XML does not need to include a reference to the schema that describes its syntax; the namespace may even be omitted from the GRAM job description elements. Submitting this job to the GRAM services causes the following sequence of actions:

  1. The /bin/echo executable is transferred from the submission machine to the GRAM host file system. The destination location is the HOME directory of the user on behalf of whom the GRAM services execute the job (see <fileStageIn>).
  2. The transferred executable is used to print a test string (see <executable>, <directory> and the <argument> elements) on the standard output, which is redirected to a local file (see <stdout>).
  3. The standard output file is transferred to the submission machine (see <fileStageOut>).
  4. The file that was initially transferred during the stage-in phase is removed from the file system of the GRAM installation (see <fileCleanUp>).

4.8. Specifying and submitting a multijob

The job description XML schema allows for specification of a multijob i.e. a job that is itself composed of several executable jobs, which we will refer to as subjobs.

[Note]Note

Subjobs cannot be multijobs, so the structure is not recursive.

This is useful, for example, if you want to bundle a group of jobs together and submit them as a whole to a remote GRAM installation.

[Note]Note

No relationship can be specified between the subjobs of a multijob. The subjobs are submitted to job factory services in order of appearance in the multijob description.

Within a multijob description, each subjob description must include an endpoint to which the factory submits the subjob. This enables the at-once submission of several jobs to different hosts. The factory to which the multijob is submitted acts as an intermediary tier between the client and the eventual executable job factories.

Here is an example of a multijob description:

<?xml version="1.0" encoding="UTF-8"?>
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
          xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
    <factoryEndpoint>
        <wsa:Address>
            https://localhost:8443/wsrf/services/ManagedJobFactoryService
        </wsa:Address>
        <wsa:ReferenceProperties>
            <gram:ResourceID>Multi</gram:ResourceID>
        </wsa:ReferenceProperties>
    </factoryEndpoint>
    <directory>${GLOBUS_LOCATION}</directory>
    <count>1</count>

    <job>
        <factoryEndpoint>
            <wsa:Address>https://localhost:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>Fork</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/bin/date</executable>
        <stdout>${GLOBUS_USER_HOME}/stdout.p1</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr.p1</stderr>
        <count>2</count>
    </job>

    <job>
        <factoryEndpoint>
            <wsa:Address>https://localhost:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>Fork</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/bin/echo</executable>
        <argument>Hello World!</argument>
        <stdout>${GLOBUS_USER_HOME}/stdout.p2</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr.p2</stderr>
        <count>1</count>
    </job>
</multiJob>

Notes:

  • The <ResourceID> element within the <factoryEndpoint> WS-Addressing endpoint structures must be qualified with the appropriate GRAM namespace.
  • Apart from the <factoryEndpoint> element, all elements at the enclosing multijob level act as defaults for the subjob parameters, in this example <directory> and <count>.
  • The default <count> value is overridden in the subjob descriptions.

In order to submit a multijob description, use a job submission command-line tool and specify the Managed Job Factory resource to be Multi. For example, submitting the multijob description above using globusrun-ws, we obtain:

      % bin/globusrun-ws -submit -f test_multi.xml
      Delegating user credentials...Done.
      Submitting job...Done.
      Job ID: uuid:bd9cd634-4fc0-11d9-9ee1-000874404099
      Termination time: 12/18/2004 00:15 GMT
      Current job state: Active
      Current job state: CleanUp
      Current job state: Done
      Destroying job...Done.
      Cleaning up any delegated credentials...Done.
    

A multijob resource is created by the factory and exposes a set of WSRF resource properties different than the resource properties of an executable job. The state machine of a multijob is also different since the multijob represents the overall execution of all the executable jobs of which it is composed.

4.9. Lifetime of jobs

Jobs submitted to WS-GRAM have a lifetime. If the lifetime of a ManagedJob resource expires, the job is destroyed after cleanup steps have been performed, and the job's persistence data is removed.

For executable jobs the user-relevant steps in cleanup are:

  • Cancellation of the job at the local resource manager if it's still running.
  • Performing fileCleanUp if specified in the job description and the job did not already pass this step.

If a multijob expires, all of its subjobs are destroyed.

The C client globusrun-ws and the Java API GramJob (for developers) set the lifetime to 24 hours by default. A user who wants a job to have a longer lifetime must specify it explicitly.

4.9.1. Specifying a lifetime in submission

Using the C client globusrun-ws, the lifetime of a job can be set in two ways at submission time. The first example shows how to set a relative lifetime, i.e., the job will expire 48 hours from now:

globusrun-ws -submit -term "+48:00" -b -o myJob.epr -f myJob.xml

The second example shows how to set an absolute lifetime. The job will expire at the given date:

globusrun-ws -submit -term "10/23/2008 12:00" -b -o myJob.epr -f myJob.xml

In both examples the job is submitted in batch mode, which makes sense for longer-running jobs.

[Note]Note

The short form +HH:MM for specifying the lifetime should not be used with globusrun-ws before GT version 4.0.7. See here for more information.

4.9.2. Setting a lifetime on an existing job

The lifetime of a job can also be changed after submission. The following example shows how to set a new termination time for a job resource, assuming that the Endpoint Reference (EPR) of the job is stored in the file myJob.epr. The new lifetime is given in seconds (604800 in this example, i.e., one week):

[martin@osg-test1 ~]$ wsrf-set-termination-time -e myJob.epr 604800

The output could be something like this:

requested: Tue May 13 09:27:15 CDT 2008
scheduled: Tue May 13 09:27:15 CDT 2008

4.10. Specifying substitution variables in a job description

Job description variables are special strings in a job description that are replaced by the GRAM service with values that the client-side does not a priori know. Job description variables can be used in any path-like string or URL specified in the job description.

An example of a variable is ${GLOBUS_USER_HOME}, which represents the path to the HOME directory on the file system where the job is executed. The set of variables is fixed in the GRAM service implementation. This is different from previous implementations of RSL substitutions in GT2 and GT3, where a user could define a new variable for use inside a job description document. This was done to preserve the simplicity of the job description XML schema (relative to the GT3.2 RSL schema), which does not require a specialized XML parser to serialize a job description document.

Details of the RSL variables are in the job description doc and the substitution variable section of the admin guide.

[Important]If you are using 4.0.5+:

Beginning with version 4.0.5, additional variables can be defined on the server side for use in the job description.

Currently, users cannot get information from WS GRAM about whether or not additional variables are defined on the server side and, if so, what their names and values are. For now, this information must be published by the provider.

4.11. Specifying a self-generated resource key during job submission

WS GRAM enables a client to add a self-generated resource key to the input type when submitting a new job request to the ManagedJobFactoryService (MJFS). The client should make sure to provide a universally unique identifier (UUID) as the job resource key. For information about UUIDs, please read here.

Providing its own UUID enables a client to resubmit a job in case the server did not respond to a prior job submission request (due to network failures, for example). If the client submits a job with an already existing resource key a second time, the job will not be started again because it is already running. This avoids unnecessary and undesired resource usage and enables reliable job submission.

[Important]If you are using 4.0.5+:

Beginning with version 4.0.5, WS GRAM creates its own job UUID, even if the client provides one in the input of its call to the MJFS, and returns this job UUID inside the endpoint reference (EPR) to the client. The client can still contact the ManagedJobFactoryService (MJFS) with the self-generated job resource key in order to resubmit a job whose earlier submission may or may not have been lost. But the client can no longer contact the ManagedExecutableJobService (MEJS) with that self-generated job key as part of an EPR in order to query for job state.

If it is unclear whether a job request has been started by the server, the client must submit the job with the same job UUID again in order to get an EPR from the MJFS. The client can then query for job state or destroy the job.

4.12. Specifying and handling custom job description extensions (4.0.5+, update pkg available)

[Important]Important

This feature has been added as of GT 4.0.5. For versions older than 4.0.5, an update package is available to upgrade your installation. See the GT Development Downloads page for the latest links.

Basic support is provided for specifying custom extensions to the job description. There are plans to improve the usability of this feature, but at this time it involves a bit of work.

Specifying the actual custom elements in the job description is trivial. Simply add any elements that you need between the beginning and ending extensions tags at the bottom of the job description as in the following basic example:

<job>
    <executable>/home/user1/myapp</executable>
    <extensions>
        <myData>
            <var1>hello</var1>
            <var2>world</var2>
        </myData>
    </extensions>
</job>

To handle this data, you must alter the appropriate Perl scheduler script (e.g. fork.pm for the Fork scheduler) to parse the data returned by the $description->extensions() subroutine.

More information about job description extension support can be found here.

4.13. Specifying SoftEnv keys in the job description (4.0.5+ only)

[Note]Note

This feature is only available beginning with version 4.0.5 of the toolkit.

For a short introduction to SoftEnv please have a look at the SoftEnv chapter.

If SoftEnv is enabled on the server side, nothing needs to be added to the job description: the environment specified in the .soft file in the user's remote home directory is set up before the job is submitted to the scheduler.

To use a software environment other than the one specified in the remote .soft file, the user must provide SoftEnv parameters in the extensions element of the job description.

The schema of the extension element for software selection in the job description is as follows:

<element name="softenv" type="xsd:string">

For example, to add the SoftEnv commands @teragrid-basic, +intel-compilers, +atlas, and +tgcp to the job process' environment, the user would specify the following <extensions> element in the job description:

<extensions>
  <softenv>@teragrid-basic</softenv>
  <softenv>+intel-compilers</softenv>
  <softenv>+atlas</softenv>
  <softenv>+tgcp</softenv>
</extensions>

So far there is no way for a user to learn from the remote service itself whether or not SoftEnv support is enabled. Currently, the only way to check this is to submit a job with /bin/env as the executable and watch the results, as sketched below.
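
For example (a sketch assuming the -s streaming option of globusrun-ws; adjust host and port as needed), the environment the job actually sees can be inspected directly:

   % globusrun-ws -submit -s -F https://<hostname>:<port>/wsrf/services/ManagedJobFactoryService -c /bin/env

If SoftEnv support is enabled, the streamed output will include the variables set up from the SoftEnv configuration.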

The following describes what happens in various scenarios when SoftEnv is disabled or enabled on the server side:

  • User provides no SoftEnv extensions:

    Disabled on server side: No SoftEnv environment is configured before job submission, even if the user has a .soft file in their remote home directory.

    Enabled on server side: If the user has a .soft file (and no .nosoft file) in their remote home directory, then the environment defined in the .soft file will be configured before job submission. If the user has a .nosoft file in their remote home directory, no environment will be prepared.

  • User provides valid SoftEnv extensions:

    Disabled on server side: If SoftEnv is not installed on the server, then no environment will be configured. If SoftEnv is installed, the environment the user specifies in the <extensions> elements overwrites any SoftEnv configuration the user specifies in a .soft or .nosoft file in their remote home directory. The environment will be configured as specified by the user in the <extensions> elements before job submission.

    Enabled on server side: The specified environment overwrites any SoftEnv configuration the user specifies in a .soft or a .nosoft file in their remote home directory. The environment will be configured as specified by the user in the <extensions> elements before job submission.

  • User provides invalid SoftEnv extensions:

    Disabled on server side: If SoftEnv is not installed on the server, then no environment will be configured. If SoftEnv is installed, the environment the user specifies in the <extensions> elements overwrites any SoftEnv configuration the user specifies in a .soft or a .nosoft file in their remote home directory. Only the valid keys in the SoftEnv <extensions> elements will be configured. If no valid key is found, no environment will be configured. SoftEnv warnings are logged to the stdout of the job.

    Enabled on server side: The specified environment overwrites any SoftEnv configuration the user specifies in a .soft or a .nosoft file in their remote home directory. Only the valid keys in the SoftEnv <extensions> elements will be configured. If no valid key is found, no environment will be configured. SoftEnv warnings are logged to the stdout of the job.

In general, jobs do not fail if they have SoftEnv extensions in their description and SoftEnv is disabled (or not even installed) on the server side. But they will fail if they rely on environments being set up before job submission.

 
[Note]Note

In the current implementation, it is not possible to directly call executables whose paths are defined by SoftEnv without specifying the complete path to the executable.

For example, if a database query must be executed using the mysql command and mysql is not in the default path, then the direct use of mysql as an executable in the job description document will fail, even if SoftEnv is configured. Instead, the mysql command must be wrapped in a script that is in the default path.

Thus a job submission with the following job description document will fail:

<job>
  ...
  <executable>mysql</executable>
  ...
</job>
  

But when the command is embedded inside a shell script which is specified as the executable in the job description document, it will work:

#!/bin/sh
  ...
  mysql ...
    ...
  
[Note]Note

The use of invalid SoftEnv keys in the extension part of the job description document does not generate errors.

5. Command-line tools

Please see the GT 4.0 WS GRAM Command-line Reference.

6. Submitting MPI Jobs

This document from DGrid describes how to submit MPI batch jobs to compute clusters using GRAM4.

7. Graphical user interfaces

There is no support for this type of interface for WS GRAM.

8. Troubleshooting

When I submit a streaming or staging job, I get the following error: ERROR service.TransferWork Terminal transfer error: [Caused by: Authentication failed[Caused by: Operation unauthorized(Mechanism level: Authorization failed. Expected"/CN=host/localhost.localdomain" target but received "/O=Grid/OU=GlobusTest/OU=simpleCA-my.machine.com/CN=host/my.machine.com")

  • Check $GLOBUS_LOCATION/etc/gram-service/globus_gram_fs_map_config.xml to see if it uses localhost or 127.0.0.1 instead of the public hostname (in the example above, my.machine.com). Change these uses of the loopback hostname or IP to the public hostname as necessary.

Fork jobs work fine, but submitting PBS jobs with globusrun-ws hangs at "Current job state: Unsubmitted"

  1. Make sure the log_path in $GLOBUS_LOCATION/etc/globus-pbs.conf points to locally accessible scheduler logs that are readable by the user running the container. The Scheduler Event Generator (SEG) will not work without local scheduler logs to monitor. This can also apply to other resource managers, but is most commonly seen with PBS.

  2. If the SEG configuration looks sane, try running the SEG tests. They are located in $GLOBUS_LOCATION/test/globus_scheduler_event_generator_*_test/. If Fork jobs work, you only need to run the PBS test. Run each test by going to the associated directory and run ./TESTS.pl. If any tests fail, report this to the gram-dev@globus.org mailing list.

  3. If the SEG tests succeed, the next step is to figure out the ID assigned by PBS to the queued job. Enable GRAM debug logging by uncommenting the appropriate line in the $GLOBUS_LOCATION/container-log4j.properties configuration file. Restart the container, run a PBS job, and search the container log for a line that contains "Received local job ID" to obtain the local job ID.

  4. Once you have the local job ID, you can find out if the PBS status is being logged by checking the latest PBS logs pointed to by the value of "log_path" in $GLOBUS_LOCATION/etc/globus-pbs.conf.

    If the status is not being logged, check the documentation for your flavor of PBS to see if there's any further configuration that needs to be done to enable job status logging. For example, PBS Pro requires a sufficient -e <bitmask> option added to the pbs_server command line to enable enough logging to satisfy the SEG.

  5. If the correct status is being logged, try running the SEG manually to see if it is reading the log file properly. The general form of the SEG command line is as follows:

        $GLOBUS_LOCATION/libexec/globus-scheduler-event-generator -s pbs -t <timestamp>
        

    The timestamp is in seconds since the epoch and dictates how far back in the log history the SEG should scan for job status events. The command should hang after dumping some status data to stdout.

    If no data appears, change the timestamp to an earlier time.

    If nothing ever appears, report this to the gram-user@globus.org mailing list.

  6. If running the SEG manually succeeds, try running another job and make sure the job process actually finishes and PBS has logged the correct status before giving up and cancelling globusrun-ws. If things are still not working, report your problem and exactly what you have tried to remedy the situation to the gram-user@globus.org mailing list.

The job manager detected an invalid script response

  • Check for a restrictive umask. When the service writes the native scheduler job description to a file, an overly restrictive umask will cause the permissions on the file to be such that the submission script run through sudo as the user cannot read the file (bug #2655).

When restarting the container, I get the following error: Error getting delegation resource

  • Most likely this is simply a case of the delegated credential expiring. Either refresh it for the affected job or destroy the job resource. For more information, see delegation command-line clients.

The user's home directory has not been determined correctly

  • This occurs when the administrator changed the location of the users' home directories and did not restart the GT4 container afterwards. Beginning with version 4.0.3, WS-GRAM determines a user's home directory only once in the lifetime of a container (when the user submits the first job). Subsequently submitted jobs will use the cached home directory during job execution.

9. Usage statistics collection by the Globus Alliance

The following usage statistics are sent by default in a UDP packet (in addition to the GRAM component code, packet version, timestamp, and source IP address) at the end of each job (i.e. when Done or Failed state is entered).

  • job creation timestamp (helps determine the rate at which jobs are submitted)
  • scheduler type (Fork, PBS, LSF, Condor, etc...)
  • jobCredentialEndpoint present in RSL flag (to determine if server-side user proxies are being used)
  • fileStageIn present in RSL flag (to determine if the staging in of files is used)
  • fileStageOut present in RSL flag (to determine if the staging out of files is used)
  • fileCleanUp present in RSL flag (to determine if the cleaning up of files is used)
  • CleanUp-Hold requested flag (to determine if streaming is being used)
  • job type (Single, Multiple, MPI, or Condor)
  • gt2 error code if job failed (to determine common scheduler script errors users experience)
  • fault class name if job failed (to determine general classes of common faults users experience)

If you wish to disable this feature, please see the Java WS Core System Administrator's Guide section on Usage Statistics Configuration for instructions.

Also, please see our policy statement on the collection of usage statistics.