The following documents are good places to start with Globus Toolkit 5.0.0:
For system administrators and those installing GT:
For End Users:
GT 5.0.0 release manuals are divided into the following areas:
Data management tools are concerned with the location, transfer, and management of distributed data. GT5 provides various basic tools, including GridFTP for high-performance and reliable data transport, and RLS for maintaining location information for replicated files.
Execution management tools are concerned with the initiation, monitoring, management, scheduling, and/or coordination of remote computations. GT5 supports the Grid Resource Allocation and Management (GRAM5) interface as a basic mechanism for these purposes.
These components establish the identity of users or services (authentication), protect communications, and determine who is allowed to perform what actions (authorization), as well as manage user credentials.
These components include a set of C Common libraries needed for building grid infrastructure and XIO, a sophisticated and extensible I/O library suitable for the dynamic needs of grid applications.
- batch scheduler
See the definition for scheduler
- Bloom filter
Compression scheme used by the Replica Location Service (RLS) that is intended to reduce the size of soft state updates between Local Replica Catalogs (LRCs) and Replica Location Index (RLI) servers. A Bloom filter is a bit map that summarizes the contents of a Local Replica Catalog (LRC). An LRC constructs the bit map by applying a series of hash functions to each logical name registered in the LRC and setting the corresponding bits.
- Certificate Authority ( CA )
An entity that issues certificates.
- CA Certificate
The CA's certificate. This certificate is used to verify signature on certificates issued by the CA. GSI typically stores a given CA certificate in
/etc/grid-security/certificates/, where <hash> is the hash code of the CA identity.
- CA Signing Policy
The CA signing policy is used to place constraints on the information you trust a given CA to bind to public keys. Specifically it constrains the identities a CA is trusted to assert in a certificate. In GSI the signing policy for a given CA can typically be found in
/etc/grid-security/certificates/, where <hash> is the hash code of the CA identity.
A public key plus information about the certificate owner bound together by the digital signature of a CA. In the case of a CA certificate, the certificate is self signed, i.e. it was signed using its own private key.
- Certificate Revocation List (CRL)
A list of revoked certificates generated by the CA that originally issued them. When using GSI, this list is typically found in
/etc/grid-security/certificates/, where <hash> is the hash code of the CA identity.
- certificate subject
An identifier for the certificate owner, e.g. "
/DC=org/DC=doegrids/OU=People/CN=John Doe 123456". The subject is part of the information the CA binds to a public key when creating a certificate.
A process that sends commands and receives responses. Note that in GridFTP, the client may or may not take part in the actual movement of data.
- client/server transfer
In a client/server transfer, there are only two entities involved in the transfer, the client entity and the server entity. We use the term entity here rather than process because in the implementation provided in GT5, the server entity may actually run as two or more separate processes.
The client will either move data from or to his local host. The client will decide whether or not he wishes to connect to the server to establish the data channel or the server should connect to him (MODE E dictates who must connect).
If the client wishes to connect to the server, he will send the PASV (passive) command. The server will start listening on an ephemeral (random, non-privileged) port and will return the IP and port as a response to the command. The client will then connect to that IP/Port.
If the client wishes to have the server connect to him, the client would start listening on an ephemeral port, and would then send the PORT command which includes the IP/Port as part of the command to the server and the server would initiate the TCP connect. Note that this decision has an impact on traversing firewalls. For instance, the client's host may be behind a firewall and the server may not be able to connect.
Finally, now that the data channel is established, the client will send either the RETR “filename” command to transfer a file from the server to the client (GET), or the STOR “filename” command to transfer a file from the client to the server (PUT).
- command line interface (CLI)
A mechanism for interacting with a computer operating system or software by typing commands to run programs, as opposed to using a mouse pointer on a graphical user interface (GUI).
Both FTP and GridFTP are command/response protocols. What this means is that once a client sends a command to the server, it can only accept responses from the server until it receives a response indicating that the server is finished with that command. For most commands this is not a big deal. For instance, setting the type of the file transfer to binary (called "I" for "image in the protocol"), simply consists of the client sending TYPE I and the server responding with 220 OK. Type set to I. However, the SEND and RETR commands (which actually initiate the movement of data) can run for a long time. Once the command is sent, the client’s only options are to wait until it receives the completion reply, or kill the transfer.
When speaking of GridFTP transfers, concurrency refers to having multiple files in transit at the same time. They may all be on the same host or across multiple hosts. This is equivalent to starting up “n” different clients for “n” different files, and having them all running at the same time. This can be effective if you have many small files to move. The Reliable File Transfer (RFT) service utilizes concurrency to improve its performance.
A job scheduler mechanism supported by GRAM. See http://www.cs.wisc.edu/condor/ for more information.
- control channel
The Communication link (TCP) over which commands and responses flow.
Low bandwidth; encrypted and integrity protected by default.
The combination of a certificate and the matching private key.
- Concurrent Version System (CVS)
Source code repository used by the Globus Toolkit.
- data channel
Communication link(s) over which the actual data of interest flows.
High Bandwidth; authenticated by default; encryption and integrity protection optional.
- dual channel protocol
GridFTP uses two channels:
- One of the channels, called the control channel, is used for sending commands and responses. It is low bandwidth and is encrypted for security reasons.
- The second channel is known as the data channel. Its sole purpose is to transfer the data. It is high bandwidth and uses an efficient protocol.
By default, the data channel is authenticated at connection time, but no integrity checking or encryption is performed due to performance reasons. Integrity checking and encryption are both available via the client and libraries.
Note that in GridFTP (not FTP) the data channel may actually consist of several TCP streams from multiple hosts.
- End Entity Certificate (EEC)
A certificate belonging to a non-CA entity, e.g. you, me or the computer on your desk.
- extended block mode (MODE E)
MODE E is a critical GridFTP components because it allows for out of order reception of data. This in turn, means we can send the data down multiple paths and do not need to worry if one of the paths is slower than the others and the data arrives out of order. This enables parallelism and striping within GridFTP. In MODE E, a series of “blocks” are sent over the data channel. Each block consists of:
- an 8 bit flag field,
- a 64 bit field indicating the offset in the transfer,
- and a 64 bit field indicating the length of the payload,
- followed by length bytes of payload.
Note that since the offset and length are included in the block, out of order reception is possible, as long as the receiving side can handle it, either via something like a seek on a file, or via some application level buffering and ordering logic that will wait for the out of order blocks.
- GAA configuration file
A file that configures the Generic Authorization and Access control GAA libraries. When using GSI, this file is typically found in
- Grid Resource Allocation and Management (GRAM)
This component is used to locate, submit, monitor, and cancel jobs on Grid computing resources.
- grid map file
A file containing entries mapping certificate subjects to local user names. This file can also serve as a access control list for GSI enabled services and is typically found in
/etc/grid-security/grid-mapfile. For more information see the Gridmap section here.
- grid security directory
The directory containing GSI configuration files such as the GSI authorization callout configuration and GAA configuration files. Typically this directory is
/etc/grid-security. For more information see this.
- Grid Security Infrastructure (GSI)
GSI stands for Grid Security Infrastructure and is used to describe the original infrastructure of GT security, which is comprised of SSL, PKI and proxy certificates.
- GSI authorization callout configuration file
A file that configures authorization callouts to be used for mapping and authorization in GSI enabled services. When using GSI this file is typically found in
- host certificate
An EEC belonging to a host. When using GSI this certificate is typically stored in
/etc/grid-security/hostcert.pem. For more information on possible host certificate locations see the GSI C Developer's Guide.
- host credentials
The combination of a host certificate and its corresponding private key.
- job description
Term used to describe a GRAM5 job for GT5.
- job scheduler
See the term scheduler.
Libtool is a GNU library support script that abstracts shared library interface. Used by GSI/Sysconfig. Libtool hides the complexity of using shared libraries behind a portable interface.
For more information, see http://www.gnu.org/software/libtool.
See also GSI.
- Local Replica Catalog (LRC)
Stores mappings between logical names for data items and the target names (often the physical locations) of replicas of those items. Clients query the LRC to discover replicas associated with a logical name. Also may associate attributes with logical or target names. Each LRC periodically sends information about its logical name mappings to one or more RLIs.
See also RLI.
- logical file name
A unique identifier for the contents of a file.
- logical name
A unique identifier for the contents of a data item.
A job scheduler mechanism supported by GRAM.
For more information, see http://www.platform.com/Products/Platform.LSF.Family/Platform.LSF/.
A program that typically operates at a higher level than a job scheduler (typically, above the GRAM level). It schedules and submits jobs to GRAM services.
- Message Passing Interface (MPI)
The Message Passing Interface (MPI) is a library specification for message-passing, proposed as a standard by a broadly based committee of vendors, implementors, and users.
For more information, see http://www-unix.mcs.anl.gov/mpi/.
- MODE command
In reality, GridFTP is not one protocol, but a collection of several protocols. There is a protocol used on the control channel, but there is a range of protocols available for use on the data channel. Which protocol is used is selected by the MODE command. Four modes are defined: STREAM (S), BLOCK (B), COMPRESSED (C) in RFC 959 for FTP, and EXTENDED BLOCK (E) in GFD.020 for GridFTP. There is also a new data channel protocol, or mode, being defined in the GGF GridFTP Working group which, for lack of a better name at this point, is called MODE X.
See also extended block mode (MODE E).
See also stream mode (MODE S).
- network end points
A network endpoint is generally something that has an IP address (a network interface card). It is a point of access to the network for transmission or reception of data. Note that a single host could have multiple network end points if it has multiple NICs installed (multi-homed). This definition is necessary to differentiate between parallelism and striping.
- Network File System (NFS)
The Network File System (NFS) provides remote access to shared file systems across networks.
For more information, see http://www.uwsg.iu.edu/usail/network/nfs/overview.html.
When speaking about GridFTP transfers, parallelism refers to having multiple TCP connections between a single pair of network endpoints. This is used to improve performance of transfers on connections with light to moderate packet loss.
- Portable Batch System (PBS)
A job scheduler mechanism supported by GRAM. For more information, see http://www.openpbs.org.
- physical file name
The address or the location of a copy of a file on a storage system.
- private key
The private part of a key pair. Depending on the type of certificate the key corresponds to it may typically be found in
$HOME/.globus/userkey.pem(for user certificates),
/etc/grid-security/hostkey.pem(for host certificates) or
/etc/grid-security/(for service certificates).
For more information on possible private key locations see this.
- proxy certificate
A short lived certificate issued using a EEC. A proxy certificate typically has the same effective subject as the EEC that issued it and can thus be used in its place. GSI uses proxy certificates for single sign on and delegation of rights to other entities.
For more information about types of proxy certificates and their compatibility in different versions of GT, see http://dev.globus.org/wiki/Security/ProxyCertTypes.
- proxy credentials
The combination of a proxy certificate and its corresponding private key. GSI typically stores proxy credentials in
/tmp/x509up_u, where <uid> is the user id of the proxy owner.
- public key
The public part of a key pair used for cryptographic operations (e.g. signing, encrypting).
- Replica Location Index (RLI)
Collects information about the logical name mappings stored in one or more Local Replica Catalogs (LRCs) and answers queries about those mappings. Each RLI periodically receives updates from one or more LRCs that summarize their contents.
- Replica Location Service (RLS)
A distributed registry that keeps track of where replicas exist on physical storage systems. The job of the RLS is to maintain associations, or mappings, between logical names for data objects and one or more target or physical names for replicas. Users or services register data items in the RLS and query RLS servers to find replicas.
- Resource Specification Language (RSL)
Term used to describe a GRAM job for GT2 and GT3. (Note: This is not the same as RLS - the Replica Location Service)
- RLS attribute
Descriptive information that may be associated with a logical or target name mapping registered in a Local Replica Catalog (LRC). Clients can query the LRC to discover logical names or target names that have specified RLS attributes.
Utilized by GSI. Open Source Simple Authentication and Security Layer in the C Language. For more information, see http://asg.web.cmu.edu/sasl.
Term used to describe a job scheduler mechanism to which GRAM interfaces. It is a networked system for submitting, controlling, and monitoring the workload of batch jobs in one or more computers. The jobs or tasks are scheduled for execution at a time chosen by the subsystem according to an available policy and availability of resources. Popular job schedulers include Portable Batch System (PBS), Platform LSF, and IBM LoadLeveler.
- scheduler adapter
The interface used by GRAM to communicate/interact with a job scheduler mechanism. In GT 4.x, this is both the perl submission scripts and the SEG program.
- Scheduler Event Generator (SEG)
The Scheduler Event Generator (SEG) is a program which uses scheduler-specific monitoring modules to generate job state change events. Depending on scheduler-specific requirements, the SEG may need to run with privileges to enable it to obtain scheduler event notifications. As such, one SEG runs per scheduler resource. For example, on a host which provides access to both PBS and fork jobs, two SEGs, running at (potentially) different privilege levels will be running. One SEG instance exists for any particular scheduled resource instance (one for all homogeneous PBS queues, one for all fork jobs, etc). The SEG is implemented in an executable called the globus-scheduler-event-generator, located in the Globus Toolkit's libexec directory.
A process that receives commands and sends responses to those commands. Since it is a server or service, and it receives commands, it must be listening on a port somewhere to receive the commands. Both FTP and GridFTP have IANA registered ports. For FTP it is port 21, for GridFTP it is port 2811. This is normally handled via inetd or xinetd on Unix variants. However, it is also possible to implement a daemon that listens on the specified port. This is described more fully in in the Architecture section of the GridFTP Developer's Guide.
- service certificate
A EEC for a specific service (e.g. FTP or LDAP). When using GSI this certificate is typically stored in
/etc/grid-security/. For more information on possible service certificate locations, see this.
- service credentials
The combination of a service certificate and its corresponding private key.
- stream mode (MODE S)
The only mode normally implemented for FTP is MODE S. This is simply sending each byte, one after another over the socket in order, with no application level framing of any kind. This is the default and is what a standard FTP server will use. This is also the default for GridFTP.
When speaking about GridFTP transfers, striping refers to having multiple network endpoints at the source, destination, or both participating in the transfer of the same file. This is normally accomplished by having a cluster with a parallel shared file system. Each node in the cluster reads a section of the file and sends it over the network. This mode of transfer is necessary if you wish to transfer a single file faster than a single host is capable of. This also tends to only be effective for large files, though how large depends on how many hosts and how fast the end-to-end transfer is. Note that while it is theoretically possible to use NFS for the shared file system, your performance will be poor, and would make using striping pointless.
- target name
The address or location of a copy of a data item on a storage system.
- third party transfers
In the simplest terms, a third party transfer moves a file between two GridFTP servers.
The following is a more detailed, programmatic description.
In a third party transfer, there are three entities involved. The client, who will only orchestrate, but not actually take place in the data transfer, and two servers one of which will be sending data to the other. This scenario is common in Grid applications where you may wish to stage data from a data store somewhere to a supercomputer you have reserved. The commands are quite similar to the client/server transfer. However, now the client must establish two control channels, one to each server. He will then choose one to listen, and send it the PASV command. When it responds with the IP/port it is listening on, the client will send that IP/port as part of the PORT command to the other server. This will cause the second server to connect to the first server, rather than the client. To initiate the actual movement of the data, the client then sends the RETR “filename” command to the server that will read from disk and write to the network (the “sending” server) and will send the STOR “filename” command to the other server which will read from the network and write to the disk (the “receiving” server).
See Also client/server transfer.
- transport-level security
Uses transport-level security (TLS) mechanisms.
- trusted CAs directory
The directory containing the CA certificates and signing policy files of the CAs trusted by GSI. Typically this directory is
/etc/grid-security/certificates. For more information see this.
- Universally Unique Identifier (UUID)
Identifier that is immutable and unique across time and space.
- user certificate
A EEC belonging to a user. When using GSI, this certificate is typically stored in
$HOME/.globus/usercert.pem. For more information on possible user certificate locations, see this.
- user credentials
The combination of a user certificate and its corresponding private key.