Galaxy Configuration

Examples

The most complete and up-to-date documentation for configuring Galaxy job destinations is Galaxy’s job_conf.xml.sample_advanced file (check it out on GitHub). These examples just provide a different, Pulsar-centric perspective on some of the documentation in that file.

Simple Windows Pulsar Web Server

The following Galaxy job_conf.xml assumes you have deployed a simple Pulsar web server to the Windows host windowshost.example.com on the default port (8913) with a private_token (defined in app.yml) of 123456789changeme. Most Galaxy jobs will simply use Galaxy’s local job runner, but msconvert and proteinpilot jobs will be sent to the Pulsar server on windowshost.example.com. Sophisticated tool dependency resolution is not available for Windows-based Pulsar servers, so ensure the underlying applications are on the Pulsar server’s PATH.

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarLegacyJobRunner"/>
    </plugins>
    <destinations default="local">
        <destination id="local" runner="local"/>
        <destination id="win_pulsar" runner="pulsar">
            <param id="url">https://windowshost.examle.com:8913/</param>
            <param id="private_token">123456789changeme</param>
        </destination>
    </destinations>
    <tools>
        <tool id="msconvert" destination="win_pulsar" />
        <tool id="proteinpilot" destination="win_pulsar" />
    </tools>
</job_conf>
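
On the Pulsar side, the token this destination supplies is set in Pulsar’s app.yml. A minimal sketch of that file follows; the token must match the private_token param above, and the staging path is just an illustrative value:

# Pulsar app.yml on windowshost.example.com (sketch; other options omitted)
private_token: 123456789changeme      # must match the private_token param in job_conf.xml
staging_directory: C:\pulsar\staging  # illustrative path where job files are staged on the Windows host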

Targeting a Linux Cluster (Pulsar Web Server)

The following Galaxy job_conf.xml assumes you have a very typical Galaxy setup: a local, smaller cluster that mounts all of Galaxy’s data (so no need for Pulsar) and a bigger shared resource that cannot mount Galaxy’s files, requiring the use of Pulsar. This variant routes some larger assembly jobs, namely the trinity and abyss tools, to the remote cluster.

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarRESTJobRunner"/>
    </plugins>
    <destinations default="local_cluster">
        <destination id="local_cluster" runner="drmaa">
            <param id="native_specification">-P littlenodes -R y -pe threads 4</param>
        </destination>
        <destination id="remote_cluster" runner="pulsar">
            <param id="url">http://remotelogin:8913/</param>
            <param id="submit_native_specification">-P bignodes -R y -pe threads 16</param>
            <!-- Look for trinity package at remote location - define tool_dependency_dir
            in the Pulsar app.yml file.
            -->
            <param id="dependency_resolution">remote</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>

For this configuration, be sure to also set DRMAA_LIBRARY_PATH in local_env.sh on the Pulsar side, install the Python drmaa module, and configure a DRMAA job manager for Pulsar in app.yml as described in Job Managers.
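
A minimal sketch of those Pulsar-side pieces is shown below; the library path and the queued_drmaa manager type are illustrative values, so adjust them for your scheduler and consult Job Managers for the full set of options:

# local_env.sh - point the Python drmaa module at your scheduler's DRMAA library
export DRMAA_LIBRARY_PATH=/usr/lib/gridengine-drmaa/lib/libdrmaa.so.1.0

# app.yml - configure a DRMAA-backed job manager
managers:
  _default_:
    type: queued_drmaa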

Targeting a Linux Cluster (Pulsar over Message Queue)

For Pulsar instances sitting behind a firewall, running a web server may be impossible. If the same Pulsar configuration discussed above is additionally configured with a message_queue_url of amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672// in app.yml, the following Galaxy configuration will cause this message queue to be used for communication. This is also likely better for large file transfers, since typically your production Galaxy server will be sitting behind a high-performance proxy while Pulsar will not.

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
        <plugin id="pulsar_default" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Must tell Pulsar where to send files. -->
            <param id="galaxy_url">https://galaxy.example.org</param>
            <!-- Message Queue Connection (should match message_queue_url in Pulsar's app.yml)
            -->
            <param id="amqp_url">amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672//</param>
        </plugin>
        <plugin id="pulsar_hugenodes" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <param id="galaxy_url">https://galaxy.example.org</param>
            <param id="amqp_url">amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672//</param>
            <!-- Set the 'manager' param to reference a named Pulsar job manager -->
            <param id="manager">hugenodes</param>
        </plugin>
    </plugins>
    <destinations default="local_cluster">
        <destination id="local_cluster" runner="drmaa">
            <param id="native_specification">-P littlenodes -R y -pe threads 4</param>
        </destination>
        <destination id="bignodes_cluster" runner="pulsar_default">
            <!-- Tell Galaxy where files are being stored on remote system, so
                 the web server can simply ask for this information.
            -->
            <param id="jobs_directory">/path/to/remote/pulsar/files/staging/</param>
            <!-- Remaining parameters same as previous example -->
            <param id="submit_native_specification">-P bignodes -R y -pe threads 16</param>
        </destination>
        <destination id="hugenodes_cluster" runner="pulsar_hugenodes">
            <param id="jobs_directory">/path/to/remote/pulsar/files/staging/</param>
            <param id="submit_native_specification">-P hugenodes -R y -pe threads 128</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="bignodes_cluster" />
        <tool id="abyss" destination="hugenodes_cluster" />
    </tools>
</job_conf>

The manager param of the PulsarMQJobRunner plugin allows using the same AMQP server and vhost (in this example, the default "/" vhost) for multiple Pulsar servers, or submitting jobs to multiple managers (see: Job Managers) on the same Pulsar server.

In this example, the _default_ job manager will be used for trinity jobs, and the hugenodes job manager will be used for abyss jobs.
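
For reference, the matching Pulsar app.yml for this setup might look roughly like the sketch below; the queued_drmaa manager type (and placing both managers in one app.yml) is an illustrative assumption, see Job Managers for details:

message_queue_url: amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672//
managers:
  _default_:
    type: queued_drmaa
  hugenodes:
    type: queued_drmaa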

Note

If you only need to define different submit_native_specification params on the same cluster for these tools/destinations, it is not necessary to use a separate manager - multiple destinations can reference the same plugin. This example is for documentation purposes.

All of the amqp_* options documented in app.yml.sample can be specified as params to the PulsarMQJobRunner plugin. These configure Galaxy’s connection to the AMQP server (rather than Pulsar’s connection, which is configured in Pulsar’s app.yml). Additionally, specifying the persistence_directory param controls where AMQP acknowledgement receipts will be stored on the Galaxy side.
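
For example, a plugin definition passing a couple of these options might look like the sketch below; the particular amqp_* option names and values shown are only illustrative, so consult app.yml.sample for the authoritative list:

<plugin id="pulsar_mq_tuned" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
    <param id="galaxy_url">https://galaxy.example.org</param>
    <param id="amqp_url">amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672//</param>
    <!-- AMQP tuning options as documented in app.yml.sample -->
    <param id="amqp_acknowledge">true</param>
    <param id="amqp_consumer_timeout">2.0</param>
    <!-- Where Galaxy stores AMQP acknowledgement receipts -->
    <param id="persistence_directory">/srv/galaxy/var/pulsar_amqp_ack</param>
</plugin>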

For those interested in this deployment option and new to Message Queues, there is more documentation in Message Queues with Galaxy and Pulsar.

Additionally, Pulsar ships with rsync and scp transfer actions that can be used in place of the HTTP transport method:

<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="pulsar_mq" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Must tell Pulsar where to send files. -->
            <param id="galaxy_url">https://galaxyserver</param>
            <!-- Message Queue Connection (should match message_queue_url in
                 Pulsar's app.yml). pyamqp may be necessary over amqp if SSL is used
            -->
            <param id="amqp_url">pyamqp://rabbituser:rabb8pa8sw0d@mqserver:5671//?ssl=1</param>
        </plugin>
    </plugins>
    <destinations default="pulsar_mq">
        <destination id="remote_cluster" runner="pulsar_mq">
            <!-- This string is replaced by Pulsar, removing the requirement
                 of coordinating Pulsar installation directory between cluster
                 admin and galaxy admin
            -->
            <param id="jobs_directory">__PULSAR_JOBS_DIRECTORY__</param>
            <!-- Provide connection information, should look like:

                    paths:
                        - path: /home/vagrant/  # Home directory for galaxy user
                          action: remote_rsync_transfer # _rsync_ and _scp_ are available
                          ssh_user: vagrant
                          ssh_host: galaxy-vm.host.edu
                          ssh_port: 22

            -->
             <param id="file_action_config">file_actions.yaml</param>
             <!-- Provide an SSH key for access to the local $GALAXY_ROOT,
            should be accessible with the username/hostname provided in
            file_actions.yaml
             -->
             <param id="ssh_key">-----BEGIN RSA PRIVATE KEY-----
            .............
            </param>
            <!-- Allow the remote end to know who is running the job, may need
                 to append @domain.edu after it. Only used if the
                 "DRMAA (via external users) manager" is used
             -->
            <param id="submit_user">$__user_name__</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>
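
The file_actions.yaml referenced above is resolved relative to Galaxy’s root and contains the connection details sketched in the comment, for example:

paths:
  - path: /home/vagrant/           # Home directory for the galaxy user
    action: remote_rsync_transfer  # remote_scp_transfer is also available
    ssh_user: vagrant
    ssh_host: galaxy-vm.host.edu
    ssh_port: 22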

Targeting Apache Mesos (Prototype)

See the commit message for the initial work on this and the related post on the galaxy-dev mailing list.

Generating Galaxy Metadata in Pulsar Jobs

This option is often referred to as remote metadata.

Typically Galaxy will process Pulsar job outputs and generate metadata on the Galaxy server. One can force this to happen inside Pulsar jobs (wherever the Pulsar job runs). This is similar to the way that non-Pulsar Galaxy jobs work: job output metadata is generated at the end of a standard Galaxy job, not by the Galaxy server.

This option comes with a downside that you should be aware of, explained in Issue #234. Unless you are seeing high load on your Galaxy server while finishing Pulsar jobs, it is safest to use the default (remote metadata disabled).

In order to enable the remote metadata option:

  1. Set GALAXY_VIRTUAL_ENV to the path to Galaxy’s virtualenv (or one containing Galaxy’s dependencies) when starting Pulsar. This can be done in the local_env.sh file. Instructions on setting up a Galaxy virtualenv can be found in the Galaxy Docs.

  2. Instruct Pulsar with the path to a copy of Galaxy at the same version as your Galaxy server. This can either be done by setting GALAXY_HOME in local_env.sh, or by setting galaxy_home in app.yml.

  3. In the Galaxy job_conf.xml destination(s) where you want to enable remote metadata, set the following params (a combined destination sketch follows this list):

    <param id="remote_metadata">true</param>
    <param id="remote_property_galaxy_home">/path/to/galaxy</param>
    

    and either:

    <param id="use_metadata_binary">true</param>
    

    or:

    <param id="use_remote_datatypes">false</param>
    

Data Staging

Most of the parameters settable in Galaxy’s job configuration file job_conf.xml are straightforward, but specifying how Galaxy and Pulsar stage various files may benefit from more explanation.

default_file_action defined in Galaxy’s job_conf.xml describes how inputs, outputs, indexed reference data, etc. are staged. The default, transfer, has Galaxy initiate HTTP transfers. This makes little sense in the context of message queues, so in that case it should be set to remote_transfer, which causes Pulsar to initiate the file transfers. Additional options are available, including none, copy, and remote_copy.
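
For example, a message-queue Pulsar destination would typically set this explicitly; a sketch (the destination id and runner name reuse the earlier examples):

<destination id="remote_cluster" runner="pulsar_mq">
    <param id="default_file_action">remote_transfer</param>
    <param id="jobs_directory">__PULSAR_JOBS_DIRECTORY__</param>
</destination>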

In addition to this default, paths may be overridden based on various patterns to allow optimization of file transfers in production infrastructures where different systems mount different file stores, or mount the same file stores at different paths.

To do this, the Pulsar destination defined in Galaxy’s job_conf.xml may specify a parameter named file_action_config. This must be a config file path (if relative, relative to Galaxy’s root) such as config/pulsar_actions.yaml (the file can be YAML or JSON, but older Galaxy releases only supported JSON). The following captures the available options:

paths: 
  # Use transfer (or remote_transfer) if only Galaxy mounts a directory.
  - path: /galaxy/files/store/1
    action: transfer

  # Use copy (or remote_copy) if remote Pulsar server also mounts the directory
  # but the actual compute servers do not.
  - path: /galaxy/files/store/2
    action: copy

  # If Galaxy, the Pulsar, and the compute nodes all mount the same directory
  # staging can be disabled altogether for given paths.
  - path: /galaxy/files/store/3
    action: none

  # Following block demonstrates specifying paths by globs as well as rewriting
  # unstructured data in .loc files.
  - path: /mnt/indices/**/bwa/**/*.fa
    match_type: glob
    path_types: unstructured  # Set to *any* to apply to defaults & unstructured paths.
    action: transfer
    depth: 1  # Stage whole directory with job and not just file.

  # Following block demonstrates rewriting paths without staging. Useful for
  # instance if Galaxy's data indices are mounted on both servers but with
  # different paths.
  - path: /galaxy/data
    path_types: unstructured
    action: rewrite
    source_directory: /galaxy/data
    destination_directory: /work/galaxy/data

  # The following demonstrates use of the Rsync transport layer
  - path: /galaxy/files/
    action: remote_rsync_transfer
    # Additionally the action remote_scp_transfer is available which behaves in
    # an identical manner
    ssh_user: galaxy
    ssh_host: f.q.d.n
    ssh_port: 22

# See action_mapper.py for explanation of mapper types:
# - input: Galaxy input datasets and extra files.
# - config: Galaxy config and param files.
# - tool: Files from tool's tool_dir (for now just wrapper if available).
# - workdir: Input work dir files - e.g. task-split input file.
# - metadata: Input metadata files.
# - output: Galaxy output datasets in their final home.
# - output_workdir:  Galaxy from_work_dir output paths and other files (e.g. galaxy.json)
# - output_metadata: Meta job and data files (e.g. Galaxy metadata generation files and
#                    metric instrumentation files)
# - unstructured: Other fixed tool parameter paths (likely coming from tool data, but not
#                 necessarily).