Galaxy Configuration¶
Examples¶
The most complete and updated documentation for configuring Galaxy job
destinations is Galaxy’s job_conf.xml.sample_advanced
file (check it out on
GitHub).
These examples just provide a different Pulsar-centric perspective on some of the documentation in that file.
Simple Windows Pulsar Web Server¶
The following Galaxy job_conf.xml assumes you have deployed a simple Pulsar
web server to the Windows host windowshost.example.com on the default port
(8913) with a private_token (defined in app.yml) of 123456789changeme.
Most Galaxy jobs will just use Galaxy’s local job runner, but msconvert
and proteinpilot will be sent to the Pulsar server on
windowshost.example.com. Sophisticated tool dependency resolution is not
available for Windows-based Pulsar servers, so ensure the underlying
applications are on the Pulsar server’s path.
<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="local" type="runner" load="galaxy.jobs.runners.local:LocalJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarLegacyJobRunner"/>
    </plugins>
    <destinations default="local">
        <destination id="local" runner="local"/>
        <destination id="win_pulsar" runner="pulsar">
            <param id="url">https://windowshost.example.com:8913/</param>
            <param id="private_token">123456789changeme</param>
        </destination>
    </destinations>
    <tools>
        <tool id="msconvert" destination="win_pulsar" />
        <tool id="proteinpilot" destination="win_pulsar" />
    </tools>
</job_conf>
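On the Pulsar side, the matching app.yml might look like the following sketch. The host and port values here are illustrative assumptions; only private_token must match the Galaxy destination param:

```yaml
# Pulsar app.yml on windowshost.example.com (sketch; host/port are assumptions)
host: 0.0.0.0                     # listen on all interfaces so Galaxy can connect
port: 8913                        # Pulsar's default port
private_token: 123456789changeme  # must match the private_token param in job_conf.xml
```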
Targeting a Linux Cluster (Pulsar Web Server)¶
The following Galaxy job_conf.xml assumes you have a very typical Galaxy
setup: a local, smaller cluster that mounts all of Galaxy’s data (so
no need for Pulsar) and a bigger shared resource that cannot mount Galaxy’s
files, requiring the use of Pulsar. This variant routes some larger assembly
jobs - namely the trinity and abyss tools - to the remote cluster.
<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
        <plugin id="pulsar" type="runner" load="galaxy.jobs.runners.pulsar:PulsarRESTJobRunner"/>
    </plugins>
    <destinations default="local_cluster">
        <destination id="local_cluster" runner="drmaa">
            <param id="native_specification">-P littlenodes -R y -pe threads 4</param>
        </destination>
        <destination id="remote_cluster" runner="pulsar">
            <param id="url">http://remotelogin:8913/</param>
            <param id="submit_native_specification">-P bignodes -R y -pe threads 16</param>
            <!-- Look for trinity package at remote location - define tool_dependency_dir
                 in the Pulsar app.yml file.
            -->
            <param id="dependency_resolution">remote</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>
For this configuration, on the Pulsar side be sure to also set
DRMAA_LIBRARY_PATH in local_env.sh, install the Python drmaa module, and
configure a DRMAA job manager for Pulsar in app.yml as described in Job
Managers.
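Those Pulsar-side pieces might be sketched as follows: after exporting DRMAA_LIBRARY_PATH in local_env.sh (e.g. export DRMAA_LIBRARY_PATH=/usr/lib/gridengine-drmaa/lib/libdrmaa.so - the path is an assumption that depends on your scheduler), app.yml could define a DRMAA-backed manager:

```yaml
# Pulsar app.yml - run jobs through a DRMAA-backed job manager (sketch)
managers:
  _default_:
    type: queued_drmaa
```

The native specification itself is submitted from Galaxy via the submit_native_specification param in the example above, so it need not be repeated here.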
Targeting a Linux Cluster (Pulsar over Message Queue)¶
For Pulsar instances sitting behind a firewall, exposing a web server may be
impossible. If the same Pulsar configuration discussed above is additionally
configured with a message_queue_url of
amqp://rabbituser:rabb8pa8sw0d@mqserver:5672// in app.yml, the following
Galaxy configuration will cause this message queue to be used for
communication. This is also likely better for large file transfers, since
typically your production Galaxy server will be sitting behind a
high-performance proxy while Pulsar will not.
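On the Pulsar side, this amounts to a single app.yml setting (sketch, reusing the credentials from the example above):

```yaml
# Pulsar app.yml - communicate with Galaxy over AMQP instead of HTTP
message_queue_url: amqp://rabbituser:rabb8pa8sw0d@mqserver:5672//
```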
<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="drmaa" type="runner" load="galaxy.jobs.runners.drmaa:DRMAAJobRunner"/>
        <plugin id="pulsar_default" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Must tell Pulsar where to send files. -->
            <param id="galaxy_url">https://galaxy.example.org</param>
            <!-- Message Queue Connection (should match message_queue_url in Pulsar's app.yml) -->
            <param id="amqp_url">amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672//</param>
        </plugin>
        <plugin id="pulsar_hugenodes" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <param id="galaxy_url">https://galaxy.example.org</param>
            <param id="amqp_url">amqp://rabbituser:rabb8pa8sw0d@mqserver.example.org:5672//</param>
            <!-- Set the 'manager' param to reference a named Pulsar job manager -->
            <param id="manager">hugenodes</param>
        </plugin>
    </plugins>
    <destinations default="local_cluster">
        <destination id="local_cluster" runner="drmaa">
            <param id="native_specification">-P littlenodes -R y -pe threads 4</param>
        </destination>
        <destination id="bignodes_cluster" runner="pulsar_default">
            <!-- Tell Galaxy where files are being stored on the remote system, so
                 the web server can simply ask for this information.
            -->
            <param id="jobs_directory">/path/to/remote/pulsar/files/staging/</param>
            <!-- Remaining parameters same as previous example -->
            <param id="submit_native_specification">-P bignodes -R y -pe threads 16</param>
        </destination>
        <destination id="hugenodes_cluster" runner="pulsar_hugenodes">
            <param id="jobs_directory">/path/to/remote/pulsar/files/staging/</param>
            <param id="submit_native_specification">-P hugenodes -R y -pe threads 128</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="bignodes_cluster" />
        <tool id="abyss" destination="hugenodes_cluster" />
    </tools>
</job_conf>
The manager param to the PulsarMQJobRunner plugin allows for using the same
AMQP server and vhost (in this example, the default / vhost) between multiple
Pulsar servers, or submitting jobs to multiple managers (see: Job Managers)
on the same Pulsar server. In this example, the _default_ job manager will be
used for trinity jobs, and the hugenodes job manager will be used for abyss
jobs.
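On the Pulsar side, the hugenodes manager referenced by the pulsar_hugenodes plugin would be defined alongside _default_ in app.yml. A sketch, assuming DRMAA-backed managers (the manager types and any per-manager options are assumptions to adapt to your site):

```yaml
# Pulsar app.yml - one Pulsar server exposing two named job managers (sketch)
managers:
  _default_:    # used by the pulsar_default plugin (no manager param set)
    type: queued_drmaa
  hugenodes:    # referenced by <param id="manager">hugenodes</param>
    type: queued_drmaa
```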
Note
If you only need to define different submit_native_specification
params
on the same cluster for these tools/destinations, it is not necessary to use
a separate manager - multiple destinations can reference the same plugin.
This example is for documentation purposes.
All of the amqp_*
options documented in app.yml.sample can be specified
as params to the PulsarMQJobRunner
plugin. These configure Galaxy’s
connection to the AMQP server (rather than Pulsar’s connection, which is
configured in Pulsar’s app.yml
). Additionally, specifying the
persistence_directory
param controls where AMQP acknowledgement receipts
will be stored on the Galaxy side.
For those interested in this deployment option and new to Message Queues, there is more documentation in Message Queues with Galaxy and Pulsar.
Additionally, Pulsar ships with rsync and scp transfer actions that can be used in place of the HTTP transport method:
<?xml version="1.0"?>
<job_conf>
    <plugins>
        <plugin id="pulsar_mq" type="runner" load="galaxy.jobs.runners.pulsar:PulsarMQJobRunner">
            <!-- Must tell Pulsar where to send files. -->
            <param id="galaxy_url">https://galaxyserver</param>
            <!-- Message Queue Connection (should match message_queue_url in
                 Pulsar's app.yml). pyamqp may be necessary over amqp if SSL is used
            -->
            <param id="amqp_url">pyamqp://rabbituser:rabb8pa8sw0d@mqserver:5671//?ssl=1</param>
        </plugin>
    </plugins>
    <destinations default="pulsar_mq">
        <destination id="remote_cluster" runner="pulsar_mq">
            <!-- This string is replaced by Pulsar, removing the requirement
                 of coordinating the Pulsar installation directory between the
                 cluster admin and the Galaxy admin
            -->
            <param id="jobs_directory">__PULSAR_JOBS_DIRECTORY__</param>
            <!-- Provide connection information, should look like:
                 paths:
                   - path: /home/vagrant/           # Home directory for galaxy user
                     action: remote_rsync_transfer  # _rsync_ and _scp_ are available
                     ssh_user: vagrant
                     ssh_host: galaxy-vm.host.edu
                     ssh_port: 22
            -->
            <param id="file_action_config">file_actions.yaml</param>
            <!-- Provide an SSH key for access to the local $GALAXY_ROOT;
                 should be accessible with the username/hostname provided in
                 file_actions.yaml
            -->
            <param id="ssh_key">-----BEGIN RSA PRIVATE KEY-----
.............
</param>
            <!-- Allow the remote end to know who is running the job; may need
                 to append @domain.edu after it. Only used if the
                 "DRMAA (via external users) manager" is used
            -->
            <param id="submit_user">$__user_name__</param>
        </destination>
    </destinations>
    <tools>
        <tool id="trinity" destination="remote_cluster" />
        <tool id="abyss" destination="remote_cluster" />
    </tools>
</job_conf>
Targeting Apache Mesos (Prototype)¶
See commit message for initial work on this and this post on galaxy-dev.
Generating Galaxy Metadata in Pulsar Jobs¶
This option is often referred to as remote metadata.
Typically Galaxy will process Pulsar job outputs and generate metadata on the Galaxy server. One can force this to happen inside Pulsar jobs (wherever the Pulsar job runs). This is similar to the way that non-Pulsar Galaxy jobs work: job output metadata is generated at the end of a standard Galaxy job, not by the Galaxy server.
This option comes with a downside that you should be aware of, explained in Issue #234. Unless you are seeing high load on your Galaxy server while finishing Pulsar jobs, it is safest to use the default (remote metadata disabled).
In order to enable the remote metadata option:
1. Set GALAXY_VIRTUAL_ENV to the path of Galaxy’s virtualenv (or one
   containing Galaxy’s dependencies) when starting Pulsar. This can be done
   in the local_env.sh file. Instructions on setting up a Galaxy virtualenv
   can be found in the Galaxy Docs.
2. Instruct Pulsar with the path to a copy of Galaxy at the same version as
   your Galaxy server. This can either be done by setting GALAXY_HOME in
   local_env.sh, or by setting galaxy_home in app.yml.
3. In the Galaxy job_conf.xml destination(s) you want to enable remote
   metadata on, set the following params:

   <param id="remote_metadata">true</param>
   <param id="remote_property_galaxy_home">/path/to/galaxy</param>
and one of either:
<param id="use_metadata_binary">true</param>
or:
<param id="use_remote_datatypes">false</param>
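Putting these params together, a Pulsar destination with remote metadata enabled might look like the following sketch (the destination id, URL, and galaxy_home path are illustrative):

```xml
<destination id="remote_cluster" runner="pulsar">
    <param id="url">http://remotelogin:8913/</param>
    <param id="remote_metadata">true</param>
    <param id="remote_property_galaxy_home">/path/to/galaxy</param>
    <param id="use_metadata_binary">true</param>
</destination>
```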
Data Staging¶
Most of the parameters settable in Galaxy’s job configuration file
job_conf.xml are straightforward - but specifying how Galaxy and the Pulsar
stage various files may benefit from more explanation.
default_file_action defined in Galaxy’s job_conf.xml describes how inputs,
outputs, indexed reference data, etc. are staged. The default, transfer, has
Galaxy initiate HTTP transfers. This makes little sense in the context of
message queues, so in that setup it should be set to remote_transfer, which
causes Pulsar to initiate the file transfers. Additional options are
available, including none, copy, and remote_copy.
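For example, a message-queue-driven destination would typically set this param directly in job_conf.xml (a sketch; the destination id and runner name are illustrative):

```xml
<destination id="remote_cluster" runner="pulsar_mq">
    <!-- Have Pulsar, not Galaxy, initiate file transfers -->
    <param id="default_file_action">remote_transfer</param>
</destination>
```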
In addition to this default, paths may be overridden based on various patterns, allowing file transfers to be optimized in production infrastructures where different systems mount different file stores, or mount the same file stores at different paths.
To do this, the defined Pulsar destination in Galaxy’s job_conf.xml may
specify a parameter named file_action_config. This needs to be a config
file path (if relative, relative to Galaxy’s root) like
config/pulsar_actions.yaml (the file can be YAML or JSON, but older Galaxy
releases only supported JSON). The following captures the available options:
paths:
  # Use transfer (or remote_transfer) if only Galaxy mounts a directory.
  - path: /galaxy/files/store/1
    action: transfer

  # Use copy (or remote_copy) if the remote Pulsar server also mounts the
  # directory but the actual compute servers do not.
  - path: /galaxy/files/store/2
    action: copy

  # If Galaxy, the Pulsar, and the compute nodes all mount the same directory,
  # staging can be disabled altogether for the given paths.
  - path: /galaxy/files/store/3
    action: none

  # The following block demonstrates specifying paths by globs as well as
  # rewriting unstructured data in .loc files.
  - path: /mnt/indices/**/bwa/**/*.fa
    match_type: glob
    path_types: unstructured  # Set to *any* to apply to defaults & unstructured paths.
    action: transfer
    depth: 1  # Stage the whole directory with the job and not just the file.

  # The following block demonstrates rewriting paths without staging. Useful
  # for instance if Galaxy's data indices are mounted on both servers but with
  # different paths.
  - path: /galaxy/data
    path_types: unstructured
    action: rewrite
    source_directory: /galaxy/data
    destination_directory: /work/galaxy/data

  # The following demonstrates use of the rsync transport layer.
  - path: /galaxy/files/
    action: remote_rsync_transfer
    # Additionally, the action remote_scp_transfer is available, which behaves
    # in an identical manner.
    ssh_user: galaxy
    ssh_host: f.q.d.n
    ssh_port: 22

# See action_mapper.py for explanation of mapper path types:
# - input: Galaxy input datasets and extra files.
# - config: Galaxy config and param files.
# - tool: Files from the tool's tool_dir (for now just the wrapper if available).
# - workdir: Input work dir files - e.g. task-split input files.
# - metadata: Input metadata files.
# - output: Galaxy output datasets in their final home.
# - output_workdir: Galaxy from_work_dir output paths and other files (e.g. galaxy.json).
# - output_metadata: Meta job and data files (e.g. Galaxy metadata generation files
#   and metric instrumentation files).
# - unstructured: Other fixed tool parameter paths (likely coming from tool data,
#   but not necessarily).