What is a process?

A process in BioMAJ applies a transformation (blast indexing, emboss indexing, …) to the data bank.

Process handlers

  • The process manager runs the processes sequentially or in parallel
  • A process is a tool executable together with its arguments
  • A process must return an exit code of 0; otherwise it is considered to have failed
  • Per-process logs (STDOUT/STDERR) are recorded in the bank log directory
  • Multiple execution environments are available (local, docker, drmaa)
  • The executables (exe) are added to the PATH
  • Any executable can be called
  • The processes can be stored in /process

Pre-processes

Examples:

  • Helping to download files protected by credentials
  • Retrieving files from a private bank
  • Checking the disk space
  • Etc.
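
As an illustration of the disk space check listed above, a pre-process could look like the following minimal sketch. The function name `has_free_space`, the checked directory and the 1 KB demo threshold are assumptions made for this example and are not part of BioMAJ; only the exit code convention (0 means success) comes from this page.

```shell
#!/bin/sh
# Hypothetical pre-process sketch: succeed only if enough disk space is free.
# has_free_space DIR REQUIRED_KB returns 0 if DIR has at least REQUIRED_KB free.
has_free_space() {
    dir=$1
    required_kb=$2
    # df -P prints one line per filesystem in a portable format;
    # -k reports sizes in 1024-byte blocks. Column 4 is the available space.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ -n "$available_kb" ] && [ "$available_kb" -ge "$required_kb" ]
}

# Demo call: check the current directory for at least 1 KB of free space.
if has_free_space . 1; then
    echo "enough space"
else
    echo "not enough space" >&2
fi
```

A real pre-process would end the failure branch with `exit 1`, so that BioMAJ sees a non-zero exit code and treats the process as failed.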

How to call a pre-process?

BLOCK1.db.pre.process=META1,META2,META3

Post-processes

An example of a script that concatenates several files into one and adds the resulting file as a dependency:

# Concatenate all files passed as arguments into a single file.
workdir="$datadir/$dirversion/$localrelease/flat"
# Stop with a non-zero exit code if the directory does not exist.
cd "$workdir" || exit 1
cat "$@" > "$workdir/output"

How to call a post-process?

BLOCK1.db.post.process=META0

How to create one?

Even though BioMAJ is shipped with some post-processes, you might need to write your own.

Here are some points that will help you in the process:

  • BioMAJ can run scripts or binaries
  • The executables must be placed in the directory process.dir
  • You have access to several environment variables set by BioMAJ. The exhaustive list is available here, but the most useful are:
    • $datadir: root directory for all production directories
    • $dirversion: production directory
    • $offlinedir: temporary directory
    • $localrelease: directory of the release within $dirversion
    • $remoterelease: version number (available only for post-processes)
    • The current location of the files in the post-process stage is $datadir/$dirversion/$localrelease
    • The downloaded raw data is available in the directory $datadir/$dirversion/future_release/flat
  • BioMAJ allows basic and advanced interaction with the executable
    • Basic interaction: BioMAJ retrieves the return value at the end of the program. If no value is returned or the value is 0, the execution of the program is considered successful. If any other value is received, an error is reported and the workflow is stopped.
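
The points above can be combined into a skeleton for a custom post-process. This is a minimal sketch, not a script shipped with BioMAJ: the function name `release_flat_dir` and the fallback values used for the demo are assumptions, while the variable names and the exit code convention come from the list above.

```shell
#!/bin/sh
# Hypothetical helper: build the path of the flat directory of the current
# release from the variables set by BioMAJ, and fail if one of them is missing.
release_flat_dir() {
    if [ -z "$datadir" ] || [ -z "$dirversion" ] || [ -z "$localrelease" ]; then
        echo "datadir, dirversion and localrelease must be set" >&2
        return 1
    fi
    echo "$datadir/$dirversion/$localrelease/flat"
}

# Demo values (assumptions): BioMAJ sets these itself when it runs a process.
: "${datadir:=/tmp/biomaj_demo}"
: "${dirversion:=alu}"
: "${localrelease:=alu_demo_release}"

if workdir=$(release_flat_dir); then
    echo "working in $workdir"
    status=0    # exit code 0: the process is considered successful
else
    status=1    # any other exit code: the workflow is stopped
fi
# A real script would finish with: exit "$status"
```

Checking the variables up front makes a misconfigured run fail immediately with a clear message instead of writing files to an unexpected location.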

How to call a post-process in bank.properties? (example for the alu bank)

Post-processes are defined in the configuration through the BLOCKS property. BLOCKS lists one or more blocks, and each block lists META processes. Blocks are executed sequentially; within a block, the META processes are executed in parallel. Each META defines a list of processes that are executed sequentially. By default, 2 threads are used for parallel execution.

## Post Process ## The files should be located in the projectfiles/process directory
BLOCKS=BLOCK1
BLOCK1.db.post.process=META0
META0=test_biomaj

test_biomaj.name=biomaj_test
#test_biomaj.desc=test biomaj
test_biomaj.cluster=false
test_biomaj.type=test
test_biomaj.exe=./<name_of_the_script>.sh

Do not forget to make your process files executable.
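
The alu example uses a single block and a single META. To make the execution order described earlier more concrete, here is a hypothetical layout with two blocks. The META and process names (`proc_a` to `proc_d`) are placeholders invented for this sketch; each process would still need its own name, type, exe and cluster properties, as in the example above.

```properties
# Hypothetical layout; the META and process names are placeholders.
BLOCKS=BLOCK1,BLOCK2

# BLOCK1 runs first; META0 and META1 are executed in parallel.
BLOCK1.db.post.process=META0,META1
# The processes inside a META are executed one after the other.
META0=proc_a,proc_b
META1=proc_c

# BLOCK2 starts only once BLOCK1 has finished.
BLOCK2.db.post.process=META2
META2=proc_d
```

With this layout, `proc_b` waits for `proc_a`, `proc_c` may run at the same time as either of them, and `proc_d` runs last.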

Processes properties

Mandatory:

  • Use: PROC0.<option>
  • PROC0.name: process name
  • PROC0.desc: process description
  • PROC0.cluster: local execution or execution on a cluster
  • PROC0.type: type of script executed
  • PROC0.exe: executable, shell command… (run from process.dir or from the PATH)
  • PROC0.args: arguments needed by the executable

Optional:

  • PROC0.native: native cluster option (queue…)

How to define metadata for processes?

  • PROC0.format: format of the generated data (blast)
  • PROC0.types: data types (nucleic)
  • PROC0.tags: tags describing what the data belongs to (chr:chr1, organism:hg19)
  • PROC0.files: list of generated files (dir1/file1, dir1/file2); the paths are relative to the directory of the release.

A process can also declare this metadata itself by printing lines with the following syntax on its standard output:

##BIOMAJ#format#list_of_types#list_of_key_value_tags#list_of_files

echo "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr1#blast/chr1/chr1db"
echo "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr2#blast/chr2/chr2db"
echo "any text"
echo "##BIOMAJ#fasta#proteic#organism:hg19#fasta/chr1.fa,fasta/chr2.fa" 
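
When a script emits many such lines, a small helper keeps the field order consistent. The following is a minimal sketch; the function name `biomaj_meta` is an assumption made for this example and is not provided by BioMAJ.

```shell
#!/bin/sh
# Hypothetical helper: print one metadata line in the
# ##BIOMAJ#format#types#tags#files syntax on standard output.
biomaj_meta() {
    format=$1   # for example: blast
    types=$2    # comma-separated data types, for example: nucleic
    tags=$3     # comma-separated key:value pairs, for example: organism:hg19,chr:chr1
    files=$4    # comma-separated paths, relative to the release directory
    printf '##BIOMAJ#%s#%s#%s#%s\n' "$format" "$types" "$tags" "$files"
}

# Reproduces the first line of the example above.
biomaj_meta blast nucleic "organism:hg19,chr:chr1" "blast/chr1/chr1db"
```

Any other text printed by the script, such as the "any text" line in the example, does not use the ##BIOMAJ# prefix.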

How to simplify tool installation with Conda?

Conda is a package and environment management system. It makes it easier to install tools and software, especially in bioinformatics. Many tools are available on bioconda (https://bioconda.github.io/recipes.html).

Conda installation

Conda is installed directly with biomaj-docker. It must be installed manually with the monolithic version of biomaj.

How to install conda manually? Here.

How to proceed?

A Python script (available here with its wrapper) installs your package(s) of interest through a BioMAJ post-process. You just have to:

  • Add a special block to your bank.properties file (example here) to install your package(s) of interest
  • Download the conda wrapper into biomaj/process
  • Download the Python script into biomaj/process
  • Give them execute permissions: chmod 755 biomaj/process/*

####################
### Post Process ###
####################  The files should be located in the projectfiles/process directory.


BLOCKS=BLOCK1,BLOCK2
BLOCK1.db.post.process=META0
META0=conda

#wrapper_install_conda.sh + conda_install_multi.py
conda.name=conda
conda.type=install
conda.exe=wrapper_install_conda.sh
conda.args=blast $processdir/packageblast.txt $processdir
conda.cluster=false

BLOCK2.db.post.process=META1
META1=makeblastdb


#makeblastdb.sh
makeblastdb.desc=Index blast
makeblastdb.type=index
makeblastdb.cluster=
makeblastdb.name=makeblastdb
makeblastdb.args="flat/swissprot" "blast/" "-dbtype prot" "swissprot"
makeblastdb.exe=makeblastdb.sh

How to specify the packages to install?

Create the <list>.txt file with all the conda packages you want to install (version numbers are available on bioconda, for example here), and place it in your biomaj/process directory:

bwa=0.7.8
blast=2.5.0
bowtie2=2.3.4.1

Then complete the conda.args line of your bank.properties file with your environment name and the name of your <list>.txt file.

conda.args=<environment name> $processdir/<list>.txt $processdir
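
Since the list file is plain text with one entry per line, it can be checked before the bank update is launched. The following is a minimal sketch: the function name `check_package_list` is an assumption made for this example, it is not part of the conda wrapper, and it assumes that every entry pins a version in the `name=version` form shown above.

```shell
#!/bin/sh
# Hypothetical check: every non-empty line must look like name=version.
check_package_list() {
    list_file=$1
    if [ ! -r "$list_file" ]; then
        echo "cannot read $list_file" >&2
        return 1
    fi
    # grep -v selects the lines that do NOT match; -E enables extended regexps.
    # Empty lines are accepted; any other line must be name=version.
    if grep -Ev '^([A-Za-z0-9._-]+=[A-Za-z0-9._]+)?$' "$list_file" >/dev/null; then
        echo "invalid entry in $list_file" >&2
        return 1
    fi
    return 0
}

# Demo with the three packages listed above, written to a temporary file.
demo_list=$(mktemp)
printf 'bwa=0.7.8\nblast=2.5.0\nbowtie2=2.3.4.1\n' > "$demo_list"
check_package_list "$demo_list" && echo "package list looks valid"
rm -f "$demo_list"
```

The check only validates the shape of each line; it does not verify that the package or version exists on bioconda.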

How to use your conda environment?

In your post process script, do not forget to activate your environment (in our example, the makeblastdb.sh file):

source activate blast

Do not forget to deactivate your environment at the end of your script:

source deactivate

To delete your environment:

rm -r biomaj/process/blast

More information (how to add metadata, how to add many processes) here.

More examples here.