What is a process?
A process in BioMAJ applies a transformation (BLAST indexing, EMBOSS indexing, …) to the data bank.
- The process manager handles their execution sequentially or in parallel
- Each process corresponds to a tool executable and its arguments
- A process must return an exit code of 0; otherwise it is considered to have failed
- Per-process logs (STDOUT/STDERR) are recorded in the bank log directory
- Multiple execution environments are available (local, docker, drmaa)
- The .exe executables are added to the PATH
- Any .exe can be called
- The processes can be stored in /process
- Help for downloading files protected by identifiers
- File recovery in a private bank
- Checking the disk space
How to call a pre-process?
BLOCK1.db.pre.process=META1,META2,META3
An example of a script that concatenates several files into one, and adds the resulting file as a dependency:
#!/bin/bash
# Directory holding the flat files of the current release
workdir="$datadir/$dirversion/$localrelease/flat"
cd "$workdir" || exit 1
# Concatenate all the files given as arguments into a single output file
cat "$@" > "$workdir/output"
How to call a post-process?
BLOCK1.db.post.process=META0
How to create one?
Even though BioMAJ ships with some post-processes, you might need to write your own.
Here are some points that will help you along the way:
- BioMAJ can run scripts or binaries
- The executables must be placed in the directory
- You have access to several environment variables that are set by BioMAJ. The exhaustive list is available here, but the most useful are:
$datadir: root directory for all production directories
$dirversion: production directory
$offlinedir: temporary directory
$localrelease: directory of the release in $dirversion
$remoterelease: version number (available only for post-processes)
- The current location of the files in the post-process stage can be obtained by :
- The downloaded raw data is available in the directory :
- BioMAJ allows basic and advanced interaction with the executable
- Basic interaction: BioMAJ retrieves the return value at the end of the program. If no value is returned or the value is 0, the execution of the program is considered successful. If any other value is received, an error is reported and the workflow is stopped.
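The return-value rule above can be illustrated with a small script. This is only a sketch: check_flat_dir is a hypothetical helper, not something BioMAJ provides, and the scratch directory stands in for a real release directory.

```shell
#!/bin/bash
# Sketch of a post-process that reports success or failure through its exit code.
# check_flat_dir is a hypothetical helper written for this example.

# Return 0 if the directory exists and holds at least one regular file, 1 otherwise.
check_flat_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "ERROR: missing directory $dir" >&2
        return 1
    fi
    if [ -z "$(find "$dir" -type f | head -n 1)" ]; then
        echo "ERROR: no file found in $dir" >&2
        return 1
    fi
    echo "OK: $dir contains data"
    return 0
}

# Example run against a scratch directory (placeholder for a real release directory).
demo_dir="$(mktemp -d)"
touch "$demo_dir/example.fa"
check_flat_dir "$demo_dir"

# Under BioMAJ, the directory would be built from the exported variables, e.g.
#   check_flat_dir "$datadir/$dirversion/$localrelease/flat"
```

If that call is the last command of the script, its status becomes the script's exit code: 0 lets the workflow continue, and any other value stops it.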
How to call a post-process in bank.properties? (example for the alu bank)
Post-processes are defined in the bank configuration through the BLOCKS property. BLOCKS lists the blocks, and each block defines some META processes. Blocks are executed sequentially; within a block, the META processes are executed in parallel. Each META defines a list of processes that are executed sequentially. By default, 2 threads are available for parallel execution.
## Post Process ##
# The files should be located in the projectfiles/process directory
BLOCKS=BLOCK1
BLOCK1.db.post.process=META0
META0=test_biomaj
test_biomaj.name=biomaj_test
#test_biomaj.desc=test biomaj
test_biomaj.cluster=false
test_biomaj.type=test
test_biomaj.exe=./<name_of_the_script>.sh
- Use: PROC0.<option>
- PROC0.name: process name
- PROC0.desc: process description
- PROC0.cluster: local execution or execution on a cluster
- PROC0.type: type of script executed
- PROC0.exe: executable, shell command… (executed in process.dir or from the PATH)
- PROC0.args: arguments needed by the executable
- PROC0.native: native cluster options (queue…)
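The scheduling rules for blocks, META processes and processes can be mimicked with plain shell jobs. The sketch below is an illustration only: run_meta and run_block are hypothetical helpers, the three "processes" are placeholder shell functions, and the limit of 2 parallel threads is not modelled.

```shell
#!/bin/bash
# Illustration only: mimics the BLOCK / META / process ordering with shell jobs.
# run_meta and run_block are hypothetical helpers, not BioMAJ commands.

# Run the processes of one META one after the other; stop at the first failure.
run_meta() {
    for proc in "$@"; do
        "$proc" || return 1
    done
}

# Run every META of a block in parallel, then wait for all of them.
# Each argument is a space-separated list of process names.
run_block() {
    pids=""
    status=0
    for meta in "$@"; do
        # Word splitting of $meta is intended: one word per process name.
        run_meta $meta &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Placeholder processes standing in for real tools.
index_blast() { echo "index_blast done"; }
index_emboss() { echo "index_emboss done"; }
publish() { echo "publish done"; }

# The first block holds two METAs that run in parallel;
# the second block starts only once the first one has succeeded.
run_block "index_blast" "index_emboss" && run_block "publish"
```

A failing process makes its META fail, which makes the block fail, so the following block is never started.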
How to define meta-information for processes?
- PROC0.format: process name (blast)
- PROC0.types: data types (nucleic)
- PROC0.tags: tags describing what the data belongs to (chr:chr1, organism:hg19)
- PROC0.files: list of generated files (dir1/file1, dir1/file2); the paths are relative to the release directory.
Syntax: ##BIOMAJ#format#list_of_types#list_of_key_value_tags#list_of_files
echo "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr1#blast/chr1/chr1db"
echo "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr2#blast/chr2/chr2db"
echo "any text"
echo "##BIOMAJ#fasta#proteic#organism:hg19#fasta/chr1.fa,fasta/chr2.fa"
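To make the field layout of such a line explicit, here is a small helper that splits it into its four parts. parse_biomaj_line is a hypothetical function written for this illustration; BioMAJ does its own parsing, and the way it treats malformed lines is not described here.

```shell
#!/bin/bash
# Hypothetical helper: split one "##BIOMAJ#..." line into its four fields.
parse_biomaj_line() {
    line="$1"
    case "$line" in
        "##BIOMAJ#"*) ;;      # metadata line, keep going
        *) return 1 ;;        # any other output is not metadata
    esac
    rest="${line#"##BIOMAJ#"}"
    old_ifs="$IFS"
    set -f                    # no filename expansion while splitting
    IFS='#'
    set -- $rest              # fields are separated by '#'
    IFS="$old_ifs"
    set +f
    [ "$#" -eq 4 ] || return 1
    echo "format=$1"
    echo "types=$2"
    echo "tags=$3"
    echo "files=$4"
}

parse_biomaj_line "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr1#blast/chr1/chr1db"
# format=blast
# types=nucleic
# tags=organism:hg19,chr:chr1
# files=blast/chr1/chr1db
```

A line such as "any text" makes the function return 1 without printing anything, which mirrors the idea that ordinary output can be mixed with metadata lines.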
How to simplify tools installation with Conda?
Conda is a package management and work environment management system. It makes it easier to install tools/software, especially in bioinformatics. Many tools are available on bioconda (https://bioconda.github.io/recipes.html).
Conda is installed directly with biomaj-docker. It must be installed manually with the monolithic version of biomaj.
How to install conda manually? Here.
How to proceed?
- Add a special block (in your bank.properties file: example here) to install your package(s) of interest
- Download the conda wrapper in biomaj/process
- Download the python script in biomaj/process
- Give them execute permissions: chmod 755 biomaj/process/*
####################
### Post Process ###
####################
# The files should be located in the projectfiles/process directory.
BLOCKS=BLOCK1,BLOCK2
BLOCK1.db.post.process=META0
META0=conda
#wrapper_install_conda.sh + conda_install_multi.py
conda.name=conda
conda.type=install
conda.exe=wrapper_install_conda.sh
conda.args=blast $processdir/packageblast.txt $processdir
conda.cluster=false
BLOCK2.db.post.process=META1
META1=makeblastdb
#makeblastdb.sh
makeblastdb.desc=Index blast
makeblastdb.type=index
makeblastdb.cluster=
makeblastdb.name=makeblastdb
makeblastdb.args="flat/swissprot" "blast/" "-dbtype prot" "swissprot"
makeblastdb.exe=makeblastdb.sh
How to specify the packages to install?
Create the <list>.txt file with all the conda packages you want to install (version numbers are available on bioconda, for example here), and place it in your biomaj/process directory:
bwa=0.7.8
blast=2.5.0
bowtie2=18.104.22.168
Then complete the conda.args line of your bank.properties file with your environment name and the name of your <list>.txt file.
conda.args=<environment name> $processdir/<list of packages>.txt $processdir
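Before pointing conda.args at a package list, it can help to check that the file is well formed. The sketch below assumes one pinned package per line in the form name=version; check_package_list is a hypothetical helper and is not part of BioMAJ or of the conda wrapper.

```shell
#!/bin/bash
# Hypothetical check: every non-empty line of the list must look like name=version.
# Assumes one package per line; unpinned package names are rejected by this sketch.
check_package_list() {
    list_file="$1"
    if [ ! -r "$list_file" ]; then
        echo "cannot read $list_file" >&2
        return 1
    fi
    bad=0
    while IFS= read -r entry || [ -n "$entry" ]; do
        [ -z "$entry" ] && continue
        case "$entry" in
            *=*=*|=*|*=|*" "*) echo "invalid entry: $entry" >&2; bad=1 ;;
            *=*) ;;           # looks like name=version
            *) echo "invalid entry: $entry" >&2; bad=1 ;;
        esac
    done < "$list_file"
    return "$bad"
}

# Example with a scratch file standing in for <list>.txt
list_file="$(mktemp)"
printf '%s\n' "bwa=0.7.8" "blast=2.5.0" > "$list_file"
check_package_list "$list_file" && echo "package list looks valid"
```

The check only looks at the shape of each line; whether a given version actually exists on bioconda still has to be verified there.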
How to use your conda environment?
In your post process script, do not forget to activate your environment (in our example, the makeblastdb.sh file):
source activate blast
Do not forget to deactivate your environment at the end of your script:
To delete your environment:
rm -r biomaj/process/blast