What is a process?
A process in BioMAJ applies a transformation (BLAST indexing, EMBOSS indexing, …) to the data bank.
- The process manager handles their execution sequentially or in parallel
- Each process is a tool executable together with its arguments
- A process must return an exit code of 0; otherwise it is considered to have failed.
- Per-process logs (STDOUT/STDERR) are recorded in the bank log directory
- Multiple execution environments are available (local, docker, drmaa)
- The .exe executables are added to the PATH.
- Any .exe can be called
- The processes can be stored in /process
- Help with downloading files protected by credentials
- File recovery from a private bank
- Checking the disk space
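The disk-space check mentioned above could be written as a small script. This is a minimal sketch, not a script shipped with BioMAJ: the function name has_free_space and the idea of passing the threshold (in KB) as the first argument are assumptions made for illustration. $datadir is the BioMAJ variable for the root of the production directories.

```shell
#!/bin/sh
# Sketch of a disk-space check (illustrative, not part of BioMAJ).

# Succeeds when the filesystem holding directory $1 has at least $2 KB available.
has_free_space() {
    dir=$1
    required_kb=$2
    # df -P prints one POSIX-format line per filesystem; with -k, column 4
    # of the second line is the available space in KB.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}

# When a threshold is given as first argument, run the check and report
# a failure through a non-zero exit code.
if [ "$#" -ge 1 ]; then
    if ! has_free_space "${datadir:-.}" "$1"; then
        echo "Not enough free space under ${datadir:-.}" >&2
        exit 1
    fi
fi
```

Called without arguments the script does nothing, which makes the function easy to try by hand before wiring it into a bank.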
How to call pre-process?
    BLOCK1.db.pre.process=META1,META2,META3
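An example of a script that concatenates several files into one, and adds the resulting file as a dependency:
    # Work in the flat directory of the current release
    workdir=$datadir/$dirversion/$localrelease/flat
    cd "$workdir" || exit 1
    # Collect the file names passed as arguments
    for file in "$@"
    do
        files="$files $file"
    done
    # $files is left unquoted on purpose so that it splits into one name per file
    cat $files > "$workdir/output"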
How to call post-process?
    BLOCK1.db.post.process=META0
How to create one?
Even though BioMAJ is shipped with some post-processes, you might need to write your own.
Here are some points that will help you along the way:
- BioMAJ can run scripts or binaries
- The executables must be placed in the directory
- You have access to several environment variables that are set by BioMAJ. The exhaustive list is available here, but the most useful are:
$datadir: root directory for all production directories
$dirversion: production directory
$offlinedir: temporary directory
$localrelease: directory of the release in $dirversion
$remoterelease: version number (available only for post-processes)
- The current location of the files in the post-process stage can be obtained by :
- The downloaded raw data is available in the directory :
- BioMAJ allows basic and advanced interaction with the executable
- Basic interaction: BioMAJ reads the return value when the program ends. If no value is returned or the value is 0, the execution is considered successful. Any other value is treated as an error and the workflow is stopped.
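The points above can be combined into a small post-process skeleton. This is a sketch under assumptions: the function names require_vars, release_dir and main are invented for illustration, and the release path is built the same way as in the concatenation example earlier on this page.

```shell
#!/bin/sh
# Skeleton of a post-process script (illustrative sketch, not shipped with BioMAJ).

# Fail with a message when one of the named variables is unset or empty.
require_vars() {
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing environment variable: $name" >&2
            return 1
        fi
    done
}

# Print the directory of the current release.
release_dir() {
    echo "$datadir/$dirversion/$localrelease"
}

# Returns 0 on success; any other value marks the process as failed.
main() {
    require_vars datadir dirversion localrelease || return 1
    dir=$(release_dir)
    [ -d "$dir" ] || { echo "No such directory: $dir" >&2; return 1; }
    echo "Processing files in $dir"
    # Call the real tool here; the status of the last command is returned.
}
```

Ending the script with main "$@" makes the return value of main the exit code of the script, which is the value BioMAJ inspects.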
How to call post process in the bank.properties? (example for the alu bank)
Post-processes are defined in the configuration through the BLOCKS property. BLOCKS lists the blocks, and each block lists its META processes. Blocks are executed sequentially; within a block, the META processes are executed in parallel. Each META defines a list of processes that are executed sequentially. By default, 2 threads are used for parallel execution.
    ## Post Process ## The files should be located in the projectfiles/process directory
    BLOCKS=BLOCK1
    BLOCK1.db.post.process=META0
    META0=test_biomaj
    test_biomaj.name=biomaj_test
    #test_biomaj.desc=test biomaj
    test_biomaj.cluster=false
    test_biomaj.type=test
    test_biomaj.exe=./<name_of_the_script>.sh
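To make the execution order concrete, the following shell sketch mimics it with background jobs. It is only an illustration of the ordering rules described above, not how BioMAJ is implemented: run_meta, run_block and the meta1, meta2, meta3 functions are hypothetical names, the processes are stand-ins (true), and the default limit of 2 threads is not modelled.

```shell
#!/bin/sh
# Illustration only: mimics the BLOCKS / META execution order with shell jobs.

# A META runs its processes one after the other and stops at the first failure.
run_meta() {
    meta_name=$1
    shift
    for proc in "$@"; do
        echo "$meta_name: running $proc"
        "$proc" || return 1
    done
}

# A block starts all of its METAs in parallel and waits for every one of them.
run_block() {
    status=0
    pids=""
    for meta in "$@"; do
        "$meta" &              # each META is a shell function in this sketch
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}

# Stand-in METAs; "true" plays the role of a process that succeeds.
meta1() { run_meta META1 true true; }
meta2() { run_meta META2 true; }
meta3() { run_meta META3 true; }

# First block: meta1 and meta2 in parallel. Second block: meta3, only afterwards.
run_block meta1 meta2 && run_block meta3
```

The lines printed by META1 and META2 may interleave in any order, but the META3 line always comes last because its block starts only after the first block has finished.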
- Use: PROC0.<option>
- PROC0.name: process name
- PROC0.desc: process description
- PROC0.cluster: local execution or on a cluster
- PROC0.type: script type executed
- PROC0.exe: executable, shell command… (executed from process.dir or from the PATH)
- PROC0.args: arguments needed by the executable
- PROC0.native: native cluster option (queue…)
How to define meta information for processes?
- PROC0.format: process name (blast)
- PROC0.types: data types (nucleic)
- PROC0.tags: tags describing what the data belongs to (chr:chr1, organism:hg19)
- PROC0.files: list of generated files (dir1/file1, dir1.file2); the paths are relative to the directory of the release.
Syntax: ##BIOMAJ#format#list_of_types#list_of_key_value_tags#list_of_files
    echo "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr1#blast/chr1/chr1db"
    echo "##BIOMAJ#blast#nucleic#organism:hg19,chr:chr2#blast/chr2/chr2db"
    echo "any text"
    echo "##BIOMAJ#fasta#proteic#organism:hg19#fasta/chr1.fa,fasta/chr2.fa"
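A process that generates several files can build these lines with a small helper instead of writing each one by hand. The sketch below only follows the syntax given above; the function name biomaj_meta is an assumption made for illustration and is not a BioMAJ command.

```shell
#!/bin/sh
# Helper that prints one metadata line in the ##BIOMAJ# syntax.
# biomaj_meta is a hypothetical name, not part of BioMAJ.
# Usage: biomaj_meta FORMAT TYPES TAGS FILE [FILE...]
biomaj_meta() {
    format=$1
    types=$2
    tags=$3
    shift 3
    # Join the remaining arguments (the generated files) with commas.
    files=$(IFS=,; echo "$*")
    echo "##BIOMAJ#$format#$types#$tags#$files"
}

biomaj_meta blast nucleic "organism:hg19,chr:chr1" blast/chr1/chr1db
biomaj_meta fasta proteic "organism:hg19" fasta/chr1.fa fasta/chr2.fa
```

The two calls print the same lines as the first and last metadata lines of the example above.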
More information (how to add metadata, how to add many processes) here.
More examples here.