Need help?

Check out our forum: https://groups.google.com/forum/#!forum/biomaj

Some community db property files and post processes can be found at https://github.com/genouest/biomaj

Get involved and share your data banks and processes with the community. Simply fork the repository, make your updates, and open a pull request.

Bank properties

A BioMAJ bank is mainly defined by a set of properties stored in a .properties file. Although BmajWatcher has made these properties easier to access, they can be confusing for a new user.

The most useful properties will be presented here with a concrete example to clear things up.

Let's say we want to download the uniprot bank from here:

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/

We need to tell BioMAJ where to find the files on the remote server. This is done with 4 properties :

    • server : Self explanatory : ftp.ebi.ac.uk
    • remote.dir : path to directory that contains the required files, note that the '/' at the beginning is mandatory : /pub/databases/uniprot/current_release/knowledgebase/complete/
    • remote.files : contains one or more regular expressions that describe the files to retrieve. Let's say we want the following files : 
      • All uniprot files : ^uniprot.*
      • reldate.txt : reldate.txt
      • All README files : ^README.*
      Final value is : ^uniprot.* reldate.txt ^README.*
    • protocol : tells what protocol to use to download the file. The protocol must not appear in the server property. In our case it is ftp.
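As a rough illustration of how a space-separated list of regular expressions selects files, here is a minimal sketch. It assumes grep -E as a stand-in for BioMAJ's own regular-expression engine, and the file names in the listing are hypothetical.

```shell
#!/bin/sh
# Sketch only: mimics how the remote.files value could select names from a
# remote directory listing. grep -E stands in for BioMAJ's regex engine and
# the file names used below are hypothetical.

set -f   # disable globbing so that '*' in the expressions is never expanded
remote_files='^uniprot.* reldate.txt ^README.*'

# Print "keep" if NAME matches at least one expression, "skip" otherwise.
select_file() {
    name=$1
    for expr in $remote_files; do
        if printf '%s\n' "$name" | grep -Eq -- "$expr"; then
            echo keep
            return 0
        fi
    done
    echo skip
}

for name in uniprot_sprot.fasta.gz reldate.txt README LICENSE; do
    printf '%s: %s\n' "$name" "$(select_file "$name")"
done
```

With grep's search semantics, an unanchored expression such as reldate.txt also matches longer names that merely contain it; anchoring it as ^reldate.txt$ restricts the match to that exact name.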
  We also need to tell BioMAJ where to store the files locally. 3 properties are involved :
    • data.dir : this is the root directory where all banks are downloaded : /db/
    • dir.version : this property is optional, it tells where the bank will be downloaded under data.dir. The default value is the bank name : uniprot
    • offline.dir.name : another optional property that sets the name of the temporary directory. Files are first downloaded to that directory, then moved to dir.version. The default value is <bankname>_tmp : uniprot_tmp
    • An additional property related to your repository management is keep.old.version. It basically tells how many versions of your bank you want BioMAJ to keep. If value is 1, BioMAJ will keep the current version and the previous one.
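The way these values combine into actual paths can be sketched as follows. The two helper functions are hypothetical (they are not part of BioMAJ); they only apply the defaults described above when the optional values are omitted.

```shell
#!/bin/sh
# Sketch only: derives the production and temporary directories from the
# property values, applying the documented defaults when a value is omitted.
# Both functions are hypothetical helpers, not BioMAJ commands.

# production_dir DATA_DIR BANK_NAME [DIR_VERSION]
production_dir() {
    data_dir=${1%/}                  # drop a trailing slash: /db/ -> /db
    printf '%s/%s\n' "$data_dir" "${3:-$2}"
}

# offline_dir DATA_DIR BANK_NAME [OFFLINE_DIR_NAME]
offline_dir() {
    data_dir=${1%/}
    printf '%s/%s\n' "$data_dir" "${3:-${2}_tmp}"
}

production_dir /db/ uniprot   # prints /db/uniprot
offline_dir /db/ uniprot      # prints /db/uniprot_tmp
```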
  Finally, we can tell BioMAJ which of the downloaded files to move to production :
    • local.files : Once files have been downloaded and extracted (if needed) in the data.dir/offline.dir.name directory, they are moved to the production directory (data.dir/dir.version). This property lets you tell BioMAJ to move only some of these files. As with remote.files, the value is a regular expression. In our case, let's say we want to move everything : .*
    Some properties that describe your bank : 
  • db.name : this property holds the name of the bank. uniprot in our case
  • db.fullname : the field usually contains a description about the bank, for example : Some description...
  • db.formats : For informational and classification purposes only, you can specify the data formats of your bank : fasta,blast,xml,xsd,swissprot,emboss. This allows you to filter the bank list in the web interface or via the REST API.
  • db.type : As for db.formats, you can specify the bank type. In our case it's proteic. Only one type can be specified for a bank, but you can structure it hierarchically with slashes, for example genomic/eucaryotic. It is used in the web interface to display the different types as a tree.
  • bank.num.threads : Tells how many banks can be updated in parallel. Useful for batch updates : biomaj --update bank1 bank2 bank3 bank4.... We can set the value to 2.
  • files.num.threads : Tells how many files BioMAJ can download in parallel for a bank. Let's say 4.
  • log.files : BioMAJ records every downloaded file in its database. When an update process is run, it verifies that none of these files have been removed. If this property is activated, BioMAJ will also handle archives and verify that all the extracted files are present. Set this property to true to activate it.
  • release.dateformat : Date format used to build the version : yyyy-MM-dd
  • frequency.update : Usually set to 0, this property holds the number of days BioMAJ has to wait between two updates. For example, if the value is 15, the bank will be updated at most once every 15 days.
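The effect of frequency.update can be pictured with the small sketch below. The function name is hypothetical, and comparing whole elapsed days is an assumption: the exact rule BioMAJ applies is not stated here.

```shell
#!/bin/sh
# Sketch only: decides whether a bank is due for an update, given the time of
# the last update and the current time (both in seconds since the epoch) and
# the frequency.update value in days. Comparing whole elapsed days is an
# assumption about how the delay is evaluated.

# update_due LAST_EPOCH NOW_EPOCH FREQUENCY_DAYS
update_due() {
    elapsed_days=$(( ($2 - $1) / 86400 ))
    if [ "$elapsed_days" -ge "$3" ]; then
        echo yes
    else
        echo no
    fi
}

now=$(date +%s)
update_due $(( now - 14 * 86400 )) "$now" 15   # 14 days elapsed: no
update_due $(( now - 15 * 86400 )) "$now" 15   # 15 days elapsed: yes
update_due "$now" "$now" 0                     # frequency 0: yes
```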

So far, our file looks like :

db.name=uniprot
db.fullname=Some description...
db.formats=fasta,blast,xml,xsd,swissprot,emboss
db.type=proteic
server=ftp.ebi.ac.uk
protocol=ftp
remote.dir=/pub/databases/uniprot/current_release/knowledgebase/complete/
remote.files=^uniprot.*$ ^reldate.txt$ ^README.*$
local.files=.*
frequency.update=0
data.dir=/db/
offline.dir.name=uniprot_tmp
dir.version=uniprot
files.num.threads=4
bank.num.threads=2
release.dateformat=yyyy-MM-dd
log.files=true
keep.old.version=1
 
  1. As we have not told BioMAJ how to determine the bank version, the default behaviour is to use the date of the most recent file on the server as the release number. In our case, that would be 2011-02-08 (formatted as defined in release.dateformat). There are two other ways to obtain the release:
    • From a specific file content :
      • release.file : that property tells what file on the remote server to look into : reldate.txt
      • release.regexp : The value of this property is a regular expression that describes the string that must be considered as the release : \d+\p{Punct}+\d*
      • Back to our example, if we consider the content of reldate.txt :
        UniProt Knowledgebase Release 2011_02 consists of:
        UniProtKB/Swiss-Prot Release 2011_02 of 08-Feb-2011
        UniProtKB/TrEMBL Release 2011_02 of 08-Feb-2011
        The release would be 2011_02.
    • From a file name : The difference from the above is that you must not specify release.file. If you put the part of the regular expression that corresponds to the version in parentheses, BioMAJ will automatically try to extract it from the names of the remote.files. For example, if the remote server has a file named version_file_12.3.txt, the following value for release.regexp would return 12.3 : ^version_file_(\p{Digit}+\.\p{Digit}+).*
  2. The file now looks like :

     
    db.name=uniprot
    db.fullname=Some description...
    db.formats=fasta,blast,xml,xsd,swissprot,emboss
    db.type=proteic
    server=ftp.ebi.ac.uk
    protocol=ftp
    remote.dir=/pub/databases/uniprot/current_release/knowledgebase/complete/
    remote.files=^uniprot.*$ ^reldate.txt$ ^README.*$
    local.files=.*
    release.file=reldate.txt
    release.regexp=\d+\p{Punct}+\d*
    frequency.update=0
    data.dir=/db/
    offline.dir.name=uniprot_tmp
    dir.version=uniprot
    files.num.threads=4
    bank.num.threads=2
    release.dateformat=yyyy-MM-dd
    log.files=true
    keep.old.version=1
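The two ways of extracting a release with release.regexp can be tried outside BioMAJ with a minimal sketch like the one below. It assumes POSIX character classes as rough equivalents of the classes used in the properties ([0-9] for \d and \p{Digit}, [[:punct:]] for \p{Punct}), and it assumes the first match is the one retained. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: reproduces the two extraction modes with grep and sed.
# POSIX classes replace the ones used in the properties file, and keeping
# the first match is an assumption about BioMAJ's behaviour.
# Note: grep -o is a common extension (GNU and BSD grep), not strict POSIX.

# Release taken from the content of a file (release.file + release.regexp).
release_from_content() {
    grep -oE '[0-9]+[[:punct:]]+[0-9]*' | head -n 1
}

# Release taken from a file name: the parenthesised group is the version.
release_from_name() {
    printf '%s\n' "$1" | sed -nE 's/^version_file_([0-9]+\.[0-9]+).*/\1/p'
}

reldate='UniProt Knowledgebase Release 2011_02 consists of:
UniProtKB/Swiss-Prot Release 2011_02 of 08-Feb-2011
UniProtKB/TrEMBL Release 2011_02 of 08-Feb-2011'

printf '%s\n' "$reldate" | release_from_content   # prints 2011_02
release_from_name version_file_12.3.txt           # prints 12.3
```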
     
    • Logging level : You can change the logging level with the property historic.logfile.level with the following values : ERR, WARN, INFO, VERBOSE, DEBUG
    • Mailing : BioMAJ can mail a report after each bank update. The subject of the mail contains 4 items :
      • The bank name
      • The workflow status STATUS[TRUE|FALSE] : TRUE means that everything went well (at least from BioMAJ's point of view), FALSE means that the process failed.
      • The update status UPDATE[TRUE|FALSE] : TRUE means that a new version was found on the remote server. Note that it does not mean that the update succeeded, just that BioMAJ downloaded or will have to download new files. FALSE means that we already have the latest version.
      • If UPDATE is TRUE, the fourth item is the version found on the remote server.
      To activate mail reporting, you have to fill in the following properties:
      • mail.from : mail address of the sender
      • mail.smtp.host : smtp server address
      • mail.admin : comma-separated list of mail addresses the reports will be sent to.

We have covered most of the useful properties, but the way they are organized can also be improved. You might have noticed that some properties, such as data.dir or bank.num.threads, can be common to all the banks. You can declare such properties in a special file named global.properties. That file holds any property that you want all your banks to inherit (the global.properties file shipped with BioMAJ contains most of these properties).

BioMAJ can also handle another level of inclusion for properties common to only a few banks. These common properties are defined in a separate file that must be explicitly included in the appropriate banks with a special property, include.properties. For example : include.properties=file1.properties,file2.properties
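To picture how a property could be resolved across these levels, here is a minimal sketch. The lookup function and the file proteins.properties are hypothetical, and the search order used (bank file, then included file, then global.properties) is an assumption made for illustration only.

```shell
#!/bin/sh
# Sketch only: looks up a property in several .properties files, in order,
# and prints the first definition found. The search order used below
# (bank file, included file, global.properties) is an assumption;
# BioMAJ's real precedence rules may differ.

# lookup_property KEY FILE...
lookup_property() {
    key=$1
    shift
    for file in "$@"; do
        [ -f "$file" ] || continue
        # Literal prefix match on "KEY=", so dots in the key are not wildcards.
        value=$(awk -v k="$key=" \
            'index($0, k) == 1 { print substr($0, length(k) + 1); exit }' "$file")
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done
    return 1
}

# Hypothetical files created in a temporary directory for the demonstration.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT
printf 'data.dir=/db/\nbank.num.threads=2\n' > "$demo_dir/global.properties"
printf 'db.type=proteic\n' > "$demo_dir/proteins.properties"
printf 'db.name=uniprot\ninclude.properties=proteins.properties\n' > "$demo_dir/uniprot.properties"

for key in db.name db.type data.dir; do
    lookup_property "$key" \
        "$demo_dir/uniprot.properties" \
        "$demo_dir/proteins.properties" \
        "$demo_dir/global.properties"
done
```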

Writing a pre/post process

Even though BioMAJ is shipped with some post-processes, you might need to write your own.

Here are some points that will help you in the process :

  • BioMAJ can run scripts or binaries
  • The executables must be placed in the directory process.dir
  • You have access to several environment variables that are set by BioMAJ. The exhaustive list is available in the user guide, but the most useful are :
    • $datadir : root directory for all production directories
    • $dirversion : production directory
    • The current location of the files in the post-process stage can be obtained by : $datadir/$dirversion/future_release
    • The downloaded raw data is available in the directory : $datadir/$dirversion/future_release/flat
  • BioMAJ allows basic and advanced interaction with the executable
    • Basic interaction : BioMAJ retrieves the return value at the end of the program. If no value is returned or the value is zero, the execution of the program is considered successful. If any other value is received, an error is raised and the workflow is stopped.
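The basic interaction described above boils down to the exit status of the executable. A minimal sketch, assuming a hypothetical helper named check_flat_dir that only verifies the raw data directory exists:

```shell
#!/bin/sh
# Sketch only: a post-process reports success or failure through its exit
# status. check_flat_dir is a hypothetical helper; in a real post-process its
# arguments would come from the datadir and dirversion variables set by BioMAJ.

# check_flat_dir DATADIR DIRVERSION
check_flat_dir() {
    flat="$1/$2/future_release/flat"
    if [ -d "$flat" ]; then
        return 0      # zero status: the step is considered successful
    fi
    echo "missing directory: $flat" >&2
    return 1          # any non-zero status is reported as an error
}

# A real post-process would typically end with something like:
#   check_flat_dir "$datadir" "$dirversion" || exit 1
```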

An example of a script that concatenates several files into one, and adds the resulting file as a dependency :

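# The raw downloaded files live in the "flat" directory of the release being built.
workdir="$datadir/$dirversion/future_release/flat"
cd "$workdir" || exit 1
# Collect the file names given as arguments (names are assumed to contain no spaces).
files=""
for file in "$@"
do
    files="$files $file"
done
# Concatenate them into one file; $files is left unquoted so it splits into names.
cat $files > "$workdir/output" || exit 1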