Batch deployment

Installation of Dynpart

Dynpart is currently available in the INDIGO-1 repository (CentOS only) at this location; the INDIGO-2 release will be available soon.

Install the LSF side package:

FROM RPM

You must have the EPEL repository enabled:

$ yum install epel-release

Then you have to enable the INDIGO-DataCloud package repositories. See the full instructions here. Briefly, you have to download the repo file from the INDIGO SW Repository into your /etc/yum.repos.d folder.

$ cd /etc/yum.repos.d
$ wget http://repo.indigo-datacloud.eu/repos/2/indigo2.repo

Finally, install the Dynpart package:

$ yum install python-lsf-dynpart-partition-director

Updating Dynpart

Install the LSF side package:

FROM RPM

For updating from the INDIGO-1 release:

Make sure you have added the INDIGO package repository to your package sources. The package repository can be found at the INDIGO SW Repository.

Update the Dynpart package:

$ sudo yum update python-lsf-dynpart-partition-director

On the LSF master, installing this package basically creates and deploys the following directories and files:

mkdir -p $LSF_TOP/var/tmp/cloudside/
mkdir -p $LSF_TOP/var/tmp/batchside/
mkdir -p $LSF_TOP/conf/scripts/dynpart/
cp dynp.conf dynp.conf.template elim.dynp esub.dynp bjobs_r.py $LSF_TOP/conf/scripts/dynpart/
cp farm.json $LSF_TOP/var/tmp/cloudside/
/usr/bin/adjust_lsf_shares.py
/usr/bin/submitter_demo.py

IMPORTANT NOTE:

Please create these links according to the variable $LSF_SERVERDIR, which depends on your LSF installation.

ln -s $LSF_TOP/conf/scripts/dynpart/elim.dynp $LSF_SERVERDIR/elim.dynp
ln -s $LSF_TOP/conf/scripts/dynpart/esub.dynp $LSF_SERVERDIR/esub.dynp

and check whether this link exists; if not, create it:

ln -s   $LSF_TOP/conf/scripts/dynpart/dynp.conf /etc/indigo/dynpart/dynp.conf

Retrieving the list of running batch jobs

The list of running jobs on each batch host can be retrieved in two alternative ways:

  1. Compiling the C program:

The mcjobs_r.c C program queries LSF through its APIs to retrieve the list of running jobs on each host. A pre-compiled binary cannot be distributed due to licensing constraints, so it must be compiled locally. The following is an example compile command on LSF 9.1; please adapt it to your specific setup.

cd $LSF_TOP/conf/scripts/dynpart/
gcc mcjobs_r.c -I/usr/share/lsf/9.1/include/ /usr/share/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/libbat.a /usr/share/lsf/9.1/linux2.6-glibc2.3-x86_64/lib/liblsf.a -lm -lnsl -ldl -o mcjobs_r
  2. Python script:

An alternative to compiling mcjobs_r.c is the bjobs_r.py script, which produces the same result. It uses the batch command 'bjobs' to retrieve the number of running jobs on a given host.
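
The packaged bjobs_r.py is installed under $LSF_TOP/conf/scripts/dynpart/. The snippet below is only a minimal sketch of the same idea (counting the running jobs that 'bjobs' reports for one host); the option set, the host argument, and the output format are assumptions, not the packaged script.

#!/usr/bin/env python
# Illustrative sketch only: count the running jobs that 'bjobs' reports
# for one host. The packaged bjobs_r.py may use different options/output.
import subprocess
import sys

def running_jobs(host):
    proc = subprocess.Popen(["bjobs", "-u", "all", "-r", "-m", host],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    lines = [l for l in out.decode().splitlines() if l.strip()]
    if not lines or lines[0].startswith("No "):
        return 0          # 'No unfinished job found' or empty output
    # First line is the bjobs header; the remaining lines are job records
    # (rough count: continuation lines of parallel jobs are not merged).
    return len(lines) - 1

if __name__ == "__main__":
    print(running_jobs(sys.argv[1]))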

Edit LSF configuration files

  • In the /usr/share/lsf/conf/lsf.cluster.<clustername> file, check the Host section.

    In the Host section, specify usage of the dynp elim on each WN participating in the dynamic partitioning. The following is an example Host section:

Begin   Host
HOSTNAME           model  type  server  r1m   mem  swp  RESOURCES    #Keywords
#lsf master
lsf9test           !      !     1       3.5   ()   ()   (mg)
#Cloud Controller for Dynamic Partitioning
t1-cloudcc-02      !      !     1       3.5   ()   ()   (mg)
wn-206-01-01-01-b  !      !     1       3.5   ()   ()   (dynp)
wn-206-01-01-02-b  !      !     1       3.5   ()   ()   (dynp)
End     Host
  • Define the dynp External Load Index in the Resource section of lsf.shared:

Begin Resource
RESOURCENAME  TYPE    INTERVAL INCREASING  DESCRIPTION        # Keywords

   dynp    Numeric 60      Y        (dynpart: 1 batch, 2 cloud)

[....]
End Resource
  • Declare the use of the custom esub method by adding the following to lsf.conf:

LSB_ESUB_METHOD="dynp"

Note: the provided esub.dynp assumes that no other esub method is in place. If another esub method is already in use, you must adapt it to your specific case.
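
For background only, here is a minimal sketch of the generic LSF esub mechanism that esub.dynp builds on: at submission time the esub can write option overrides to the file named by $LSB_SUB_MODIFY_FILE. The resource-requirement string used here is an assumption for illustration and is not taken from the packaged esub.dynp; esub programs are commonly shell scripts, and Python is used only for consistency with the other sketches.

#!/usr/bin/env python
# Sketch of the generic LSF esub mechanism; the packaged esub.dynp
# contains the real dynp-specific logic and may differ substantially.
import os
import sys

# The current submission options can be inspected via the file named by
# $LSB_SUB_PARM_FILE if they need to be taken into account.
modify_file = os.environ.get("LSB_SUB_MODIFY_FILE")  # overrides are written here

if not modify_file:
    sys.exit(0)   # not invoked through bsub; nothing to do

# Assumed example (not taken from esub.dynp): steer jobs onto hosts whose
# dynp index reports the batch value by appending a resource requirement.
with open(modify_file, "a") as f:
    f.write('LSB_SUB_RES_REQ="select[dynp==1]"\n')

sys.exit(0)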

  • Verify that the LSF configuration is OK using the command:

lsadmin ckconfig

If everything is OK (no errors found), reconfigure and restart the LIM on all nodes in the cluster:

[root@lsf9test ~]# lsadmin reconfig
Checking configuration files ... No errors found.
Restart only the master candidate hosts? [y/n] n
Do you really want to restart LIMs on all hosts? [y/n] y
Restart LIM on ...... done
Restart LIM on ...... done
Restart LIM on ...... done
Restart LIM on ...... done

Note: you can also manually restart the LIM on only a subset of nodes, if needed. For example, if you configure dynp for additional nodes in lsf.cluster.<clustername> and want to make them partition-aware, you can restrict limrestart to those nodes only.

Next, restart the Master Batch Daemon:

[root@lsf9test ~]# badmin mbdrestart

Shortly afterwards, the header line of the lsload -l output will display the new External Load Information index dynp.

After some time (limrestart takes several minutes to take effect, even on a small cluster), the value 1 should be reported by each node configured for dynp; other cluster members will display a dash.

Main Dynpart component on the LSF side:

  • elim.dynp

This is a custom External Load Information Manager (elim), specific to LSF, created to enable the implementation of the Functionalities and conformant to the LSF guidelines. It assumes that it has been properly configured on the batch system side.
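
For reference, an LSF elim is simply an executable that periodically writes a line of the form '<number_of_indices> <name1> <value1> ...' to stdout. The sketch below shows only that reporting loop for the dynp index; it is not the packaged elim.dynp, and the state file it consults to decide between 1 (batch) and 2 (cloud) is a hypothetical placeholder.

#!/usr/bin/env python
# Sketch of the generic LSF elim reporting loop for the 'dynp' index only.
# The packaged elim.dynp contains the real decision logic; the state file
# used below is a hypothetical placeholder for illustration.
import sys
import time

STATE_FILE = "/etc/indigo/dynpart/host_role"   # hypothetical path
INTERVAL = 60                                  # should match INTERVAL in lsf.shared

def current_dynp_value():
    # 1 = batch, 2 = cloud, as declared in the lsf.shared description above
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return 1   # fall back to the batch role if the state cannot be read

while True:
    # elim protocol: "<number_of_indices> <name1> <value1> ..." on one line
    sys.stdout.write("1 dynp %d\n" % current_dynp_value())
    sys.stdout.flush()   # the LIM reads the line once it is flushed
    time.sleep(INTERVAL)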

To simulate concurrent activity, the following tool is provided:

  • submitter_demo.py

    keeps submitting jobs to a specified queue
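
submitter_demo.py is shipped with the package; purely to illustrate what "keeps submitting jobs to a specified queue" means, a minimal sketch could look like the following (the default queue name, the dummy job, and the pacing are assumptions, not the packaged script).

#!/usr/bin/env python
# Illustrative sketch only: keep submitting short dummy jobs to one queue.
# The packaged submitter_demo.py may take different options and payloads.
import subprocess
import sys
import time

queue = sys.argv[1] if len(sys.argv) > 1 else "normal"   # assumed default queue

while True:
    # Submit a trivial job; 'sleep 60' stands in for a real workload.
    subprocess.call(["bsub", "-q", queue, "sleep", "60"])
    time.sleep(5)   # arbitrary pacing between submissions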
