Although we can expect (or wish?) that most bioinformatics applications will eventually support multi-core (and perhaps grid) computing, most current applications were not written with multi-core CPUs in mind (although some are already capable of using grids). Here we discuss three typical scenarios that users might encounter. Each scenario is illustrated with at least one example application.

Single-core application to be run multiple independent times

Some applications are run many independent times in order to determine the intervals within which certain parameters are expected to fall.

One example is population genetics simulators (both coalescent and forward-time) such as CoaSim or simuPOP. These simulators are run thousands of times under some demographic scenario in order to determine, for example, the intervals within which certain statistics (e.g. Fst) fall for neutral markers (i.e., markers that are not candidates for selection).

The strategy here is quite simple: divide the workload into a number of tasks equal to the number of cores available.

As an example, imagine that you want to run 10,000 population genetics simulations using simuPOP and you have a machine with 8 cores. You simply instruct 8 simuPOP instances to run 1,250 simulations each.

There are a few issues that require some care, though:

  1. Make sure that each instance writes to a different output directory (and possibly reads different input files).
  2. The running instances really have to be independent: make sure that no two instances share a random seed. If the random seed is specified in one of the input files, then each instance needs its own input files.
  3. In the end you will have to combine the results from all instances in some way, for example by concatenating them.
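
The split and the three precautions above can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_batch` is a hypothetical stand-in that writes random numbers instead of calling a real simulator, and the names `split_workload`, `make_tasks`, `run_all`, the `worker_N` directory layout and the master seed are all invented for the example. Drawing distinct seeds from a master generator guarantees that no two instances share a seed; whether that is enough for your simulator depends on its random number generator.

```python
import os
import random
from concurrent.futures import ProcessPoolExecutor


def split_workload(total_runs, n_workers):
    """Split total_runs into n_workers batch sizes that differ by at most one."""
    base, extra = divmod(total_runs, n_workers)
    return [base + (1 if i < extra else 0) for i in range(n_workers)]


def make_tasks(total_runs, n_workers, base_dir, master_seed=12345):
    """Give each worker its own batch size, seed, and output directory."""
    sizes = split_workload(total_runs, n_workers)
    # Draw distinct seeds so that no two instances replay the same stream.
    seeds = random.Random(master_seed).sample(range(1, 2**31), n_workers)
    return [
        (sizes[i], seeds[i], os.path.join(base_dir, f"worker_{i}"))
        for i in range(n_workers)
    ]


def run_batch(task):
    """Hypothetical stand-in for one simulator instance (not simuPOP code)."""
    n_runs, seed, out_dir = task
    os.makedirs(out_dir, exist_ok=True)
    rng = random.Random(seed)
    out_path = os.path.join(out_dir, "results.txt")
    with open(out_path, "w") as out:
        for _ in range(n_runs):
            # Placeholder for one simulated statistic, e.g. an Fst value.
            out.write(f"{rng.random():.6f}\n")
    return out_path


def run_all(total_runs, n_workers, base_dir):
    """Run all batches in parallel, then concatenate the per-worker results."""
    tasks = make_tasks(total_runs, n_workers, base_dir)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        paths = list(pool.map(run_batch, tasks))
    merged_path = os.path.join(base_dir, "all_results.txt")
    with open(merged_path, "w") as merged:
        for path in paths:
            with open(path) as part:
                merged.write(part.read())
    return merged_path
```

For the example in the text you would call `run_all(10000, 8, "sims")`, ideally from inside an `if __name__ == "__main__":` guard so that worker processes can be started safely on every platform.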

Programs that are grid-ready

Some programs, like Migrate, were designed to be run in a parallel environment, normally MPI (Message Passing Interface). It is very easy to make these programs use multiple cores: install MPI on a single machine and configure it so that the maximum number of processes that can be run locally is equal to the number of cores. Then launch your application with mpirun, asking for a number of processes equal to the number of cores (normally the parameter is -np).
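
As a rough sketch of that last step, the following Python helper builds an `mpirun -np N` command line with one process per core. The helper names and the program path in the usage note are hypothetical, and the sketch assumes that `mpirun` is on the PATH and that your MPI installation accepts `-np` as described above. Note that `os.cpu_count()` reports logical CPUs, which can be more than the number of physical cores.

```python
import os
import subprocess


def build_mpirun_command(program, program_args=(), n_procs=None):
    """Build an mpirun command line using one process per core."""
    if n_procs is None:
        # os.cpu_count() reports logical CPUs and may return None.
        n_procs = os.cpu_count() or 1
    return ["mpirun", "-np", str(n_procs), program, *program_args]


def launch(program, program_args=(), n_procs=None):
    """Run the command and raise CalledProcessError if it exits with an error."""
    command = build_mpirun_command(program, program_args, n_procs)
    return subprocess.run(command, check=True)
```

For instance, `build_mpirun_command("./my_mpi_program", n_procs=8)` gives the argument list for `mpirun -np 8 ./my_mpi_program`, where `./my_mpi_program` is a placeholder for whatever MPI-enabled binary you have installed.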

PS - Regarding forward-time simulators like simuPOP: some of these are MPI-based, but they only allow you to parallelize a single simulation, e.g. by simulating each population on a different node. If the objective is to run a single very long simulation, then simuPOP falls under the category of grid-ready, but if the objective is to run many simulations, then it becomes an example of a serial application that is run multiple independent times (i.e., the previous scenario).

Other cases

In the remaining cases, where you only want to run a single instance and the program does not support MPI, there is really no way around it; programs like PAUP* come to mind. This is especially frustrating when the code is closed source, as some programs internally perform multiple independent runs of something and could be changed to take advantage of multiple cores. Some programs implementing Maximum Likelihood approaches would probably be fairly easy to parallelize.


2 Comments to "Bioinformatics, multi-core CPUs and grid computing: User perspective (2/4)"

  • Bo Peng said:

    As the author of simuPOP, I have basically given up on the MPI version of simuPOP and I am actively researching OpenMP. The reasons are:

    1. It is difficult to MPIify a scripting language (but doable).
    2. The mating process is hard to parallelize. It is doable but the performance gain is questionable.
    3. The major reason for an MPI version is simulating huge populations, but this is not a big problem with 64-bit computers, and with 4 GB of RAM as the standard (high-end) configuration for 32-bit machines.
    4. OpenMP can make better use of multi-core, multi-thread CPUs than MPI.

    We can discuss this further in the simuPOP mailing list.
