Bioinformatics, visual programming, “unemployment”

Through Public Rambling, I arrived at this piece of “I don’t know nothing about the history of informatics at all”.

Some decades ago there was this idea in informatics that visual programming languages would make programming useless, programmers jobless and empower users to do everything they needed with a computer.

The funny thing is, we are in 2007 and programming is still one of the best technical careers in most places (from the point of view of employability). The problems with jobs in programming have all to do with outsourcing and nothing to do with users replacing programmers (in fact, (programming) complexity is growing, not shrinking as informatics enters more and more areas of our existence).

People that think visual programming tools will make programmers redundant fail to grasp the basic fundamental insight about manipulating computing systems: You can put a “visual”, “easy to use” interface on top of the computing system, but, at the end of the day people using those systems will have to have in their head fundamental concepts about data structures, algorithms, etc… It doesn’t matter how do you present the system to users/programmers, at the end of the day, what matters, when manipulating computational systems, is the conceptual framework inside your head. Do you have the right concepts or not? If you have them you will program in Python, Perl, Java, Visual Basic, Your_Invented_Thingy, Visual_Programming_Stuff, etc… very easily, if not things will be difficult or even impossible.

Well to be honest how you manipulate the computational system does matter (that is why most people use scripting languages instead of C or Assembler), but history has more or less proven that visual programming environments happen to be some of the worse, less productive that exist (with some small exceptions).

There is one way to make biologists make good use of computers and that is for biologists to learn the basic CS concepts. Like they have to have the basic math/statistical concepts, and that is accepted. I don’t see that happening as most biologists (surely not all) try in as much as possible to learn the least informatics at all.

Don’t think I am ranting (and please accept my apologies for the rant format) because I see my “job” (I am a CS guy) disappear, it is exactly the opposite, while this mentality of “easy to use”, “programming redundant” is around there will be an infinite space for people like me to just produce “easy to use” tools for doing a simple task. Why? Because most people lack the conceptual framework to assemble, from existing bits and pieces, and automatically process information in any trivially novel way and that leaves a lot of space for informatics guys/gals.

The problem is that, until people in biology understand that they need to grasp fundamental computing concepts the field will progress at a much slower pace that it could. I would love to see the day where I would be jobless because people control the computational infrastructure to their needs, but for now what I see is precisely the opposite.

In a self-centered way I am only grateful to the state of things…

PS - That being said, I will come back to the article/interview because I think it really addresses interesting issues in Bioinformatics that deserve to be discussed.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics

by: tiago

2 Comments

Post Doc position: Concurrency and Parallelism in BioInformatics

Professor Cardoso e Cunha was my BSc supervisor, 10 years ago, I can only recommend, based on my previous experience with him, that potential interested individuals give really good consideration to the position below:

Post Doc Positions Open
at CITI- Centre for Informatics and IT - FCT/UNL
http://citi.di.fct.unl.pt

Job/Fellowship Reference: C2007-418-CITI-1
Concurrency and Parallelism in BioInformatics
http://www.eracareers.pt/opportunities/index.aspx?task=global&jobId=6296

Applications are sought for RESEARCHER positions. Successful applicants will engage in the development of computation models, languages and algorithms to exploit Concurrency and Parallelism in BioInformatics, involving work both in the application of computer science techniques for modeling biological, biochemistry and biomedical systems, and in the development of biologically inspired computational models.

Important dimensions of this research include the specification and verification of space-time properties of complex systems, modal logics for concurrency, process calculi, and the development of parallel and distributed computing models and algorithms for BioInformatics, enabling the processing of complex simulations with access to very large data sets. Successful candidates will join the research groups of the CITI Research Centre, and conduct joint research
in the context of interdisciplinary collaborations to exploit the relationships between Computer and Information Sciences and Life Sciences.

More information about the CITI - Centre for Informatics and IT, a research unit of the Departamento de Informatica, Universidade Nova de Lisboa, may be found in the CITI web site http://citi.di.fct.unl.pt

Any candidate must have a post-doctorate research experience of at least 3 years, be knowledgeable in Computer Science in general, preferably in some of the areas mentioned above, and in its relationships with BioInformatics. Successful candidates must have
competence particularly in fields like Concurrency Models, Parallel Algorithms, Principles of Programming Languages and Models. Applicants with a strong Computer Science background will have priority.

Please send detailed CV and two reference letters before 31 August 2007 to:

Prof. Jose C. Cunha
CITI - Centre for Informatics and IT
Faculdade de Ciencias e Tecnologia
Universidade Nova de Lisboa
2829-516 Caparica
Portugal
e-mail: jcc@di.fct.unl.pt
tel: +351 212948536
fax: +351 212948541

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics

by: tiago

1 Comment

Bioinformatics, multi-core CPUs and grid computing: developer perspective (3/4)

First, my apologies for this long overdue part on developing life sciences applications considering multi-core CPUs and grid environments.
This text is somewhat long, and most probably will be revised in the future.

Developing for multi-core CPUs and grids borrows a lot (everything really) from concurrent, parallel and distributed programming. I do not intend, at all, to cover all those issues. That would be enough to fill in several books and articles. My objective is much smaller and simpler: to discuss some basic intuitions that can be interesting for people who develop “old style” software.

IMHO, the fundamental intuition is: Stop thinking synchronously, start thinking asynchronously… “What is this guy saying?”, you ask…

From synchronous to asynchronous: An example

Lets imagine that you have an application, computationally intensive, that simply does not get any performance improvement on your shiny new quad-core processor.

That application has a function, lets call it suck_cpu that takes 5 minutes to run, and you need to call it 10 times, each run being independent from previous runs, like this (example pseudo code in Python):

#we have an array input_parameters to be feed to suck_cpu
results = []
for i in range(10):
  results.append(suck_cpu(input_parameters[i])) # Takes 5 minutes to run
#The for loop took 50 minutes to run
for i in range(10):
  print "results from execution", i, "were", results

So, you run suck_cpu and you wait (block) until you have the result. You do this 10 times. For 50 minutes.

An alternative would be to ask to system to start a “task” running suck_cpu and your main program would not wait (not block) until the function ends. You would start 10 instances of suck_cpu, not wait any time (well, wait very little) and get back to your program. What could you do after starting the tasks (and not having the results back for now)? Well, it depends, but you could cater for your user (i.e., not block the user interface), do other tasks that are independent of the results of suck cpu or just wait (while informing the user of progress) for the tasks to end.

Even in the case of just waiting for results you would profit a lot: Your task scheduler could run n tasks in parallel (n being the number of your CPU cores), effectively getting all the computing power of your CPU (the main point of these series of articles).

So, how would your code look like?

#we have an array input_parameters to be feed to suck_cpu
results = []
tasks = []
for i in range(10):
  tasks.append(schedule_tasks(suck_cpu(input_parameters[i]))) # returns "immediately"
#The for loop ran in "no time"

So you would ask the task scheduler to run your suck_cpu tasks. At the end of the for loop your code was in control and the task scheduler was busy running your tasks. You would have the task IDs back (not the results). How was the task scheduler operating? Well, that would depend on its internal workings, but your tasks would probably be in one of 3 possible states:

  1. QUEUED - The task would be ready to run, but not started yet
  2. RUNNING - The task is running
  3. DONE - The task is done and you can collect the results

You could query the task scheduler for the state of your tasks, something like this:

for task in tasks:
  print "Task", task, "is in state", get_task_state(task)

The output, at the very beginning would be something like this:

Task T1 is in state RUNNING
Task T2 is in state RUNNING
Task T3 is in state RUNNING
Task T4 is in state RUNNING
Task T5 is in state QUEUED
Task T6 is in state QUEUED
Task T7 is in state QUEUED
Task T8 is in state QUEUED
Task T9 is in state QUEUED
Task T10 is in state QUEUED

Somewhere in the middle of the execution you would probably find something like this:

Task T1 is in state RUNNING
Task T2 is in state DONE
Task T3 is in state DONE
Task T4 is in state RUNNING
Task T5 is in state RUNNING
Task T6 is in state QUEUED
Task T7 is in state RUNNING
Task T8 is in state QUEUED
Task T9 is in state QUEUED
Task T10 is in state QUEUED

Notice one very important point, task T1 was not the first to finish, although it was the first to be scheduled. Also, T6 started before T5. The fundamental issue here is that execution of concurrent stuff is non-deterministic. That is things don’t normally happen in a clear, predetermined order, there is some degree of randomness even if your code is not stochastic at all. This will play a fundamental role on how things will have to be thought and designed… We will discuss this later.

Notice how a (not very dumb) task scheduler will be using all your CPU cores (ie 4 task are running simultaneously).

Of course, at the end, you expect all tasks to be in state DONE. Now you can collect all the results (by the way, you can collect and process results from individual tasks as soon as each one is ready).

You might be thinking: “woo that is a lot of overhead… now I have to keep track of tasks, check their state and so on…”. Yes, there is some overhead with all this, furthermore you have to change the way you are used to think. But I contend that this way is more natural and more close to the way you think: Tasks that are independent were only run in sequence because the existing programming paradigm forced you to conceive then in sequence, but, in your head, you know that they are independent and you were just forcing your way of thought to be similar to to the computing paradigm.

Well, maybe I am just saying this because of what lies ahead… :( . I am now going to discuss new types of bugs you have to be aware of and non-determinism.

The dark side: resource sharing and debugging

Unfortunately this type of programming paradigm introduces some new kinds of problems that you have to be aware of. I would like to separate them in 2 types of problmems:

  1. Resource sharing: Imagine that your tasks dump your output to a printer, you don’t want them to output to the same printer at the same time (like having parts of the same sheet with results for different tasks, interleaved). Most probably your tasks, while not sharing a printer, will share data structures and access to them will have to be cleverly controlled
  2. Non-determinism: Tasks run independent of each other, task schedulers schedule tasks in unexpected ways. This means that your code will run differently every time you run it, even if it is completely deterministic by itself.

The dark side: An example

I will use here the standard, boring example that is always used in concurrent programming:

Imagine that you have a bank account, your bank account has 50€. You make a deposit of 30€, at the same time somebody cashes in a check from your account (i.e., makes a withdrawal) of 20€. So, the end balance should be 60€.

Now, imagine that the code, at your bank to do this is:

def deposit(account_no, amount):
    before_deposit_balance = get_balance(account_no)
    set_balance(account_no, before_deposit_balance + amount)
 
 
def withdrawal(account_no, amount):
    before_withdrawal_balance = get_balance(account_no)
    if before_withdrawal_balance> amount: #Money has to be there...
        set_balance(account_no, before_withdrawal_balance - amount)

You would expect that these operations run atomically, i.e., one separated from one another, that is, on your account, this would happen:

#Current balance 50 Euros
before_deposit_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount) #adding 30
#Current balance 80 Euros
before_withdrawal_balance = get_balance(account_no)
if before_withdrawal_balance> amount: #Money has to be there...
    set_balance(account_no, before_withdrawal_balance - amount) #Removing 20
#Final is 60

Another possibility would be the withdrawal to occur before the deposit (Start with 50, down to 30, final 60 again).

But, now imagine that the code runs like this (remember, the tasks might be running concurrently, and they don’t know what other tasks are doing)

#Current balance 50 Euros
before_deposit_balance = get_balance(account_no)
before_withdrawal_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount) #adding 30
if before_withdrawal_balance> amount: #Money has to be there...
    set_balance(account_no, before_withdrawal_balance - amount) #Removing 20
#Final is 30

That is, both operations read the same amount and they both get 50. Then deposit writes 50+30, making it 80. After that withdrawal writes 50-30 making it 20. You just lost 30€ to the bank!. By the way, it could happen the other way around (you would end up with 80€), as an exercise you might try to imagine how the execution trace would look like for the final result to be 80€.
This is a problem of both resource sharing (i.e., both operations are sharing the same account) and non-determinism (i.e., execution of both operations can happen in a lot of different, interleaved, ways). There are actually many ways to solve this, the most obvious is to lock the account in some way, like this:

def deposit(account_no, amount):
    lock_account(account_no) #Get exclusive access to the account
    before_deposit_balance = get_balance(account_no)
    set_balance(account_no, before_deposit_balance + amount)
    release_account(account_no) #Release exclusive access to the account
 
 
def withdrawal(account_no, amount):
    lock_account(account_no) #Get exclusive access to the account
    before_withdrawal_balance = get_balance(account_no)
    if before_withdrawal_balance> amount: #Money has to be there...
        set_balance(account_no, before_withdrawal_balance - amount)
    release_account(account_no) #Release exclusive access to the account

In this case, when an operation starts, it invokes some magic which guarantees that account access is only granted to it while the operation lasts. If another task tries to lock the same account, that other task blocks until access is possible (this means that you should be careful with locking as you might loose parallelism when locking).

By the way, this introduces problems by itself (e.g., forgetting to release a locked account).

But what drives developers mad is the fact that the execution is non-deterministic, so, you might run your concurrent code 999 times and it works perfectly, but on the 1000th run some strange thing happens (which is very difficult to replicate, as it only happens, in this case, on average 1 every 1000 times).

There would be much more to say about this topic, but I think (hope) this relays the fundamental intuitions to help you start developing programs that make use of multi-core CPUs (more so, help you easily adapt existing software).

Grids

Programming for grids is actually quite similar to multi-core programming. The fundamental issue is that memory sharing is more difficult (takes more time), while in a multi-core computer the memory is shared (so it is easier and faster to share information between concurrent tasks) in a grid memory is not shared. This means that if your tasks need to communicate a lot then, in a grid you can expect time and network bandwidth to be lost during communication between tasks, this is probably the fundamental cause that in grids many algorithms don’t scale linearly with the number of grid nodes (i.e., if you double the number of grid nodes, your performance doesn’t duplicate as you loose with communication overhead). Anyways, lots of communication means lots of horrible bugs to correct, so try to keep communication low ;) .

The final part

In the final (fourth) part of these series I will discuss a real, existing case of Python code to deal with multi-core architectures. This code will be based on examples from population genetics simulators (selection detection and coalescent). Included will be a real example of a bug that I suffered when running a selection detection program in parts, concurrently :( . This part will be ready sometime in August.

Other approaches

While there are many issues that could be discussed, many technicalities, I will stop here. In the future I might discuss such topics as state and problems caused by it in concurrent programming (ie, advantages of functional and logic programming) or language support for concurrent programming (functional programming paradigms, Erlang support, Fortress, …). If I get back to the topic it will be mainly to discuss either linguistic approaches to the issue or anything “modern” (all the rest you can easily read in very good, old, well-established literature)

Part 5 of 4

This 4 part article will have a part 5 ;) . I decided that I want to talk a bit about volunteer computing projects like BOINC.

Article too big or complex? Don’t hesitate to comment, I am pretty sure that this text will need revision and your input is more than welcome.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics, software engineering

by: tiago

5 Comments

Easy to use bioinformatics interfaces (1/2): BACA

Starting a shameless self promotion exercise, I would like to present two easy to use interfaces (with completely different target users and objectives) in the field of Bioinformatics.

The first one is BACA, a multiple mitochrondrial genome retriever, organizer and visualizer. BACA, allows the retrieval of multiple complete mitochondrial genomes. It will split all of them in features (cDNA, origins of replication, tRNA, …) creating FASTA files per genome (ie a FASTA with all components from a single genome) and type of annotation (ie a FASTA file will a certain gene from all genomes downloaded. BACA also provides an SVG visualizer, allowing to visually compare multiple genomes. Its main purpose is to help biologists download and organize mitochondrial data from GenBank, ie, its targeted at users which no nothing about scripting at all.

BACA was published on Molecular Ecology Notes. It is a (ugh!!!) Perl application.

It was developed as a toy in a Phylogenetics course in my MSc and ended up as a publication and public web service… But it is not much more than a useful toy really.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics

by: tiago

1 Comment

Bioinformatics, multi-core CPUs and grid computing: User perspective (2/4)

Although we can expect (wish?) that most bioinformatics applications in the future will support multi-core (and perhaps, grid) computing, currently most applications were not made with multi-core CPUs in mind (although some are already capable of using grids). Here we discuss three different kinds of typical scenarios that users might find. Each scenario is illustrated with at least an example application.

Single-core application to be run multiple independent times

Some applications are sometimes run multiple independent times in order to determine intervals for which certain parameters are expected to fall.

One example are population genetics’ simulators (both coalescent and forward-time) like CoaSim or simuPOP. These simulators are run thousands of times under some demographic scenario in order to determine e.g. where certain intervals for certain statistics (e.g. Fst) fall for neutral (i.e., that are not candidates for selection) markers.

The strategy here is quite simple: To divide the workload in a number of tasks that is equivalent to the number of cores available.

As an example, imagine that you want to run 10.000 population genetics’ simulations using simuPOP and you have a machine with 8 cores. You simply instruct 8 simuPOP instances to run 1.250 simulations each.

There are a few issues that require some care, though:

  1. You should make sure that output directories are different (and possibly input files also)
  2. All the instances running have really to be independent: You have to make sure that random seeds are independent. If a random seed is specified in one of the input files, then you really have to have different input files.
  3. In the end you will have to concatenate in any way all the results.

Programs that are grid-ready

Some programs, like Migrate were designed to be run in a parallel environment, normally MPI (Message Passing Interface). It is very easy to make these programs use multiple cores: Install MPI on a single machine and configure it by saying that the maximum number of processes that can be run locally is equal to the number of calls. Then mpirun your application calling it with a number of processes equal to the number of cores (normally the parameter is -np).

PS - Regarding forward-time simulators like simuPOP, some of these are MPI based, but really, they only allow to parallelize a single simulation, by eg, simulating each population on a different node. If the objective is to run a very long single simulation, the simuPOP falls under the category of grid-ready, but it the objective is to run many simulations than it becomes an example of a serial application that runs multiple independent times (ie, the previous scenario).

Other cases

In reality, in other cases where you only want to run a single instance and the program does not support MPI there is really no way around it, programs like PAUP* come to mind. It is especially bad in cases where the code is closed source, as some programs internally are really running multiple independent runs of something and could be changed to take advantage of multiple cores. Some programs implementing Maximum Likelihood approaches will probably be somewhat easy to parallelize.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics

by: tiago

2 Comments

Bioinformatics, multi-core CPUs and grid computing: Introduction (1/4)

A few years ago, one expectation was implicit when buying a new computer: that performance would increase more or less linearly with CPU clock frequency. That is, if you had a old 450 MHz computer and bought a new one of 900 MHz you would expect that the new one would have more or less twice the speed of the old one.

If you had a computationally intensive bioinformatics application, you would then expect that it would run at twice the speed. So if it took 4 weeks to complete a certain task, it would now complete in only 2 weeks. Nice!

For technical reasons (limitations is a better word), since a few years ago CPU manufacturers like Intel, AMD and IBM resort to not increasing the CPU frequency but to provide more cores. That is, when you have a dual core CPU with the same CPU frequency(*), a single task will still take the same time to complete, but you could do two tasks simultaneously and they would not steal CPU time from one another.

So:

Imagine that you have 2 tasks of four weeks each (on a single core 1 GHz computer):

On that machine the whole operation would take 8 weeks (2 tasks * 4 weeks / 1 core) weeks.

On a single core 2 GHz machine, the whole operation would take 4 weeks (2*2/1) weeks.

On a dual core 1 GHz computer the whole operation would take 4 weeks (2*4/2). But here is the catch: You would have to start them at the same time, concurrently. Because if you started one after the other it would take 8 weeks (and you would be half of your processing power - one of the cores would be doing nothing).

It could be expected that the increase in computing power, in the future will be a lot like this: multi-cores with more than one CPU, but each core speed staying more or less the same.

So, if you have a brand new machine and your computationally intensive application still takes ages to run, this might be the cause.

In a series of 4 blog posts I will discuss the consequences of this change for both users and developers. The parts will be:

  1. Introduction - You are reading it.
  2. Consequences to users - I will discuss the consequences of this paradigm shift in a user perspective. I will present some real situations (using existing bio software today) and suggest strategies to take the most performance out new hardware. Scenarios range from complete frustration (ie, only a single core can be used, thus there will be no noticeable performance gain) to total gain (ie, there will be linear gains with the number of new cores introduced).
  3. Consequences to developers: Design and concepts - I will discuss the changes that bio software developers will have to consider in order to make their applications multi-core aware and thus, use all the performance available on new machines. The key words here will be: asynchronous calling models, concurrent programming, memory sharing, message passing.
  4. Consequences to developers: One practical example - I will present a framework, in Python, to facilitate the development of multi-core aware applications.

In each post I will also discuss grid computing issues briefly, as taking advantage of grids is sometimes similar to multi-core performance gains.

As during next week I will be traveling, the next post should only surface around Monday, 15th. Please accept my apologies in advance for this delay.

(*)CPU frequency is really an erroneous simplification, but for the sake of simplicity I use it.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics

by: tiago

1 Comment