Holy Grail: The quest for THE programming language

Being a computer scientist with a strong interest in languages (languages in the broadest sense possible: programming, natural and cognition related issues), I am in an holy grail quest for a programming language that:

First and foremost allows me to express my computations in a way that is close to the problem domain (as opposed to close to the machine). As I am working in a biology setting that means being able to talk about concepts around genes, epidemics and pharmacology in my programs. I don’t want to think about CPUs, memories and things like that when I am coding. Prolog and Lisp are good examples here. I also need programs that can evolve over time as knowledge changes, I need strong metaprogramming and Domain Specific Language facilities.

Unfortunately I have a couple more requirements coming from the day to day reality…

Real world: I want a language that interacts with existing libraries and that I can easily make available to other people to use, inspect and change. I need Bio* libraries, graphics plotting libraries. I my personal case I decided that I want to work inside the JVM, so I need a language that works in the Java world (Jython, JRuby, Scala, Groovy, … Java).

Software engineering: Programs have to be easy to maintain and debug. I guess there is no way around explicit typing on the debug and tool construction front.

Ridiculous religious fanatic quest? Yes, it might be, but I am pursing it.

The truth is that we are not far away from this grail.

Scala is almost there. Lacks metaprogramming and things like type inference are a bit amateurish (compare it with CAML).

JRuby is maybe there, I could live with it, I guess. The lack of explicit typing will make things difficult in the long run on the software engineering front.

I decided to give a final try to yet another language: Groovy, and up to now it is going very OK. Seems to nail all the fundamental points. I especially love the effort on good metaprogramming facilities.

I decided, for pragmatic reasons, that after this one I will stop my pursuit for the grail. If Groovy proves a blunder of some sorts I will revert to JRuby and carry on.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics, declarative programming, groovy, metaprogramming, science, software engineering

by: tiago

6 Comments

Automated GUIs for OO models and DSLs

One of the most delightful things in bioinformatics is the possibility of working with people with really different mindsets. Surely CS geeks are amazing, and everyday I feel that my original background is really a comparative advantage, but, from where I look, nothing beats being in an environment with scientific and cultural diversity. But, lets talk some geekiness now:

A couple of years ago, I did a population genetics simulator in Caml. It was really flexible, allowing for many demographic and genomic scenarios, mating rules, selection… really flexible. I never got to try to publish it because there are many good simulators around (I suggest simuPOP, if you are looking for one) and it would take some time to make it robust and documented for public exposure. But, the interesting part is, when I went to my MSc supervisor (an “old-type” biologist) and after a very exuberant explanation on how flexible the simulator was, he added only one comment: That is all very well and good, but you did not show me the easy to use graphical interface!

Fast forward a couple of years… With regards to a DSL to model drug resistance in the context of infectious diseases that I am developing, I went to my PhD advisor (a population geneticist, malarialogist, biostatistician who knows how to program in C), showed him my rough prototype and he said: People will be able to read this, but, to interact they will want an easy to use graphical user interface. To be honest, this time, I was expecting the comment (I am living in the middle of experimentalists long enough to have learned something). I have no expectations, for my DSL, that domain specialists will write it (well, maybe a couple of them will, if things pick up). If I end up giving my system away to domain specialists, it will have to have a easy to use interface, there is no escaping from that.

Well, DSLs (at least in Scala and in Ruby) have an underlying OO model. Which, most of the times is neither complex nor big. I am starting to suspect that it won’t be too difficult to automatically generate an easy to use interface to input in a “nice” way what could be rendered as DSL programs (or object instances and relationships, if you prefer to look at it that way). For embedded DSLs, which have the whole expressive power of the host language available, that would be unfeasible to do completely. But, at least part of it could be automated. Obviously this idea is not new at all, this is just a rehash of what Lift or Rails do for databases.

I am aware that graphical programming languages never went too far (I actually dislike them), but the scope and context here are completely different, different premises apply. This might be one way of lowering the barrier to rigorous modeling to a wider crowd.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: Caml, Ruby, Scala, bioinformatics, declarative programming, science, software engineering

by: tiago

No Comments

Software engineering radio

I am currently strongly considering Scala as my main development language.

In the process of getting acquainted with the language I came across this interview with Martin Odersky, one of the creators of Scala (he was also involved in the design of Java generics - although that doesn’t count as a cookie point in favor - I will return to Java generics soon, and give a simple example why I think they suck).

Anyway, this post is not about Scala (I will surely come back to it, as I think it is a language worthwhile considering) but about Software Engineering Radio. I have only listened to a few interviews, but the general level of the guests is high. Furthermore the interviewer seems to be highly prepared and really does insightful questions and raises interesting points. This is certainly not mass media journalism. This is first class work.

If you are interested in software engineering content, check Software Engineering Radio and try to listen to podcasts of topics that might interest you.

As another outstanding example of an high quality interview is this one with Laurence Tratt on Compile-time metaprogramming. A treat for anyone interested in metaprogramming and DSLs.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: software engineering

by: tiago

1 Comment

Patents: apples and potatoes

First of all, I would like to apologize for a long time of silence. I am in a middle of a turmoil here:
Stopped working in a conservation genetics research group in Portugal, helped co-organize a conference in conservation genetics in Montana, USA (20 hours flight in each direction :( ), and tomorrow I am flying to Liverpool to start fresh new work in Malaria. In the middle I was sick with some sort of food poisoning, so…

Hopefully things will get much quieter and stable. Being able to work in poverty diseases is somewhat of a dream of mine and I am very happy with the prospects…

In the mean time, I would like to answer to this post by Deepak Singh. I think he more or less gets it all wrong ;) .

There is one a priori issue with patents (which I call “apples and potatoes”) which stems from the fact that patents are both discussed in the context of new drugs and software (in bioinformatics that happens a lot). The main problem is that they are completely different kinds of problems and, as such, one cannot “transport” (consciously or, more commonly, unconsciously) the reasoning that is done in one domain to the other. Here I will discuss software only, especially because it is the domain that I understand the best.

First, a minor point, the notion of “trade secret”. There is a simple solution to make a certain algorithm a “trade secret”: closed source. Yes, reverse engineering is possible but it is very uncommon these days, and why it is very uncommon? Because there is no such thing as “sheer genius” in creating new algorithms (and that will be my main point of disagreement).

This might sound shocking but creating new algorithms in CS is intellectually cheap. When somebody releases (closed source) software with a new cool thing, there is no need to look at the code: by just seeing the behaviour it is, in the vast majority of the times, enough to devise an algorithm (which might be or not the same) to do the same thing.

Another example: A couple of years ago I remember James Gosling (Java’s father) talking about storing the source code of a program using some sort of abstract syntax tree notation, this allows for very sophisticated things to be done in programming environments. When I read that I remembered a very clever colleague of mine having the same idea a few years ago (not to say there is prior art, for instance in Computer Associate’s Gen product). Ideas (algorithms) are cheap.

Take Google for instance: Fundamental parts of the search engine technology are just taking ideas that were not feasible before (like taking the whole database in memory) and using them in non conventional ways. Gmail? Ajax was around for long. I am not saying that the final product is not fantastic, what I am saying is that what makes it fantastic is NOT some new algorithm.

Sometimes, there are ideas that were considered terrible and reappear as fantastic (I am thinking here, for instance, of Python’s block by indenting, which existed before and was seen as dreadful).

What is really expensive/important in an application? Development time and effort is the fundamental piece. Having ideas is very cheap, developing products is very hard. The solution? Copyright. Copyright is the best way to protect the expensive development investment. [On a personal note, I give away all my code like all open source developers out there, but I am thinking on those that don’t want to do it]. There are other ways to profit from the code (other than copyright) like a service oriented approach, but I will not discuss that here.

There is another pragmatic issue: The massive number of algorithms that even a small application uses. I would be paying royalties to hundreds of companies even for my small projects. From a pragmatic standpoint that would put almost everyone out of the business. This is, by the away, another evidence that algorithms are cheap to invent, considering the many millions that are around to do all kinds of things in all kinds of ways.

Note that I didn’t even tried to frame this discussion on moral and political grounds, only on pragmatic and economic ones.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics, software engineering

by: tiago

3 Comments

Bioinformatics, multi-core CPUs and grid computing: developer perspective (3/4)

First, my apologies for this long overdue part on developing life sciences applications considering multi-core CPUs and grid environments.
This text is somewhat long, and most probably will be revised in the future.

Developing for multi-core CPUs and grids borrows a lot (everything really) from concurrent, parallel and distributed programming. I do not intend, at all, to cover all those issues. That would be enough to fill in several books and articles. My objective is much smaller and simpler: to discuss some basic intuitions that can be interesting for people who develop “old style” software.

IMHO, the fundamental intuition is: Stop thinking synchronously, start thinking asynchronously… “What is this guy saying?”, you ask…

From synchronous to asynchronous: An example

Lets imagine that you have an application, computationally intensive, that simply does not get any performance improvement on your shiny new quad-core processor.

That application has a function, lets call it suck_cpu that takes 5 minutes to run, and you need to call it 10 times, each run being independent from previous runs, like this (example pseudo code in Python):

#we have an array input_parameters to be feed to suck_cpu
results = []
for i in range(10):
  results.append(suck_cpu(input_parameters[i])) # Takes 5 minutes to run
#The for loop took 50 minutes to run
for i in range(10):
  print "results from execution", i, "were", results

So, you run suck_cpu and you wait (block) until you have the result. You do this 10 times. For 50 minutes.

An alternative would be to ask to system to start a “task” running suck_cpu and your main program would not wait (not block) until the function ends. You would start 10 instances of suck_cpu, not wait any time (well, wait very little) and get back to your program. What could you do after starting the tasks (and not having the results back for now)? Well, it depends, but you could cater for your user (i.e., not block the user interface), do other tasks that are independent of the results of suck cpu or just wait (while informing the user of progress) for the tasks to end.

Even in the case of just waiting for results you would profit a lot: Your task scheduler could run n tasks in parallel (n being the number of your CPU cores), effectively getting all the computing power of your CPU (the main point of these series of articles).

So, how would your code look like?

#we have an array input_parameters to be feed to suck_cpu
results = []
tasks = []
for i in range(10):
  tasks.append(schedule_tasks(suck_cpu(input_parameters[i]))) # returns "immediately"
#The for loop ran in "no time"

So you would ask the task scheduler to run your suck_cpu tasks. At the end of the for loop your code was in control and the task scheduler was busy running your tasks. You would have the task IDs back (not the results). How was the task scheduler operating? Well, that would depend on its internal workings, but your tasks would probably be in one of 3 possible states:

  1. QUEUED - The task would be ready to run, but not started yet
  2. RUNNING - The task is running
  3. DONE - The task is done and you can collect the results

You could query the task scheduler for the state of your tasks, something like this:

for task in tasks:
  print "Task", task, "is in state", get_task_state(task)

The output, at the very beginning would be something like this:

Task T1 is in state RUNNING
Task T2 is in state RUNNING
Task T3 is in state RUNNING
Task T4 is in state RUNNING
Task T5 is in state QUEUED
Task T6 is in state QUEUED
Task T7 is in state QUEUED
Task T8 is in state QUEUED
Task T9 is in state QUEUED
Task T10 is in state QUEUED

Somewhere in the middle of the execution you would probably find something like this:

Task T1 is in state RUNNING
Task T2 is in state DONE
Task T3 is in state DONE
Task T4 is in state RUNNING
Task T5 is in state RUNNING
Task T6 is in state QUEUED
Task T7 is in state RUNNING
Task T8 is in state QUEUED
Task T9 is in state QUEUED
Task T10 is in state QUEUED

Notice one very important point, task T1 was not the first to finish, although it was the first to be scheduled. Also, T6 started before T5. The fundamental issue here is that execution of concurrent stuff is non-deterministic. That is things don’t normally happen in a clear, predetermined order, there is some degree of randomness even if your code is not stochastic at all. This will play a fundamental role on how things will have to be thought and designed… We will discuss this later.

Notice how a (not very dumb) task scheduler will be using all your CPU cores (ie 4 task are running simultaneously).

Of course, at the end, you expect all tasks to be in state DONE. Now you can collect all the results (by the way, you can collect and process results from individual tasks as soon as each one is ready).

You might be thinking: “woo that is a lot of overhead… now I have to keep track of tasks, check their state and so on…”. Yes, there is some overhead with all this, furthermore you have to change the way you are used to think. But I contend that this way is more natural and more close to the way you think: Tasks that are independent were only run in sequence because the existing programming paradigm forced you to conceive then in sequence, but, in your head, you know that they are independent and you were just forcing your way of thought to be similar to to the computing paradigm.

Well, maybe I am just saying this because of what lies ahead… :( . I am now going to discuss new types of bugs you have to be aware of and non-determinism.

The dark side: resource sharing and debugging

Unfortunately this type of programming paradigm introduces some new kinds of problems that you have to be aware of. I would like to separate them in 2 types of problmems:

  1. Resource sharing: Imagine that your tasks dump your output to a printer, you don’t want them to output to the same printer at the same time (like having parts of the same sheet with results for different tasks, interleaved). Most probably your tasks, while not sharing a printer, will share data structures and access to them will have to be cleverly controlled
  2. Non-determinism: Tasks run independent of each other, task schedulers schedule tasks in unexpected ways. This means that your code will run differently every time you run it, even if it is completely deterministic by itself.

The dark side: An example

I will use here the standard, boring example that is always used in concurrent programming:

Imagine that you have a bank account, your bank account has 50€. You make a deposit of 30€, at the same time somebody cashes in a check from your account (i.e., makes a withdrawal) of 20€. So, the end balance should be 60€.

Now, imagine that the code, at your bank to do this is:

def deposit(account_no, amount):
    before_deposit_balance = get_balance(account_no)
    set_balance(account_no, before_deposit_balance + amount)
 
 
def withdrawal(account_no, amount):
    before_withdrawal_balance = get_balance(account_no)
    if before_withdrawal_balance> amount: #Money has to be there...
        set_balance(account_no, before_withdrawal_balance - amount)

You would expect that these operations run atomically, i.e., one separated from one another, that is, on your account, this would happen:

#Current balance 50 Euros
before_deposit_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount) #adding 30
#Current balance 80 Euros
before_withdrawal_balance = get_balance(account_no)
if before_withdrawal_balance> amount: #Money has to be there...
    set_balance(account_no, before_withdrawal_balance - amount) #Removing 20
#Final is 60

Another possibility would be the withdrawal to occur before the deposit (Start with 50, down to 30, final 60 again).

But, now imagine that the code runs like this (remember, the tasks might be running concurrently, and they don’t know what other tasks are doing)

#Current balance 50 Euros
before_deposit_balance = get_balance(account_no)
before_withdrawal_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount) #adding 30
if before_withdrawal_balance> amount: #Money has to be there...
    set_balance(account_no, before_withdrawal_balance - amount) #Removing 20
#Final is 30

That is, both operations read the same amount and they both get 50. Then deposit writes 50+30, making it 80. After that withdrawal writes 50-30 making it 20. You just lost 30€ to the bank!. By the way, it could happen the other way around (you would end up with 80€), as an exercise you might try to imagine how the execution trace would look like for the final result to be 80€.
This is a problem of both resource sharing (i.e., both operations are sharing the same account) and non-determinism (i.e., execution of both operations can happen in a lot of different, interleaved, ways). There are actually many ways to solve this, the most obvious is to lock the account in some way, like this:

def deposit(account_no, amount):
    lock_account(account_no) #Get exclusive access to the account
    before_deposit_balance = get_balance(account_no)
    set_balance(account_no, before_deposit_balance + amount)
    release_account(account_no) #Release exclusive access to the account
 
 
def withdrawal(account_no, amount):
    lock_account(account_no) #Get exclusive access to the account
    before_withdrawal_balance = get_balance(account_no)
    if before_withdrawal_balance> amount: #Money has to be there...
        set_balance(account_no, before_withdrawal_balance - amount)
    release_account(account_no) #Release exclusive access to the account

In this case, when an operation starts, it invokes some magic which guarantees that account access is only granted to it while the operation lasts. If another task tries to lock the same account, that other task blocks until access is possible (this means that you should be careful with locking as you might loose parallelism when locking).

By the way, this introduces problems by itself (e.g., forgetting to release a locked account).

But what drives developers mad is the fact that the execution is non-deterministic, so, you might run your concurrent code 999 times and it works perfectly, but on the 1000th run some strange thing happens (which is very difficult to replicate, as it only happens, in this case, on average 1 every 1000 times).

There would be much more to say about this topic, but I think (hope) this relays the fundamental intuitions to help you start developing programs that make use of multi-core CPUs (more so, help you easily adapt existing software).

Grids

Programming for grids is actually quite similar to multi-core programming. The fundamental issue is that memory sharing is more difficult (takes more time), while in a multi-core computer the memory is shared (so it is easier and faster to share information between concurrent tasks) in a grid memory is not shared. This means that if your tasks need to communicate a lot then, in a grid you can expect time and network bandwidth to be lost during communication between tasks, this is probably the fundamental cause that in grids many algorithms don’t scale linearly with the number of grid nodes (i.e., if you double the number of grid nodes, your performance doesn’t duplicate as you loose with communication overhead). Anyways, lots of communication means lots of horrible bugs to correct, so try to keep communication low ;) .

The final part

In the final (fourth) part of these series I will discuss a real, existing case of Python code to deal with multi-core architectures. This code will be based on examples from population genetics simulators (selection detection and coalescent). Included will be a real example of a bug that I suffered when running a selection detection program in parts, concurrently :( . This part will be ready sometime in August.

Other approaches

While there are many issues that could be discussed, many technicalities, I will stop here. In the future I might discuss such topics as state and problems caused by it in concurrent programming (ie, advantages of functional and logic programming) or language support for concurrent programming (functional programming paradigms, Erlang support, Fortress, …). If I get back to the topic it will be mainly to discuss either linguistic approaches to the issue or anything “modern” (all the rest you can easily read in very good, old, well-established literature)

Part 5 of 4

This 4 part article will have a part 5 ;) . I decided that I want to talk a bit about volunteer computing projects like BOINC.

Article too big or complex? Don’t hesitate to comment, I am pretty sure that this text will need revision and your input is more than welcome.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: bioinformatics, software engineering

by: tiago

4 Comments

Outsourcing and Agile Development methodologies

Being somewhat involved in outsourced development projects (note, outsource doesn’t have to mean going to somewhere very distant on the globe) I started to asked myself about the feasibility of Agile Development methodologies in outsource projects (when the outsource is made outside the client installations, sometimes - but not always - in a different country).

I would say that outsourcing tends to lend itself very well to waterfall development methodologies: Analysis in Sweden then design in Portugal then implementation in Chile, then testing and acceptance in the UK. Its easier to outsource a project (especially in different pieces to different contractors) if there is a clear distinction of phases.
If you are more agile, and have more parallel methodologies (which is clearly my personal preference), then its difficult, at best, to be able to split a task this way.
As I see it, agile methodologies don’t adapt well to this type of approach (very common these days).
Please prove I am wrong, as I would like to be!

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Facebook
  • Google
  • connotea
  • DZone
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati

Filed in: software engineering

by: tiago

No Comments