First, my apologies for this long overdue part on developing life sciences applications considering multi-core CPUs and grid environments.
This text is somewhat long, and most probably will be revised in the future.
Developing for multi-core CPUs and grids borrows a lot (everything really) from concurrent, parallel and distributed programming. I do not intend, at all, to cover all those issues. That would be enough to fill in several books and articles. My objective is much smaller and simpler: to discuss some basic intuitions that can be interesting for people who develop “old style” software.
IMHO, the fundamental intuition is: Stop thinking synchronously, start thinking asynchronously… “What is this guy saying?”, you ask…
From synchronous to asynchronous: An example
Lets imagine that you have an application, computationally intensive, that simply does not get any performance improvement on your shiny new quad-core processor.
That application has a function, lets call it suck_cpu that takes 5 minutes to run, and you need to call it 10 times, each run being independent from previous runs, like this (example pseudo code in Python):
#we have an array input_parameters to be feed to suck_cpu
results = []
for i in range(10):
results.append(suck_cpu(input_parameters[i])) # Takes 5 minutes to run
#The for loop took 50 minutes to run
for i in range(10):
print "results from execution", i, "were", results
So, you run suck_cpu and you wait (block) until you have the result. You do this 10 times. For 50 minutes.
An alternative would be to ask to system to start a “task” running suck_cpu and your main program would not wait (not block) until the function ends. You would start 10 instances of suck_cpu, not wait any time (well, wait very little) and get back to your program. What could you do after starting the tasks (and not having the results back for now)? Well, it depends, but you could cater for your user (i.e., not block the user interface), do other tasks that are independent of the results of suck cpu or just wait (while informing the user of progress) for the tasks to end.
Even in the case of just waiting for results you would profit a lot: Your task scheduler could run n tasks in parallel (n being the number of your CPU cores), effectively getting all the computing power of your CPU (the main point of these series of articles).
So, how would your code look like?
#we have an array input_parameters to be feed to suck_cpu
results = []
tasks = []
for i in range(10):
tasks.append(schedule_tasks(suck_cpu(input_parameters[i]))) # returns "immediately"
#The for loop ran in "no time"
So you would ask the task scheduler to run your suck_cpu tasks. At the end of the for loop your code was in control and the task scheduler was busy running your tasks. You would have the task IDs back (not the results). How was the task scheduler operating? Well, that would depend on its internal workings, but your tasks would probably be in one of 3 possible states:
- QUEUED - The task would be ready to run, but not started yet
- RUNNING - The task is running
- DONE - The task is done and you can collect the results
You could query the task scheduler for the state of your tasks, something like this:
for task in tasks:
print "Task", task, "is in state", get_task_state(task)
The output, at the very beginning would be something like this:
Task T1 is in state RUNNING
Task T2 is in state RUNNING
Task T3 is in state RUNNING
Task T4 is in state RUNNING
Task T5 is in state QUEUED
Task T6 is in state QUEUED
Task T7 is in state QUEUED
Task T8 is in state QUEUED
Task T9 is in state QUEUED
Task T10 is in state QUEUED
Somewhere in the middle of the execution you would probably find something like this:
Task T1 is in state RUNNING
Task T2 is in state DONE
Task T3 is in state DONE
Task T4 is in state RUNNING
Task T5 is in state RUNNING
Task T6 is in state QUEUED
Task T7 is in state RUNNING
Task T8 is in state QUEUED
Task T9 is in state QUEUED
Task T10 is in state QUEUED
Notice one very important point, task T1 was not the first to finish, although it was the first to be scheduled. Also, T6 started before T5. The fundamental issue here is that execution of concurrent stuff is non-deterministic. That is things don’t normally happen in a clear, predetermined order, there is some degree of randomness even if your code is not stochastic at all. This will play a fundamental role on how things will have to be thought and designed… We will discuss this later.
Notice how a (not very dumb) task scheduler will be using all your CPU cores (ie 4 task are running simultaneously).
Of course, at the end, you expect all tasks to be in state DONE. Now you can collect all the results (by the way, you can collect and process results from individual tasks as soon as each one is ready).
You might be thinking: “woo that is a lot of overhead… now I have to keep track of tasks, check their state and so on…”. Yes, there is some overhead with all this, furthermore you have to change the way you are used to think. But I contend that this way is more natural and more close to the way you think: Tasks that are independent were only run in sequence because the existing programming paradigm forced you to conceive then in sequence, but, in your head, you know that they are independent and you were just forcing your way of thought to be similar to to the computing paradigm.
Well, maybe I am just saying this because of what lies ahead…
. I am now going to discuss new types of bugs you have to be aware of and non-determinism.
The dark side: resource sharing and debugging
Unfortunately this type of programming paradigm introduces some new kinds of problems that you have to be aware of. I would like to separate them in 2 types of problmems:
- Resource sharing: Imagine that your tasks dump your output to a printer, you don’t want them to output to the same printer at the same time (like having parts of the same sheet with results for different tasks, interleaved). Most probably your tasks, while not sharing a printer, will share data structures and access to them will have to be cleverly controlled
- Non-determinism: Tasks run independent of each other, task schedulers schedule tasks in unexpected ways. This means that your code will run differently every time you run it, even if it is completely deterministic by itself.
The dark side: An example
I will use here the standard, boring example that is always used in concurrent programming:
Imagine that you have a bank account, your bank account has 50€. You make a deposit of 30€, at the same time somebody cashes in a check from your account (i.e., makes a withdrawal) of 20€. So, the end balance should be 60€.
Now, imagine that the code, at your bank to do this is:
def deposit(account_no, amount):
before_deposit_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount)
def withdrawal(account_no, amount):
before_withdrawal_balance = get_balance(account_no)
if before_withdrawal_balance> amount: #Money has to be there...
set_balance(account_no, before_withdrawal_balance - amount)
You would expect that these operations run atomically, i.e., one separated from one another, that is, on your account, this would happen:
#Current balance 50 Euros
before_deposit_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount) #adding 30
#Current balance 80 Euros
before_withdrawal_balance = get_balance(account_no)
if before_withdrawal_balance> amount: #Money has to be there...
set_balance(account_no, before_withdrawal_balance - amount) #Removing 20
#Final is 60
Another possibility would be the withdrawal to occur before the deposit (Start with 50, down to 30, final 60 again).
But, now imagine that the code runs like this (remember, the tasks might be running concurrently, and they don’t know what other tasks are doing)
#Current balance 50 Euros
before_deposit_balance = get_balance(account_no)
before_withdrawal_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount) #adding 30
if before_withdrawal_balance> amount: #Money has to be there...
set_balance(account_no, before_withdrawal_balance - amount) #Removing 20
#Final is 30
That is, both operations read the same amount and they both get 50. Then deposit writes 50+30, making it 80. After that withdrawal writes 50-30 making it 20. You just lost 30€ to the bank!. By the way, it could happen the other way around (you would end up with 80€), as an exercise you might try to imagine how the execution trace would look like for the final result to be 80€.
This is a problem of both resource sharing (i.e., both operations are sharing the same account) and non-determinism (i.e., execution of both operations can happen in a lot of different, interleaved, ways). There are actually many ways to solve this, the most obvious is to lock the account in some way, like this:
def deposit(account_no, amount):
lock_account(account_no) #Get exclusive access to the account
before_deposit_balance = get_balance(account_no)
set_balance(account_no, before_deposit_balance + amount)
release_account(account_no) #Release exclusive access to the account
def withdrawal(account_no, amount):
lock_account(account_no) #Get exclusive access to the account
before_withdrawal_balance = get_balance(account_no)
if before_withdrawal_balance> amount: #Money has to be there...
set_balance(account_no, before_withdrawal_balance - amount)
release_account(account_no) #Release exclusive access to the account
In this case, when an operation starts, it invokes some magic which guarantees that account access is only granted to it while the operation lasts. If another task tries to lock the same account, that other task blocks until access is possible (this means that you should be careful with locking as you might loose parallelism when locking).
By the way, this introduces problems by itself (e.g., forgetting to release a locked account).
But what drives developers mad is the fact that the execution is non-deterministic, so, you might run your concurrent code 999 times and it works perfectly, but on the 1000th run some strange thing happens (which is very difficult to replicate, as it only happens, in this case, on average 1 every 1000 times).
There would be much more to say about this topic, but I think (hope) this relays the fundamental intuitions to help you start developing programs that make use of multi-core CPUs (more so, help you easily adapt existing software).
Grids
Programming for grids is actually quite similar to multi-core programming. The fundamental issue is that memory sharing is more difficult (takes more time), while in a multi-core computer the memory is shared (so it is easier and faster to share information between concurrent tasks) in a grid memory is not shared. This means that if your tasks need to communicate a lot then, in a grid you can expect time and network bandwidth to be lost during communication between tasks, this is probably the fundamental cause that in grids many algorithms don’t scale linearly with the number of grid nodes (i.e., if you double the number of grid nodes, your performance doesn’t duplicate as you loose with communication overhead). Anyways, lots of communication means lots of horrible bugs to correct, so try to keep communication low
.
The final part
In the final (fourth) part of these series I will discuss a real, existing case of Python code to deal with multi-core architectures. This code will be based on examples from population genetics simulators (selection detection and coalescent). Included will be a real example of a bug that I suffered when running a selection detection program in parts, concurrently
. This part will be ready sometime in August.
Other approaches
While there are many issues that could be discussed, many technicalities, I will stop here. In the future I might discuss such topics as state and problems caused by it in concurrent programming (ie, advantages of functional and logic programming) or language support for concurrent programming (functional programming paradigms, Erlang support, Fortress, …). If I get back to the topic it will be mainly to discuss either linguistic approaches to the issue or anything “modern” (all the rest you can easily read in very good, old, well-established literature)
Part 5 of 4
This 4 part article will have a part 5
. I decided that I want to talk a bit about volunteer computing projects like BOINC.
Article too big or complex? Don’t hesitate to comment, I am pretty sure that this text will need revision and your input is more than welcome.
Share and Enjoy:
These icons link to social bookmarking sites where readers can share and discover new web pages.