Archives for August 2007

GUI metaprogramming example

Preamble

This is an example of metaprogramming in Jython. I would really like to have a simpler example (in plain Python or Java), but this is taken directly from what I am doing. The idiom I am using is, Pythonwise, a bit strange and old (I am using eval instead of __getattribute__) because of Jython’s limitations. This can be seen as a more advanced programming technique (if you are starting to learn programming, you might want to skip this for now, just to avoid excessive entropy in your learning process). Although this example is in Jython, it applies to many programming languages (Python, Java, Ruby, Prolog, …) but not C or C++ (or Caml, unfortunately).

The problem at hand

I am building a selection detection workbench (to detect loci under selection). At certain points in time, I need to prevent the user from entering data into a lot of entry fields, like these:

[Screenshot: Disabled fields]

As you can see, they are all disabled.

How to do this? Option 1: go to all entry fields, one by one (more than 10, and changing), and call the method setEnabled(False). That means lots of repeated code, and whenever fields change I would have to add or remove a setEnabled call.

Option 2: write a piece of code that inspects my panel object (a panel is what contains all the fields), checks which of its attributes are entry fields and disables them. The point here is writing code that operates on the code itself. In this case, if one adds a new entry field to a panel, the code automatically detects the field and disables it. How to code this?

from java.awt import Component  # binds the name Component used by isinstance below
 
def disablePanel(panel):
    attrs = dir(panel)
    for attr in attrs:
        try:
            if eval('isinstance(panel.' + attr + ', Component)'):
                eval('panel.' + attr + '.setEnabled(False)')
        except TypeError: #Some attributes are write only
            pass

A small piece…
Line 4 (the call to dir) gets the names of all attributes of the panel object.
Lines 7 and 8 (the two eval calls) do all the interesting work (eval, isinstance).
First, eval takes a string and evaluates it as a Python expression, so if you have

i = 1
i = eval('i+5')
print(i)

it will print 6. eval is very powerful (think about the possibilities of changing code at runtime). It is also quite dangerous, but I will not discuss that here…

isinstance checks to see if a certain object is an instance of a certain class, so

i = 1
print(isinstance(i, int)) # Will print True
print(isinstance(i, str)) # Will print False

So, back to our code:
if eval('isinstance(panel.' + attr + ', Component)'):
evaluates whether panel.<attribute name> is an instance of Component. For instance, my panel has an attribute called core (storing the number of cores), which is a drop-down list, so when the code checks isinstance(panel.core, Component), it evaluates to True and executes the next line, which is:

eval('panel.' + attr + '.setEnabled(False)')
This evaluates panel.<attribute name>.setEnabled(False), i.e., it disables the field. In our previous example, it does panel.core.setEnabled(False).

I will not explain the exception code as it is not important here.

So, a few lines now make it automatic to disable new entry fields, this without changing the code every time a field is added or removed (other than adding the field itself). Less code to maintain and less possibility of bugs.

I just wanted to illustrate the principle (the language used is not really important), but I need to stress a fundamental point about this particular example in Python: because of some Jython particularities I am using an old dialect to do this (Python gurus might be horrified). If you are using Python, I recommend that you look at __getattribute__ (or the built-in getattr) to replace eval.
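For readers on plain Python, here is a minimal sketch of the same idea using the built-in getattr instead of eval. It is not the code from my workbench: Component and Panel below are hypothetical stand-ins (there is no Java here), and disable_panel is just an illustrative name.

```python
class Component(object):
    """Hypothetical stand-in for java.awt.Component."""

    def __init__(self):
        self.enabled = True

    def setEnabled(self, enabled):
        self.enabled = enabled


class Panel(object):
    """Hypothetical panel with two entry fields and one plain attribute."""

    def __init__(self):
        self.core = Component()
        self.samples = Component()
        self.title = "not a component"


def disable_panel(panel):
    # dir() returns attribute names; getattr() fetches each attribute by name.
    for name in dir(panel):
        try:
            value = getattr(panel, name)
        except (AttributeError, TypeError):
            continue  # some attributes may not be readable
        if isinstance(value, Component):
            value.setEnabled(False)


panel = Panel()
disable_panel(panel)
print(panel.core.enabled, panel.samples.enabled)
```

Adding a third Component attribute to Panel would need no change to disable_panel, which is the whole point of the technique.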

Filed in: Java, Jython, Python, bioinformatics, metaprogramming

by: tiago

No Comments

Python, Ruby, Java and Threads

Greg Tyrelle made a very important comment regarding exploiting multiple cores and Python (which will surely be included in my next part on bioinformatics and multi-core computing).


First, my understanding of python threads is that they are not separate system level processes, but some kind of fake process that are python specific ? Trouble is that I see two separate process when I launch two Blast runs via threading ?

The other aspect of threading that I’m still not entirely clear about is how the global interpreter lock (GIL) fits into the picture. I get resource locking to prevent race conditions, but is the GIL also invoked each time an action that manipulates memory takes place in a thread ? I’ve heard this property of python makes it unsuitable for multi-core programming ?

I will trade formal correctness for clarity of explanation (namely, I won’t say much about the difference between a thread and a process, as it would make this too techy and confusing).

Python uses real (i.e. native) threads. Ruby uses so-called green threads, which are “fake” (simulated). Ruby 2.0 will use native threads.

So, in theory, Python is OK on multi-core architectures. In practice there is a problem, a serious one, identified by Greg: the Global Interpreter Lock (GIL). The GIL makes it impossible for more than one thread to be executing Python code at a time. When you are dealing with Python code, even if you have many threads and many cores, only one thread can be executing Python code. This is not as serious as it looks; there are four ways to live with it:

  • If you use a thread to start an external process, that process is not under the control of the GIL (it is a separate process), so it can run concurrently (think BLASTing something): it runs outside Python and can use a different core. So I think this covers one fundamental use case in bioinformatics: using external, computationally intensive programs. In fact you can start as many instances of external programs as the number of cores you have (or even more, if you think it will be advantageous). Note that the thread that calls the external application will block (well… it depends, but for simplicity let’s assume it), but your other Python threads can continue concurrently with the application.
  • This is subtle, but important: If you use CPython (the standard implementation), and you do your computationally intensive stuff in C (which makes sense – and is a common strategy – as Python is quite slow) then the C code, as long as it is not interacting with Python objects, can release the GIL and therefore make use of multiple cores. The Python code uses only one core, the C part might be using all the remaining available ones. This approach is not valid for Ruby because of the green threads issue (I am a simple Ruby newbie, so take my words with a grain of salt).
  • Now… this GIL problem (or the green threads issue in Ruby) disappears if you use Jython or JRuby, as they use the JVM native concurrency mechanisms which have no notion of acquiring an exclusive lock for execution. By the way you can also use JVM based interpreters to call native (non-JVM) applications (think BLAST again, from inside Java). To put this point in another way: the GIL/green threads problem is not a language limitation, it is a limitation of the standard (C based) implementations that other implementations might not share (and the Java implementations, in fact, DO NOT).
  • If you think about grids (and not multiple cores) then the problem disappears as we are then talking of different processes (even more, running on different hardware).
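The first option (threads that launch external programs) can be sketched as below. This is only an illustration under some assumptions: the external program is a second Python interpreter doing some arithmetic, standing in for something like BLAST, and run_external is a made-up name.

```python
import subprocess
import sys
import threading


def run_external(job_id, results):
    # Stand-in for an external tool such as BLAST: a separate process,
    # so the GIL of this interpreter does not constrain it.
    cmd = [sys.executable, "-c", "print(sum(i * i for i in range(100000)))"]
    output = subprocess.check_output(cmd)
    results[job_id] = output.decode().strip()


results = {}
threads = [
    threading.Thread(target=run_external, args=(job_id, results))
    for job_id in range(3)
]
for thread in threads:
    thread.start()  # each thread blocks on its own external process
for thread in threads:
    thread.join()   # wait until all external processes have finished

print(sorted(results))
```

Each thread waits for its own process, but the three processes themselves are free to run on different cores.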

I am afraid of being too techy with this post (I am probably labeled as 100% computer nerd by now ;) ), but I think Greg’s point is fundamental and required some discussion.

In my defense ;) I would like to say that I am only writing so much about programming because I am in some sort of unclear professional phase. As soon as things get back on track I want to focus more on the biological part of things… Until then I will be writing about the subject that I know best, and that is, for better or worse, informatics…

Comments, especially constructive criticism, are, as always, welcome…

Filed in: Java, Jython, Python, Ruby, bioinformatics

by: tiago

3 Comments

Thanks

I suppose that it is at the beginning of something that people need the most encouragement. When things have been rolling for a long time everything is easier: one has built confidence, has developed the routines and procedures, is inside a community…

Perfect Storm is just beginning and the feedback that I have received, in many forms, is really encouraging.

I would like to thank (in no particular order) Alexei Drummond, Pedro Beltrão, Deepak Singh, Neil Saunders, Animesh Sharma, Michael Barton and Richard Apodaca.

Thanks for all your comments, big or small. They are really a source of motivation.

And I would like to apologize to anyone that I might have forgotten.

Filed in: uncategorized

by: tiago

1 Comment

Reusing results from (intensive) computations

Still inspired by the Depth-First post The Best API May Be No API At All: PubChem and PDB, I would like to suggest a trick for reusing the results of computationally intensive work.

The concepts presented here are object serialization and persistence. The practical case will be presented in Python, but it also works in Java and most probably in other modern languages like Ruby.

Example scenario: Imagine that you go through all available PDB files (I think there are around 40,000) and record the minimum distance between each iron that might exist near a protein and the protein itself. You might end up with a dictionary whose key is the PDB ID (like 1FZY) and whose value is a list with the minimum distance, in Angstroms, of each iron to the protein. From this dictionary you might do all sorts of things, like see which protein has the smallest distance to an iron, compute the average distance of all irons to proteins, or analyze by enzyme type (by the way, you might also construct a table where, for each enzyme, you record the type).
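To make the data structure concrete, here is a small sketch of such a dictionary and two of the statistics mentioned above. The distances are invented for illustration, and the IDs 0XX1 and 0XX2 are placeholders, not real PDB entries.

```python
# Made-up example data: PDB ID -> minimum distance (in Angstroms)
# of each iron to the protein. An empty list means "no irons".
iron_distances = {
    "1FZY": [2.0, 3.5],
    "0XX1": [1.5],
    "0XX2": [],
}

# Entry with the smallest iron-to-protein distance (ignoring entries without irons).
with_irons = dict((pdb_id, d) for pdb_id, d in iron_distances.items() if d)
closest_id = min(with_irons, key=lambda pdb_id: min(with_irons[pdb_id]))

# Average distance over all irons in all entries.
all_distances = [d for distances in iron_distances.values() for d in distances]
average_distance = sum(all_distances) / len(all_distances)

print(closest_id, average_distance)
```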

The first strategy might be to:

parse_and_process_all_PDB_files_gathering_required_information #Takes hours
#Now you have:
#enzyme_types  a dictionary that records, for each PDB entry,
#                    the enzyme type (if it is an enzyme)
#iron_distances a dictionary that holds, for each PDB entry,
#                    the minimum distance for each iron to the protein
compute_all_sorts_of_interesting_statistics # Rather fast

This has a big drawback: every time you want to compute new statistics you have to repeat the whole process, which takes hours. Furthermore, if you are not using a local copy of the PDB database, you will depend on the network and put stress on the servers on the other side.

I would like to propose an alternative: have 2 programs:

parse_and_process_all_PDB_files_gathering_required_information #Takes hours
save_to_disk(enzyme_types, iron_distances)

and

enzyme_types, iron_distances = load_from_disk()
compute_all_sorts_of_interesting_statistics # Rather fast

After parsing and processing the raw data (and maybe fetching it from servers), you would save the results to disk. Whenever you wanted to compute new statistics you would load the processed data from disk and do the computations without repeating the parsing and processing.

That is, you would only run the parsing and processing phase once (or whenever you need new raw information). In other words, the time-consuming part would run only once, or very rarely.

How to do this? Complicated you think? Not at all… You just have to like pickles (in Python)…

Pickles as a healthy diet component in your Python programming

We will make use of the pickle module.

From the module documentation:

The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” or “flattening”, however, to avoid confusion, the terms used here are “pickling” and “unpickling”.

OK, now continuing our example, how would we do it?

Saving enzyme_types and iron_distances is as simple as:

import pickle
 
#You do your stuff and create enzyme_types and iron_distances
 
pickle_file = open('data.pkl', 'wb')
pickle.dump(enzyme_types, pickle_file)
pickle.dump(iron_distances, pickle_file)
pickle_file.close()

That is it; it is as easy as this (you can go ahead and open data.pkl with a text editor if you are so inclined).

To load the data? It is just as simple:

import pickle
 
pickle_file = open('data.pkl', 'rb')
enzyme_types = pickle.load(pickle_file)
iron_distances = pickle.load(pickle_file)
pickle_file.close()
 
#Compute statistics

Just one point: if you, for some reason, just want to load iron_distances (which was saved after enzyme_types), you still have to load enzyme_types first (so that pickle consumes it, as it was saved before). On the other hand, if you just want to load enzyme_types, then you can ignore iron_distances, as it was saved after.
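The ordering point can be checked with a short sketch. The data is invented and the files go to a temporary directory; the second half shows one possible alternative (an assumption of mine, not a requirement of pickle): saving a single dictionary that bundles both structures, so that load order stops mattering.

```python
import os
import pickle
import tempfile

# Invented example data.
enzyme_types = {"1FZY": "some enzyme type"}
iron_distances = {"1FZY": [2.0, 3.5]}

work_dir = tempfile.mkdtemp()

# Two consecutive dumps: loads come back in the same order as the dumps.
path = os.path.join(work_dir, "data.pkl")
with open(path, "wb") as pickle_file:
    pickle.dump(enzyme_types, pickle_file)
    pickle.dump(iron_distances, pickle_file)

with open(path, "rb") as pickle_file:
    first = pickle.load(pickle_file)   # enzyme_types (dumped first)
    second = pickle.load(pickle_file)  # iron_distances (dumped second)

# Alternative: one dump of a dictionary holding both structures.
bundle_path = os.path.join(work_dir, "bundle.pkl")
with open(bundle_path, "wb") as pickle_file:
    pickle.dump(
        {"enzyme_types": enzyme_types, "iron_distances": iron_distances},
        pickle_file,
    )

with open(bundle_path, "rb") as pickle_file:
    only_distances = pickle.load(pickle_file)["iron_distances"]
```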

Caveat: you need to create well-designed intermediate structures, so that you do not have to rerun the parsing and processing phase every time just to create new intermediate structures (from my experience this is not that hard, even for people with little programming experience).
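The two programs can also be folded into a single helper that computes the results only when no saved copy exists. This is a sketch under assumptions: load_or_compute and slow_parse are hypothetical names, and slow_parse merely pretends to be the hours-long parsing phase.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Load a pickled result from path, or compute and save it first."""
    if os.path.exists(path):
        with open(path, "rb") as pickle_file:
            return pickle.load(pickle_file)
    result = compute()
    with open(path, "wb") as pickle_file:
        pickle.dump(result, pickle_file)
    return result


calls = []


def slow_parse():
    # Pretend this takes hours; the returned data is invented.
    calls.append("parsed")
    return {"1FZY": [2.0, 3.5]}


cache_path = os.path.join(tempfile.mkdtemp(), "iron_distances.pkl")
first_run = load_or_compute(cache_path, slow_parse)   # computes and saves
second_run = load_or_compute(cache_path, slow_parse)  # loads from disk
```

To force a fresh parse (for example when new raw data arrives), deleting the pickle file is enough.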

Filed in: Python, bioinformatics

by: tiago

3 Comments

Comments to Alexei Drummond’s interview on Blind.Scientist

After a somewhat “rantish” response to Alexei Drummond’s interview on Blind.Scientist, I have put together a few more well-tempered comments on the interview. The reason I reacted so fast (and so unwisely) is that most of the content bears directly on what I am currently doing.

So here are my comments to the interview, commenting each point that interests me:

When biologists start asking about where they can learn to program a computer, just so they can do their job you know something is wrong!

This line of thinking seems to be highly pervasive among researchers in Biology/Computational Biology/Bioinformatics. This is my main point of disagreement. I do think that in this brave new world everybody will have to know how to do basic scripting. I am not talking about building an industrial-strength application, just basic data moving and processing. Just as maths became a fundamental tool in the 20th century, basic programming will become one too, especially as more data becomes available and lab work becomes more automated and faster/easier to do.

Firstly, software development isn’t science.

With this I completely agree, although I suppose the author is not referring to some of the algorithms that underlie the application (like an alignment algorithm). But just building an application is not science; it is enabling science, which is quite different.

Secondly, most academic programmers are not interested in (or good at) designing user interfaces, and certainly developing software is not a scientific outcome that gets recognized like publishing papers does.

Designing a good user interface is surely not something I think should be required of scientists when I say that basic scripting is becoming a requirement. It is interesting to note that one can publish papers on applications (program notes, application notes, …), so one can build a “scientific” CV with applications. Whether it makes sense to use the same reward system for applications and research papers is a completely different issue, but for now one can get publication entries on the CV with applications.

Thirdly, academics are quite bad at supporting software and documenting it.

Most are bad at developing software in the first place. Using publication as a reward for an application might make sense (if at all) after an application is well established, but I doubt it makes much sense at the beginning of the application’s life cycle. Call me cynical, but in this publish or perish culture, putting the reward at the beginning of the life of a product is a strong invitation not to support it at all (as the main reward has already been obtained).

So it seemed to me, that for a lot of reasons, a professional software company was the best avenue to realize a software system that would dramatically improve the productivity of molecular biologists by putting bioinformatics at their fingertips.

Makes full sense, but the idea that all of the programming effort can be taken out of the hands of scientists seems exaggerated to me. My main line of reasoning is that most science is a creative process, not a factory process, and some of that creativity cannot be foreseen by application developers. So some “tweaking” will be needed by the final user (even in less creative professions, word processors and spreadsheets sometimes have to be programmed), and that tweaking is really something like “script programming”.

Java is a general-purpose programming language — so you can do in Java pretty much anything you can do in software. The main reason for choosing Java is that it is very easy to write sophisticated user interfaces that run on Windows, Linux and Mac OS X.

I currently use the same line of reasoning when developing software. I am working on a selection detection application that runs on the JVM. Note that I say JVM and not Java. One can get the good things of the JVM (portability of libraries, especially Swing and AWT) using other languages that run on it; Jython and JRuby come to mind.

While I do subscribe to the JVM almost completely, I have some doubts about Java. For small applications it is clearly an over-engineered language. Even for big applications, although I think some of Java’s features are good (like explicit typing), there is room for extensibility to be provided by scripting languages like Jython/JRuby (MODELER4SIMCOAL2 works just like that).

For biologists beginning programming (another issue), I would surely not start by teaching Java, because its excessive verbosity and the difficulty of getting “simple things done” put off a lot of people. Furthermore, the learning curve is steep. Python would be my clear suggestion on this front.

Our goal is a happy marriage where academic programmers can get on with developing great new algorithms, and Geneious can provide the interoperability, the user interface and the support.

One of the best ideas I have read in a long time. There is a big difference between thinking up an algorithm and developing an industrial-strength, easy to use (and, I would like to add, easy to script and extend) application, plus maintaining and supporting it. The reward that makes most sense to me for the algorithm is the publication; the reward for the application should be money, to put it in simple terms. The bridging between the two sides of the equation can (and should) be done in the way Alexei proposes.

Filed in: bioinformatics, science

by: tiago

5 Comments

PDB, accessing data, APIs

I was reading Depth-First‘s article The Best API May Be No API At All: PubChem and PDB and decided to relate here my experience of helping a colleague process PDB data.

To begin with, that person wanted to bulk analyze many (thousands of) PDB files; furthermore, she only knew Java, and very little of it.

My suggestion was:

  1. Download all PDB files from PDB using ftp. “All” being the keyword here.
  2. Use Python
  3. Parse the files yourself (i.e., don’t use Biopython’s Bio.PDB)

This goes in line with the idea of the “best API being no API at all” (I am not suggesting this generally, but in this case it made sense).

I suppose some justification for so many counterintuitive suggestions is in order…

For point 1: The person really wanted to analyze a lot of files in bulk, so it made sense just to download them all. As far as I remember we are talking of less than 10GB. I suspect that even when we only want to use hundreds or a few thousand PDB files this might make sense: a 10GB download is not that much nowadays, and it takes neither much disk space nor much bandwidth. Regarding being friendly to RCSB, I ask what is worse for them: one big download, or many queries using CPU, databases, etc.? Users can then query locally, and if you look at the PDB format, a few pipes of greps can go a long way and give a lot of flexibility.

For point 2: I would like to stress that the person knew very little Java. I contend that learning Python (with a smoother learning curve than Java) takes less time and is less frustrating (at least for users that are concerned only with results and not with the “joy of programming”) than learning the rest of Java plus the required system and Bio libraries (remember, Java libraries are much tougher and more over-engineered than Python’s).

For point 3: The PDB file format is reasonably easy. Between learning a new API (which is not free and requires understanding the API developers’ mindset) and processing the files manually, I suggested processing the files manually. This had the added benefit of making the person learn simple and very useful file processing. Please note that I am not suggesting reinventing the wheel (in fact I tend to be strongly opposed to that). But with such easy file processing it seemed to make sense. I would like to say, in my defense ;) , that I suggested using the wonderful matplotlib for chart drawing, and it never crossed my mind to suggest implementing a chart library from scratch.
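To give an idea of how little code “parsing the files yourself” takes, here is a minimal sketch that reads a coordinate record by its fixed columns. It is not my colleague’s code: parse_atom_line and distance are hypothetical names, the sample record is made up, and only the fields needed for distance calculations are extracted.

```python
import math

# A made-up HETATM record, assembled field by field so that the fixed
# column widths of the PDB format are visible.
SAMPLE_LINE = (
    "HETATM"      # columns 1-6: record name
    "  999"       # columns 7-11: atom serial number
    " "           # column 12: unused
    "FE  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "HEM"         # columns 18-20: residue name
    " "           # column 21: unused
    "A"           # column 22: chain identifier
    " 154"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code plus padding
    "  14.500"    # columns 31-38: x coordinate
    "  28.300"    # columns 39-46: y coordinate
    "   4.900"    # columns 47-54: z coordinate
    "  1.00"      # columns 55-60: occupancy
    " 12.34"      # columns 61-66: temperature factor
    "          "  # columns 67-76: padding
    "FE"          # columns 77-78: element symbol
)


def parse_atom_line(line):
    """Return a dict for an ATOM/HETATM line, or None for other records."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        return None
    return {
        "record": line[0:6].strip(),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21:22],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in Angstroms."""
    return math.sqrt(
        (atom_a["x"] - atom_b["x"]) ** 2
        + (atom_a["y"] - atom_b["y"]) ** 2
        + (atom_a["z"] - atom_b["z"]) ** 2
    )


iron = parse_atom_line(SAMPLE_LINE)
print(iron["element"], iron["res_name"], iron["x"], iron["y"], iron["z"])
```

Looping over the lines of a file and keeping the non-None results gives a list of atoms, from which something like the minimum iron-to-protein distance is a couple of loops away.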

So, sometimes, not using an existing API might be an approach worth considering.

PS – I still stand by my suggestions. Currently the person seems to have lots of questions about the chemistry of the problem. The programming problems are very rare. And I think that is the main point, computing and programming should not be the fundamental issue.

Filed in: Java, Python, bioinformatics, chemistry

by: tiago

No Comments

Easy to use bioinformatics interfaces (2/2): MODELER4SIMCOAL2

In yet another shameless promotion exercise, I would like to present an easy to use interface in the area of coalescent simulation:

MODELER4SIMCOAL2

modeler4simcoal2 (m4s2) is a modeler for coalescent processes. It allows the modeling of both demographies and chromosomes (i.e., markers with linkage relationships in multiple chromosome blocks).

m4s2 is a Java Web Start application (requiring Java 1.4, available for Windows, Mac and Linux among others). It requires no installation and can be run directly from the web.

The purpose of m4s2 is to allow biologists to concentrate more on biology and the underlying models used in analysis (and less on having to learn new computer simulation tools). We expect that m4s2 will lower the barrier to coalescent simulator use.

m4s2 was published in Bioinformatics.

Filed in: Java, Jython, Python, bioinformatics

by: tiago

No Comments

Bioinformatics, visual programming, “unemployment”

Through Public Rambling, I arrived at this piece of “I don’t know nothing about the history of informatics at all”.

Some decades ago there was this idea in informatics that visual programming languages would make programming useless, programmers jobless and empower users to do everything they needed with a computer.

The funny thing is, we are in 2007 and programming is still one of the best technical careers in most places (from the point of view of employability). The problems with jobs in programming have everything to do with outsourcing and nothing to do with users replacing programmers (in fact, programming complexity is growing, not shrinking, as informatics enters more and more areas of our existence).

People who think visual programming tools will make programmers redundant fail to grasp the fundamental insight about manipulating computing systems: you can put a “visual”, “easy to use” interface on top of the computing system, but at the end of the day people using those systems will have to have in their heads fundamental concepts about data structures, algorithms, etc… It doesn’t matter how you present the system to users/programmers; what matters, when manipulating computational systems, is the conceptual framework inside your head. Do you have the right concepts or not? If you have them, you will program in Python, Perl, Java, Visual Basic, Your_Invented_Thingy, Visual_Programming_Stuff, etc… very easily; if not, things will be difficult or even impossible.

Well, to be honest, how you manipulate the computational system does matter (that is why most people use scripting languages instead of C or Assembler), but history has more or less proven that visual programming environments happen to be some of the worst, least productive that exist (with some small exceptions).

There is one way to make biologists make good use of computers, and that is for biologists to learn the basic CS concepts, just as they have to have the basic math/statistical concepts, which is accepted. I don’t see that happening, as most biologists (surely not all) try as much as possible to learn as little informatics as they can.

Don’t think I am ranting (and please accept my apologies for the rant format) because I see my “job” (I am a CS guy) disappearing. It is exactly the opposite: while this mentality of “easy to use”, “programming redundant” is around, there will be infinite room for people like me to just produce “easy to use” tools for doing a simple task. Why? Because most people lack the conceptual framework to assemble existing bits and pieces and automatically process information in any even trivially novel way, and that leaves a lot of room for informatics guys/gals.

The problem is that, until people in biology understand that they need to grasp fundamental computing concepts, the field will progress at a much slower pace than it could. I would love to see the day when I am jobless because people control the computational infrastructure according to their needs, but for now what I see is precisely the opposite.

In a self-centered way I am only grateful to the state of things…

PS – That being said, I will come back to the article/interview because I think it really addresses interesting issues in Bioinformatics that deserve to be discussed.

Filed in: bioinformatics

by: tiago

2 Comments

Blind review process

I am currently reviewing five papers for a conference which uses blind review (I will blind you of the conference name ;) ).

On 2 papers, the author forgot to remove his ID.

On 2 papers, googling for necessary content to perform the review trivially identifies the authors (like, there is only one result for a certain query on scholar).

On 1 paper, there is a link to a personal homepage where you can find datasets related to the results.

Filed in: science

by: tiago

No Comments

Post Doc position: Concurrency and Parallelism in BioInformatics

Professor Cardoso e Cunha was my BSc supervisor 10 years ago. Based on my experience with him, I can only recommend that potentially interested individuals give serious consideration to the position below:

Post Doc Positions Open
at CITI- Centre for Informatics and IT – FCT/UNL

http://citi.di.fct.unl.pt

Job/Fellowship Reference: C2007-418-CITI-1
Concurrency and Parallelism in BioInformatics

http://www.eracareers.pt/opportunities/index.aspx?task=global&jobId=6296

Applications are sought for RESEARCHER positions. Successful applicants will engage in the development of computation models, languages and algorithms to exploit Concurrency and Parallelism in BioInformatics, involving work both in the application of computer science techniques for modeling biological, biochemistry and biomedical systems, and in the development of biologically inspired computational models.

Important dimensions of this research include the specification and verification of space-time properties of complex systems, modal logics for concurrency, process calculi, and the development of parallel and distributed computing models and algorithms for BioInformatics, enabling the processing of complex simulations with access to very large data sets. Successful candidates will join the research groups of the CITI Research Centre, and conduct joint research in the context of interdisciplinary collaborations to exploit the relationships between Computer and Information Sciences and Life Sciences.

More information about the CITI – Centre for Informatics and IT, a research unit of the Departamento de Informatica, Universidade Nova de Lisboa, may be found in the CITI web site http://citi.di.fct.unl.pt

Any candidate must have a post-doctorate research experience of at least 3 years, be knowledgeable in Computer Science in general, preferably in some of the areas mentioned above, and in its relationships with BioInformatics. Successful candidates must have competence particularly in fields like Concurrency Models, Parallel Algorithms, Principles of Programming Languages and Models. Applicants with a strong Computer Science background will have priority.

Please send detailed CV and two reference letters before 31 August 2007 to:

Prof. Jose C. Cunha
CITI – Centre for Informatics and IT
Faculdade de Ciencias e Tecnologia
Universidade Nova de Lisboa
2829-516 Caparica
Portugal
e-mail: jcc@di.fct.unl.pt
tel: +351 212948536
fax: +351 212948541

Filed in: bioinformatics

by: tiago

1 Comment