Posts tagged ‘python’

This post is intended to help people trying to produce Python libraries and applications for 64-bit Windows (i.e. native 64-bit binaries, not just 32-bit applications running on 64-bit Windows). It reflects what I learned while trying to compile and run Biopython. Corrections and comments are most welcome. Nothing written here is “rocket science”, but it took me quite a while to gather this information, so it might make your life easier too.

Self-imposed requirement: To be able to compile with free compilers (preferably free as in speech, but free as in beer will do).

Point 1: Python does not seem to work with the free 64-bit compiler, mingw-w64. I did not try to compile Python itself, but I did try to compile Biopython with mingw-w64 and the Python headers were problematic (plus, if you go to the Python bug database you will find references to not being able to use mingw-w64).

Point 2: Visual Studio Express (the free-as-in-beer compiler from Microsoft) does not natively target 64-bit architectures. Fear not: if you download the Windows SDK, you get a 64-bit compiler. As far as I can see, this is the best current solution. The easiest way to compile (at least from the command line) is by using the Windows SDK Command Prompt (available in the Microsoft Windows SDK menu). Have a look at setenv (setenv.cmd), which allows you to set quite a few things about the target architecture. I am using VS2010 Express. In theory distutils only supports up to 2008 (v9), but I had no problems other than having to add a /MANIFEST flag to the linker.

Point 3: If the code you are compiling depends on external libraries, then your mileage might vary (a lot). There are 4 options: (i) the dependencies already have 64-bit versions (like NumPy, which I needed for Biopython); (ii) there are no 64-bit versions, but they are easy for you to compile; (iii) there are no 64-bit versions and it is complex work to generate one. In case (iii), well, case lost, back to the 32-bit version. You might have noticed that I mentioned FOUR options. Well, (iv) is the fantastic porting effort done by Christoph Gohlke: if you need any 64-bit Python library, check his porting page, which has truckloads of stuff (compiled using Visual Studio professional editions).

So: Use VS Express and add the MS Windows SDK. setenv is your friend (and the Windows SDK command prompt, which is cmd with setenv, really).
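
Before building anything, it is worth confirming that the interpreter you are compiling against is itself a native 64-bit binary (a 32-bit Python runs happily on 64-bit Windows). A minimal sketch using only the standard library; the function names are mine, for illustration:

```python
import platform
import struct
import sys


def pointer_bits():
    """Return the pointer size of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


def is_64bit_python():
    """True when the interpreter itself is a 64-bit binary."""
    return sys.maxsize > 2**32


print("Pointer size: %d bits" % pointer_bits())
print("64-bit interpreter: %s" % is_64bit_python())
# The machine type describes the hardware/OS, not the interpreter build.
print("Machine reported by the OS: %s" % platform.machine())
```

If the first two lines report 32 bits on a 64-bit machine, any extension you compile has to target 32 bits as well.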

As a final note, while I do grok pragmatism, I find it less than desirable (to be euphemistic) that Python does not support the existing FREE (as in speech) 64-bit compiler on Windows.

Python’s matplotlib rocks. It is, by itself, the fundamental reason I stay with Python (JVM-based JFreeChart is powerful but – as most JVM libraries – an over-engineered, complex piece of software).

Doing multiple charts in matplotlib is not trivial, though. Well, it is trivial, but the end result is not perfect. Here I will present a step-by-step guide to creating better subplots. “Better” is, of course, a matter of taste, therefore every change will be presented separately, so that you can pick and choose whichever you prefer.

We start with this (as with all charts presented here, you will need to click to enlarge):

OK, the space between charts is wasted. Also the outside border is too big. Outside borders are less of a problem with a single chart, but when you have lots of stuff to present, then all space that you can get is important. So, we add this:

fig = pylab.figure()
fig.subplots_adjust(hspace=0.0001, wspace=0.0001,
                    bottom=0.07, top=0.96,
                    left=0.04, right=0.96)

The fig variable might have been initialized before, so use your own. Notice that subplots_adjust allows you to control both external borders (left, right, bottom, top) and space between charts (hspace, wspace). The result is this:

Before we discuss what obviously looks bad, we need to discuss something that looks good but is, in fact, problematic: The fonts. The fonts look of a proper size and (at least some of them) appear properly positioned. This apparent “goodness” is dependent on figure size and resolution (check the parameters of figure). If you develop this on screen and then want to generate an eps version at the end (say, to publish a scientific paper) you might be going for a rough ride and just detect the problem way later (when you generate the final, production version)! Suggestion: Try adjusting the figure parameters from the start (use both pylab.show and pylab.savefig – png and eps – to test the results). Here I will not spend time with more details about this (the same applies to line thickness, by the way).
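
As a rough illustration of fixing the figure parameters from the start, the sketch below sets the size and resolution up front and writes both a png and an eps file. The size, the dpi, the file names and the plotted data are placeholders of mine, and it uses matplotlib.pyplot with the non-interactive Agg backend so that it also runs without a display:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

# Assumed target values: decide on these before tuning fonts and lines.
WIDTH_IN, HEIGHT_IN, DPI = 8.0, 6.0, 100

fig = plt.figure(figsize=(WIDTH_IN, HEIGHT_IN), dpi=DPI)
ax = fig.add_subplot(1, 1, 1)
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

# Save to both formats and compare them with what pylab.show displays.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "test.png")
eps_path = os.path.join(out_dir, "test.eps")
fig.savefig(png_path, dpi=DPI)
fig.savefig(eps_path)
print(png_path)
print(eps_path)
```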

OK, irrespective of the point above, the figure clearly has other problems. I opt to remove all titles (just remove the title lines) and remove the unnecessary X and Y axis labels. Note that we have 2 Y axes (one on the left, and one on the right). So, we apply the following code:

# We assume that sp comes from
# sp = pylab.subplot()...
 
# You should have this line if you have an extra Y-axis
# Ignore this if you do not have one (most cases won't have one)
par1 = sp.twinx()
 
# Apply this line to all subplots NOT on the extreme left
pylab.setp(sp.get_yticklabels(), visible=False)
 
# Apply this line to all subplots NOT on the bottom
pylab.setp(sp.get_xticklabels(), visible=False)
 
# Apply this line to all subplots NOT on the extreme right.
# Only the tick labels are hidden: hiding par1 itself would also
# hide anything plotted against the extra Y-axis.
pylab.setp(par1.get_yticklabels(), visible=False)
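
To show where each rule applies, here is a sketch that walks a grid of subplots and decides from the row and column which tick labels to hide. The grid shape and the plotted data are made up, it uses matplotlib.pyplot instead of pylab, and it assumes the intent on the extra Y-axis is to hide only its tick labels:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

ROWS, COLS = 2, 3  # assumed grid shape

fig = plt.figure()
fig.subplots_adjust(hspace=0.0001, wspace=0.0001)

axes = {}  # (row, col) -> (left-hand axes, twin axes with the right Y-axis)
for row in range(ROWS):
    for col in range(COLS):
        sp = fig.add_subplot(ROWS, COLS, row * COLS + col + 1)
        par1 = sp.twinx()
        sp.plot([0, 50, 100], [0.1, 0.5, 0.9])
        par1.plot([0, 50, 100], [1, 2, 3], "r")
        if col != 0:           # not on the extreme left
            plt.setp(sp.get_yticklabels(), visible=False)
        if row != ROWS - 1:    # not on the bottom
            plt.setp(sp.get_xticklabels(), visible=False)
        if col != COLS - 1:    # not on the extreme right
            plt.setp(par1.get_yticklabels(), visible=False)
        axes[(row, col)] = (sp, par1)
```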

Remember that we also removed the titles, so the result is:

Before we re-instate the titles (and put some legends), notice that the Y-axis labels slightly overlap from one chart to the other (this also happens with the X-axis labels, but it is less visible). For this we will redo the labels:

#if left and not bottom:
    pylab.yticks((0,.2,.4,.6,.8,1),("","0.2","0.4","0.6", "0.8","1.0"))
...
#if right and not bottom:
    pylab.yticks((0,0.5,1,1.5,2,2.5,3,3.5,4,),
                 ("", "0.5", "1", "1.5", "2", "2.5" , "3", "3.5", "4"))
#if bottom and not left:
    pylab.xticks((0,20,40,60,80,100),("","20","40","60", "80","100"))

Now, you might be confused: lines 2 and 5 of the snippet both call pylab.yticks. Well, line 2 is called before the creation of the second axis, line 5 after. There might be a more general way of doing this, but I was too lazy to try and discover it. Result:
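
One possible generalization of the hand-written label tuples is to derive the labels from the tick positions and blank only the first one. This is a sketch, not what was used for the charts here; the helper name and the format strings are my own:

```python
def labels_without_first(ticks, fmt="%g"):
    """Return one label per tick position, with the first label left blank.

    Blanking the first label keeps it from colliding with the last label
    of the neighbouring chart when the subplots touch each other.
    """
    ticks = list(ticks)
    if not ticks:
        return []
    return [""] + [fmt % t for t in ticks[1:]]


y_ticks = (0, .2, .4, .6, .8, 1)
x_ticks = (0, 20, 40, 60, 80, 100)
print(labels_without_first(y_ticks, "%.1f"))
print(labels_without_first(x_ticks, "%d"))
```

The result can be passed straight to the tick functions, e.g. pylab.yticks(y_ticks, labels_without_first(y_ticks, "%.1f")).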

OK, now titles and axis labels. I will do this in a non-standard way: by hard-coding everything! You can, if you prefer, re-activate the calls to .set_xlabel and .title in the right positions. In any case, the following example might serve to illustrate fig.text:

fig.text(0.5, 0.01, "Generations", ha="center", va="bottom", size="medium")
 
fig.text(0.20, 0.99, "Full epistasis", ha="center", va="top", size="medium")
fig.text(0.52, 0.99, "Mixed mode", ha="center", va="top", size="medium")
fig.text(0.83, 0.99, "DGF", ha="center", va="top", size="medium")
 
fig.text(0.015, 0.52, "Fst / LD(r)", ha="right", va="center", size="medium", rotation="vertical")
fig.text(0.99, 0.52, "Ratio", ha="right", va="center", size="medium", rotation=270)

ha is horizontal alignment (I will leave it up to you to discover what va is ;) ). Note that the last 2 texts have different rotations. It is possible to put math notation (LaTeX) in the text, something like this: $\frac{1}{n}$. You need 2 things for this to work: add matplotlib.rc('text', usetex=True) and have LaTeX installed.
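
If installing LaTeX is not an option, matplotlib also ships its own math renderer (mathtext), which copes with simple expressions such as $\frac{1}{n}$ while usetex stays off. A minimal sketch, assuming a made-up plot and the non-interactive Agg backend:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

# Keep usetex off: the text between the $ signs is then rendered by
# matplotlib's built-in mathtext, with no LaTeX installation involved.
matplotlib.rcParams["text.usetex"] = False

fig = plt.figure()
fig.add_subplot(1, 1, 1).plot([1, 2, 4], [1.0, 0.5, 0.25])

# Raw string, so that the backslash reaches matplotlib untouched.
label = fig.text(0.5, 0.01, r"$\frac{1}{n}$",
                 ha="center", va="bottom", size="medium")

# Render to an in-memory png to check that the expression parses.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

The built-in renderer covers only a subset of LaTeX, so anything elaborate still needs the usetex route.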

The end result is then:

Not fantastic, but you now have the techniques to alter at will. Just a final reminder that changing size and resolution will have a strong impact on font and line thickness. You might want to check your chart with all final media (screen, eps to print, …). For instance the example above would clearly benefit from thicker lines and larger fonts.

I am currently in the process of ensuring that two of my bioinformatics applications are multi-platform: newAge, a simulator of age-structured populations, and interPopula, a library to access the HapMap project. I am also responsible for a small part of the Biopython project. I have only been concerned with Linux and Windows (I do not have a Mac, but Linux stuff seems to work there as the Mac has a *nix base). I would like to share my experiences here, maybe for the benefit of others.

The overall experience has been quite positive. I do have a strong Java/JVM background and it seems to me that Python is almost as much write-once, run-anywhere. At least if “anywhere” is old fashioned computer platforms. I would split the issues as follows:

  1. Python code – I have not had a single problem to report. My code includes sub-process management and file system access. The only thing where some care was needed is the use of os.sep (so that directory + os.sep + file yields either directory/file or directory\file). I also maintain Java/JVM applications and I do remember having more problems than this when ensuring cross-platform work (see more below); unfortunately I have forgotten the precise problems, so I cannot document them.
  2. GUI (wxPython) – Here there are indeed some minor problems. The semantics of the API seem slightly different (e.g. Skip methods in events), or at least the Windows implementation might be buggy, as the same event is called twice if there is a .Skip call. There are also some minor layout issues, but those were to be expected.
  3. External expectations – matplotlib can rely on LaTeX to pretty-print text (formulas and such), and one of my scripts did exactly that. In most *nixes LaTeX is around, but not so much on Windows, so there was a subtle, slightly hidden dependency on LaTeX. With most other libraries (e.g. NumPy) there were no big problems to be found.
  4. The database API – This officially sucks! The problem is that parametrized SQL is not standard. For instance, with SQLite one writes “select column1 from table where column2=?” to be able to parametrize the value for column2, but with psycopg (PostgreSQL) you have “select column1 from table where column2=%s”. At least there is a positional version in both cases. Even if you write standardized SQL (and it is possible to write SQL that works in many different flavours of servers), you will end up writing different versions for different drivers because of the non-standardization of parameters.
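
For the database point, one workaround is to ask the driver module which placeholder it expects: DB-API 2.0 modules expose a paramstyle attribute. The sketch below only covers the positional styles mentioned above; the table name samples and the helper build_query are hypothetical, and it is demonstrated with SQLite only, since that ships with Python:

```python
import sqlite3

# Positional placeholder for each DB-API 2.0 paramstyle handled here.
# "pyformat" is what psycopg2 reports; it also accepts the positional %s form.
PLACEHOLDERS = {"qmark": "?", "format": "%s", "pyformat": "%s"}


def build_query(paramstyle):
    """Build the example query with the placeholder the driver expects."""
    try:
        mark = PLACEHOLDERS[paramstyle]
    except KeyError:
        raise ValueError("paramstyle not handled in this sketch: %r" % paramstyle)
    return "select column1 from samples where column2=" + mark


# Demonstration with SQLite (paramstyle "qmark").
conn = sqlite3.connect(":memory:")
conn.execute("create table samples (column1 text, column2 integer)")
conn.executemany("insert into samples values (?, ?)", [("a", 1), ("b", 2)])
query = build_query(sqlite3.paramstyle)
rows = conn.execute(query, (2,)).fetchall()
print(query)
print(rows)
```

This keeps a single query template per statement, but it does not hide the other differences between drivers (named parameters, type adaptation), so it is a mitigation rather than a fix.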

I happen to be also deploying a Java Web Start based application (Groovy+JVM), namely ogaraK, a simulator of malaria population genetics. Multi-platform is not that easy if you have a Swing GUI. The semantics of window sizing operations differ slightly on the Mac. Also some components (like HTML rendering widgets) are buggy in the Linux OpenJDK. Generating Java 6 .class files and then having recent Macs fail because they only go up to 1.5 is irritating. While some newer versions of Mac OS X do indeed support 6, it does not seem realistic for now, if Mac support is desired, to go with anything above 1.5 :( .

All in all, Python fares pretty well as long as database stuff is not involved. I would dare to say that multi-platform GUI development with wxPython is slightly easier than with Java Swing.

PS – Bias disclaimer: I am a strong supporter of the JVM platform (as long as the word Oracle is not included), much more than of Python (in fact a big part of my Python usage is on top of the JVM via Jython). So my “hidden agenda”, if there was one, would be pro-JVM.

ogaraK

I would like to announce ogaraK, a simulator of malaria population genetics.

ogaraK is a Java Web Start application developed in Groovy and Java which allows you to simulate parasite population genetics. It can be used, for instance, to compare the effects of different drug deployment policies.

A set of Python scripts is made available to analyze the results (mainly frequencies of resistance loci over time).

ogaraK is free software (GPL v3).

ogaraK can also be used to simulate some of the common theories about sex: Epistasis, Red-Queen and spatial heterogeneity (but not based on size, as the underlying model has no concept of population numbers).

I would like to ask the reader for some basic help: If you can, could you please test the application (A single click to run, if you have Java 1.5+)? I know most people will not understand the results or the parameters, but just a simple run would help (and reporting back if something goes wrong!). The objective is to make the application available to epidemiologists, and due to their lack of IT knowledge, a robust application is needed. Any comments would really be appreciated (I make no financial profit from this free application). If you know Java Web Start, then if you could activate the console and return any error detected (even from a blind execution), that would be most appreciated.

If you search the web you can find some discussions on whether IDEs for dynamic languages can be as helpful as IDEs for static languages. The issue is that static languages like Java have compile-time information (thus easy to get at IDE-time) with which to provide that fundamental code-completion functionality (among many others). If the IDE knows that a certain parameter is a String, then it is simple: it will present to you all the String methods when you type in the dot. For dynamic languages things get more complex, as there is formally (by definition) no compile-time information. Some people would argue that there are ways around it (which you can already find in existing IDEs; I remember having some sort of code completion for Python, years ago, on SPE). I will not add anything to that discussion here; this preamble was mainly for putting the reader in context. I am more interested in discussing good IDEs for DSLs.

With DSLs you get, most of the time, added syntax. Worse than that, you might fall into situations where you have changed (not only added to) the initial language syntax; furthermore, those syntax changes might even become valid only at runtime (imagine that a method is added to a class that is supplying DSL methods).

One example comes from Ioke and Prolog, whose operator precedence and associativity rules are changeable (see the previous post). It is not trivial to know if something like 1+2 is even syntactically valid (*). Even if it is syntactically valid, things like associativity rules might change. In languages like Groovy you can add methods (e.g., through categories) to code blocks (from classes that can be dynamically changed). Then there is dynamic dispatching and macros. What is valid in a certain piece of code can be different from what is valid a few lines below. In fact, complete information about what is valid in a certain code block might require code execution. Or, to put it another way, it might be very difficult to have a completely helpful IDE! In this scenario there are 3 considerations that I think are worth making:

1. One should not be discouraged by the lack of perfect solutions. Maybe it is not possible to determine all that can be expressed in a certain code block, but sometimes good approximations are enough.
2. On this issue, one good example comes from Prolog: In Prolog, syntax can be changed mainly through the use of the :- op directive (and through asserts and retracts). The :- op directive changes operators but is very easy to analyze before compilation/interpretation. So, the way DSLs are normally constructed in Prolog lends itself very easily to code analysis which can be used by IDEs. This is unfortunately not the case in most real-world languages.
3. It would be cool to have a language where DSL specifications could be automatically used to construct IDEs. The current real-world DSL-able languages (Ruby, Groovy, …) are DSL-enabled through indirect techniques which can be used to build DSLs (dynamic reception, operator overloading, whatever); in fact many of these techniques exist with objectives other than creating DSLs. If there were a declarative and explicit way to create DSLs, that information could be used to inform IDEs on parsing and other issues: an embedded, core way to explicitly specify DSLs.
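
To make the idea of a declarative operator specification concrete, here is a toy sketch (in Python, not Prolog, and not taken from any real IDE): the operators live in a plain table, and both the parser and a tool can read that table without executing the program. All names, and the convention that a higher number binds tighter, are choices made for this illustration:

```python
import re

# Declarative operator table: symbol -> (precedence, associativity).
# In this sketch a higher precedence number binds tighter.
OPERATORS = {
    "+": (10, "left"),
    "-": (10, "left"),
    "*": (20, "left"),
    "^": (30, "right"),
}


def tokenize(text, operators):
    """Split text into integer and operator tokens; reject unknown symbols."""
    tokens = re.findall(r"\d+|\S", text)
    for tok in tokens:
        if not tok.isdigit() and tok not in operators:
            raise SyntaxError("unknown operator: %r" % tok)
    return tokens


def parse(tokens, operators, min_prec=0):
    """Precedence climbing: return an int or a nested (op, left, right) tuple."""
    left = int(tokens.pop(0))
    while tokens and operators[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, assoc = operators[op]
        # Left-associative operators must not absorb an equal-precedence
        # operator on their right-hand side; right-associative ones may.
        next_min = prec + 1 if assoc == "left" else prec
        right = parse(tokens, operators, next_min)
        left = (op, left, right)
    return left


print(parse(tokenize("1+2*3", OPERATORS), OPERATORS))
print(parse(tokenize("2^3^2", OPERATORS), OPERATORS))

# The same text parses differently once the table changes.
swapped = dict(OPERATORS, **{"+": (20, "left"), "*": (10, "left")})
print(parse(tokenize("1+2*3", swapped), swapped))
```

Whether 1+2 is valid at all, and how it groups, depends only on the table, which is exactly the kind of information an editor could read ahead of time.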

(*) I suppose some will see this as an argument for the fact that you can do pretty stupid (or at least unintuitive) things with DSLs. Well, you can do stupid things with everything. The question is not if you can or not, but the extent of bad use cases and how bad uses can creep in easily. Another (interesting) discussion, but not for now.

Preamble: In order to understand this post you should know a little bit (a little is enough, that is how much I know) about ExpandoMetaClass and Categories in Groovy.

DSLs that involve existing classes might be a source of long term sorrow. Let me give an example: Imagine that you want to make a small DSL to handle equations, like

x = new Symbol("x")
(2 * x).differentiate(x) //Result is 2

The problem is that the * operator of Numbers doesn’t know how to handle Symbols, therefore an exception would be raised. The obvious solutions as discussed before on mailing lists and blog posts are:

Categories

Categories would solve the problem, but at the expense of polluting the source with things like

use (Something.Category) {
  //code here
}

Not a disaster, but not pretty either…

Talking about disasters…

Expando over Numbers

The idea here would be to change the behavior of Numbers to be able to handle Symbols. Code would be very clean, no need for uses…

As somebody said on the Groovy mailing list: this is a disaster in the making. The problem is that I change Numbers, then, for another valid reason, you change Numbers, somebody else also changes Numbers… This is chaos. Or at least it would make code from different sources potentially not interoperable, or exhibiting very strange, buggy behavior. This is clearly akin to the “global variable” problem. I believe that in the long term and with big software projects, this approach is a dead end.

Enter Python

Python actually has a workaround (I will not call it a clear, beautiful solution) that might be somewhat useful here. Imagine that you do

1 + x

The default 1 (default class for number) is not able to handle the symbol. For Python that is OK: it will try to call a “right add” method of x (search for __radd__ in this page). So, the default behavior is not to raise an exception if the left object cannot handle the operator, but to try to call the “right” version on the right object (and to raise only if that fails too).
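
A minimal sketch of that fallback, with hypothetical Symbol and Expr classes written just for this illustration (they are not from any library):

```python
class Expr(object):
    """A binary expression node (hypothetical, for illustration only)."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def __repr__(self):
        return "(%r %s %r)" % (self.left, self.op, self.right)


class Symbol(object):
    """A minimal symbolic variable (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

    def __add__(self, other):
        # Handles `x + 1`: Python asks the left operand first.
        return Expr("+", self, other)

    def __radd__(self, other):
        # Handles `1 + x`: int.__add__ returns NotImplemented for a Symbol,
        # so Python falls back to the right operand's __radd__.
        return Expr("+", other, self)

    def __rmul__(self, other):
        # Same fallback for `2 * x`.
        return Expr("*", other, self)


x = Symbol("x")
print(1 + x)
print(2 * x)
```

The number class is never touched: all the new behavior lives in Symbol, which is what avoids the shared-modification problem described above.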

Not perfect, but might be just enough to avoid Expando in anger.

I do believe that people still don’t appreciate the consequences of Expanding core classes and the interop disaster that that can entail.

There seems to be some competition in the field that can be vaguely defined as “The next Java”(TM).

I don’t know if there will be a “next Java” to start with. Things seem to be shaping up in a way where the JVM is our common interoperability platform and on top of it we have an ecology of JVM-based languages.

I have used Jython quite a lot but have several doubts about it: not only about the current status of Jython (it lags a bit behind CPython), but I also dislike Python (when compared with the other languages discussed here). As such I decided to evaluate the others: Scala, Ruby and Groovy.

I have done a couple of small projects in Scala (a prototype DSL for modeling malaria resistance is available here) and JRuby. I am now starting with Groovy, and I think I’ve found my new love. Here I will try to explain why, among Groovy, Scala and JRuby, I have chosen Groovy. To preempt any idea of a religious war, I would like to say I have full respect for Scala and Ruby, which are, with Caml and Prolog, among my favorite languages (for a true crusade and flame ask me for my opinion about Perl or Visual Basic 6 ;) ).

Steven Devijver suggests that Groovy is the language with the most syntactic similarities with Java. I would say that not only in syntax, but also in semantics and everything else, Groovy is the closest language to Java. And that is a good thing. The world (both in programming languages and all the rest) is never revolutionary. Revolutions, when they rarely happen, are either a disgrace or are not that much of a big change below the surface. People normally prefer (for good and bad reasons) the path of least short-term pain. Groovy delivers that: almost 0 cost in starting to code coming from a Java background. Most importantly, Groovy does that but still delivers most of the new goodies. This is actually the cornerstone of my argument: path of least pain while delivering the good stuff (in some cases better than the competition, as we will see).

Let me start with the fundamental reasons why I dismiss JRuby (which is, nonetheless, my second option after Groovy). First, I would like to say, very honestly, that the work of the JRuby guys is nothing short of outstanding! But I have 3 problems:

  1. One, by definition, JRuby is based on Ruby, a language from outside the JVM. That means semantic hurdles, coupling issues between the two worlds (think, e.g., libraries)
  2. Most importantly (but connected with the first point): Typing. I am a bit far away from computing issues currently (I work with malaria now, so excuse me if I mix up strong/explicit typing and such), but clearly the typing system of Ruby makes life hard for IDEs (think of the IDEs needed to tame those over-engineered Java APIs) and automated tools around code. Debugging without explicit typing is also a pain in a big program (I actually suffered my first debugging nightmare with typing systems with Caml, arguably the mother of Scala). Some might say that Scala type inference and Groovy duck typing are also problematic in this respect; while the argument might be correct, both languages have mechanisms to support typical Java explicit/strong typing and as such profit from IDEs and automated analysis tools.
  3. Ugly perlisms. Although I have read somewhere that those might be deprecated in the future.

Ah… Scala… Mats Henricson argues that Scala is the only option because of elegance regarding multicore computing. I fundamentally disagree with his point: multicore programming is fundamental but Scala is not really a good solution. But before we get there, let’s talk about other Scala issues.

Type inference. I have some experience with the “mother” of Scala, Caml. Type inference in Caml is really elegant: I don’t remember a single case of it failing and requiring the programmers’ help in discovering the type of a parameter. That is not the case with Scala, several times the compiler seems to be “lost in translation”. Some might say that this is because of JVM imposed constraints, but if that is the case then it would raise the argument of bringing a language with a foreign semantics to the JVM and the ugliness attached to the process.

My biggest peeve? Metaprogramming. I won’t give you my opinion about it because it really doesn’t exist. It is on the Scala wiki in the section “future”. I am sorry, but a 21st century language where metaprogramming is absent can only be called “beta stage”. As a side note, there seems to be something lost in the ML branch of functional programming from Lisp in this regard (no introspection and such), which is a shame (How is Haskell in that respect?).

Ok, multicore computing. This is an area where I have some experience in the JVM: [Shameless plug] I invite you to have a look at my Java Web Start, Jython based, multicore aware evolutionary biology workbench LOSITAN. Furthermore I have written tutorials for the multicore paradigm and bioinformatics:

Bioinformatics, multi-core CPUs and grid computing: Introduction (1/4)


Bioinformatics, multi-core CPUs and grid computing: User perspective (2/4)

Most importantly in this context: Bioinformatics, multi-core CPUs and grid computing: developer perspective (3/4)

Mats argues that Scala Actors and immutable data types provide a simple and elegant solution to the extremely complex problem (I am calling it extremely complex, because I think it really is) of concurrent programming. Immutable data types… Does anyone believe that the hordes of existing Java developers/programmers are ready and willing to make the radical conceptual jump to immutable data types? The change from C++ to Java was minor in terms of semantics; even the change from C to C++ was much less radical than a change requiring you to “get rid of all variables”. How do you think the majority of programmers will react when you say: “Forget variables”? Moreover, as Scala allows for an imperative style of programming, what do you think most programmers’ idiom will be: imperative or functional? To make things worse, in Scala an immutable is called a “val” and a mutable a “var”. Am I the only one picturing hordes of developers with tight deadlines just swapping L’s for R’s?

I speak for myself here: in spite of having probably more experience with “immutable” languages (Prolog a lot, Caml a bit) than most developers, when I wrote Scala code, my reasoning was so tainted by “real world” imperative languages that it was really hard to write in a functional dialect. I have the background, enough free time, and the motivation to write functional code, but it was hard to get back in that mindset.

Scala only apparently solves the multi core problem. Give it to a typical developer and he will write imperative code, unless you put a functional zealot behind him (and give the said zealot a strong, resistant whip).

How to address the multicore issue? Clearly we have a problem here. A few ideas:

  • In many applications there is no big need to go multicore. In some cases let’s not try to solve a problem that doesn’t exist in the first place.
  • Many multicore applications can survive very well with simple concurrency management. Not all applications require a PhD in concurrent programming.
  • Scala and the like. For those who can and are willing to go functional, why not? I have nothing against that. My only argument is that it won’t be mainstream.
  • The way of PAIN. Most developers will continue to use old languages and paradigms and SUFFER with them. Only after much suffering will there be motivation to try out new things and, say, endure the pain of learning a new paradigm. That suffering still hasn’t happened; only after this becomes a big problem will there be interest in accepting new solutions.
  • A silver bullet that can be attached to the current programming paradigm. Sometimes it happens. Don’t misunderestimate (silly Bushism intended) the power of a “Black Swan” (A reference to Taleb’s book where he discusses the impact of the unexpected important events).

To finalize, I would like to say that I am not sticking with Groovy out of being conservative. Groovy seems to beat the competition in many areas (the biggest example is metaprogramming) and strikes a very good balance between being a “small evolutionary step” and delivering the goodies.

To really finalize, a caveat: my Groovy knowledge is still limited, one of these days you might read a post where I apologize for having written this ;)

Originally posted on Perfect Storm