Posts tagged ‘java’

Preamble: The problem with writing a defense of a certain thing X is that most people interpret it as an attack on potential alternatives. This is not how this post should be read. There is no One Single Solution. My defense of Groovy is based on a set of assumptions that do not hold true for many people. In fact, they do not even hold true for myself. Different people and different development problems entail different solutions.

My main assumption in defense of Groovy is that you are a Java person. Java is your day-to-day programming language and you are comfortable with it. Even so, you want to try something different: maybe you want to try a scripting language, maybe you do not like too much boilerplate code.

Why Groovy? Because it gives you a lot of goodies with essentially no learning curve. An illustrative example: I’ve spent a whole morning thinking that I was editing a Java source file, when I was in fact working on a Groovy file. You see, Groovy, while not a superset of Java, ends up being almost that: code in Java is, in most cases, already code in Groovy. So, if you know Java, you can write Groovy. You will not write the best idiomatic Groovy, and you will not gain any of Groovy’s goodies (and you will pay a performance penalty, BTW). But this is an amazing head start if you want to go in the direction of higher-level languages (I would argue that it is even smoother than going from C to C++, as the paradigm does not change from Java to Groovy: it is OO to OO). If you go the Jython way then you have to learn a new language. If you go Scala or Clojure then you have to learn a whole new paradigm (and deal with the impedance mismatch between the imperative semantics of typical Java libs and standard functional semantics).

For little to no cost you now have many of the goodies associated with scripting languages (dynamic typing, low-boilerplate coding, DSLs, better meta-programming, …). In some cases Groovy even out-competes supposedly more elegant languages (are Scala’s meta-programming facilities still as bad as in the past?).

Another interesting advantage of Groovy is that reverting to Java, should you want to, is much easier. Why would you want to do that? Well, for performance reasons. In a Groovy application that I have, a small part of the code is extremely intensive, so I had to rewrite it in Java. This turned out to be a trivial exercise (similar syntax, similar semantics).

Again, let me stress this: your requirements and your personal path are fundamental in any decision you take. There is no one true language (OK, Prolog… ;) ). Different people, different approaches. All I am saying is: if your background is strongly grounded in Java and you feel comfortable with that, then Groovy is probably the way to go.

Disclaimer: While I have a couple of applications made in Groovy, most of my scripting efforts in the JVM world involve Jython (another fine language implementation, appropriate in a different set of circumstances). I also believe that, from a declarative and highly expressive language point of view, what is being done with Clojure certainly deserves mention.

First, let me tell you a little story: A few years ago I worked for a university and we decided to replace a closed, proprietary backend infrastructure with an open one. The strange thing is that we replaced IBM with… IBM. You see, a long time ago IBM had a self-centered, closed mentality, but since Gerstner’s time as IBM CEO things have changed a lot. So, the old unpalatable IBM was replaced with the new, desirable IBM. A similar argument can be made for Sun… you probably still remember the closed Sun, the closed Java. Interestingly, when we replaced IBM with IBM we also replaced the database backend: from Oracle to IBM.

Oracle.

Let me continue my little story: Fast-forward a few years and I accepted a job at the same university as data centre manager. The 3 biggest universities in the city had bought the same accounting/payroll system. This system had an Oracle backend: database, application server and financial extensions. It so happened that each component had completely different licensing terms: one was based on concurrent connections, another on concurrent users, yet another on the total number of users. The notion of user also varied: the financial application had thousands of users, but it connected to the application server as a single user. The usage of multiple CPUs was also subject to licensing. I hate to think of the countless hours spent by the data centre managers just parsing all this. It still gives me a headache.

But it does not stop here: even as a manager I was still very technically inclined (and had previous experience as a database manager). I would subscribe to the view that Oracle databases are overly complex systems that seem to be highly profitable to Oracle and to the contractors that are paid to maintain such complexity. Between Oracle DB and IBM DB/2 I would take DB/2 any time (at least as of 5 years ago, when my professional path changed). Of course, for most applications PostgreSQL is more than enough…

The point is, as you probably have noticed by now: I am not comfortable with Oracle’s corporate culture. I am not comfortable with Oracle taking the lead on Java. In fact I can think of no worse corporation to lead Java.

I would also like to draw your attention to size and power bias. My point is this: people tend to dislike Microsoft, but part of that dislike does not come from corporate culture but from the influence and power that Microsoft has (still) in computing. I dread to think of the consequences of people like Larry Ellison or Steve Jobs having the same influence as Bill Gates. I am not defending Bill, just suggesting that it could be much worse. We do not feel the pain of Oracle (or Apple for that matter) because they do not have the massive OS/Office market share that MS has.

I always had a love relationship with Java (though I have a background in declarative languages). It was amazing to see the appearance of new languages on top of the JVM (I use Groovy, Clojure and Jython). The idea of a shared VM (and shared libraries) with a set of cooperating and competing languages on top is wonderful…

But I am starting to doubt that the JVM world will stay an open community where the best ideas can be incepted and flourish.

If IBM had bought Sun…

I do not trust you, Larry!


Recommended reading (from James Gosling’s blog):
Quite the firestorm
The shit finally hits the fan
Cynical chuckles

BioJava has a parser for the nexus file format. While Nexus supports quite a lot of information (like DNA sequences), the most complex part to process is the phylogenetic tree descriptions based on the Newick format. Below you can find a small tutorial on how to process these.

BioJava’s parser relies on JGraphT to create a representation of the “tree”. The tree is really more of an acyclic graph than a tree, though some trees are rooted (and therefore trees in the proper sense). Manipulating the JGraphT weighted graph is the complicated part, not really the BioJava interface. Note that JGraphT objects can be easily rendered using the JGraph library (yeah, it is confusing: there is one lib with graph algorithms called JGraphT and another, for visualization, called JGraph).

In this small tutorial, we will only try to write a textual representation of a tree.

Imagine this simple nexus file:

#NEXUS

Begin TREES;
	tree test1 = (1,2);
	tree test4 = (1:0.1,(2:0.2,3:0.3):0.4);
End;

We just want to draw this:

coder@move-on:~/development/biobug/test$ java Test test1.nex test1
Will process file test1.nex tree test1
p0
  1: 1.0
  2: 1.0
coder@move-on:~/development/biobug/test$ java Test test1.nex test4
Will process file test1.nex tree test4
p0
  1: 0.1
  p1: 0.4
    2: 0.2
    3: 0.3

So, the first tree (test1) is composed of nodes (leaves) 1 and 2 and the inner node, which was named p0 in the output above. The second tree (test4) has distances.

By the way, we will also want to know which trees are in the file.

Let’s start!

So, we start by loading and parsing the file:

import java.io.File;
import org.biojavax.bio.phylo.io.nexus.*;
 
[...]
        //file is a String with the name of the file to be processed
        NexusFileBuilder builder = new NexusFileBuilder();
        NexusFileFormat.parseFile(builder, new File(file));
        NexusFile nexus = builder.getNexusFile();

Nexus files have several blocks (Taxa, Data, Tree, Set). We are interested in getting the Tree block, so let’s write a function for that:

    TreesBlock getTreeNode(NexusFile nexus) {
        Iterator it = nexus.blockIterator();
        NexusBlock block;
        while (it.hasNext()) {
            block = (NexusBlock) it.next();
            if (block.getBlockName().equals("TREES")) {
                return (TreesBlock) block;
            }
        }
        return null;
    }

We get the nexus block iterator and go through it until we find a block whose name is TREES, and return that block.

Now that we have the TREES block, let’s print the names of all trees:

    void printTrees(NexusFile nexus) {
        TreesBlock node = getTreeNode(nexus);
        Map trees = node.getTrees();
        Set keys = trees.keySet();
        System.out.println("Trees:");
        for (Object obj : keys) {
            System.out.println(obj);
        }
    }

We get a map, where the key is the name of the tree and the value is the tree as, essentially, a String-based representation (with some annotated info), not a graph. Now, given a certain name, let’s get the graph:

import org.biojava.bio.seq.io.ParseException;
import org.jgrapht.*;
import org.jgrapht.graph.*;
 
[...]
 
    WeightedGraph<String, DefaultWeightedEdge> getTree(NexusFile nexus, String name)
    throws ParseException {
        String topNode;
        TreesBlock node = getTreeNode(nexus);
        WeightedGraph<String, DefaultWeightedEdge> graph = node.getTreeAsWeightedJGraphT(name);
        topNode = node.getTopNode();
        System.out.println("The top node is: " + topNode);
        return graph;
    }

Note that getTreeAsWeightedJGraphT will do some parsing, so a ParseException can be raised. Note also that the top node name can be retrieved (in the case of tree test1, it is named p0 in the output above). Some considerations: you can change the rules used to create internal nodes; if there are name clashes, inner nodes will be renamed (not leaves!).

Regarding the top node, we call it top node and not root node. While from a data structure perspective the tree has a root, from a phylogenetic perspective the tree might be rooted or not (in which case being the root has no meaning, and it is really just a simple weighted acyclic graph). How to know if the tree is rooted? Remember the function to get all trees (getTrees)? The value of the map has a method called getRootType. So, to know if a tree is rooted, you need to use that method. Not the best design… but at least it works.

Ok, now we just need to print a tree…

    static void dump(WeightedGraph<String, DefaultWeightedEdge> graph,
            String parent, String node, String depth) {
        Set<String> verts = graph.vertexSet();
        String vertex = "";
        for (String candidate : verts) {
            if (candidate.equals(node)) {
                vertex = candidate;
                break;
            }
        }
        System.out.print(depth + vertex);
        if (parent != null) {
            System.out.print(": " + graph.getEdgeWeight(graph.getEdge(parent, vertex)));
        }
        System.out.println();
        for (DefaultWeightedEdge e : graph.edgesOf(vertex)) {
            if (graph.getEdgeSource(e).equals(node)) {
                dump(graph, vertex, graph.getEdgeTarget(e), "  " + depth);
            }
        }
    }

Ok, this is the complicated part. Note the following:

  • The complexity has to do with processing graphs
  • dump is a recursive function
  • Node is a synonym of Vertex
  • Notice the important bit: if you know that there is a node called “bla”, it is not enough to do graph.containsVertex(“bla”). The answer will probably be false. Remember that what matters here is reference equality (i.e. ==) and not content equality (.equals). See the reminder below
  • Finally we go through all edges referencing the current vertex and choose the ones that start on the current one. Again, if the tree is unrooted, the notion of direction does not apply, but it is still good to do a “tree” traversal
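
The recursion described in the list above can be tried without BioJava or JGraphT. What follows is only a minimal sketch under stated assumptions: the class name TreeDump and the nested map (parent name to child name to edge weight) are hypothetical stand-ins for the weighted graph, and the data is the test4 tree from the nexus file, typed in by hand.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class TreeDump {
    // Hypothetical stand-in for the JGraphT graph: parent -> (child -> edge weight).
    static Map<String, Map<String, Double>> test4() {
        Map<String, Double> p0 = new LinkedHashMap<>();
        p0.put("1", 0.1);
        p0.put("p1", 0.4);
        Map<String, Double> p1 = new LinkedHashMap<>();
        p1.put("2", 0.2);
        p1.put("3", 0.3);
        Map<String, Map<String, Double>> children = new LinkedHashMap<>();
        children.put("p0", p0);
        children.put("p1", p1);
        return children;
    }

    // Recursive dump: write the node (and its edge weight, if it has a parent),
    // then visit each child one indentation level deeper.
    static void dump(Map<String, Map<String, Double>> children, String node,
                     Double weight, String depth, StringBuilder out) {
        out.append(depth).append(node);
        if (weight != null) {
            out.append(": ").append(weight);
        }
        out.append('\n');
        Map<String, Double> kids =
                children.getOrDefault(node, Collections.<String, Double>emptyMap());
        for (Map.Entry<String, Double> kid : kids.entrySet()) {
            dump(children, kid.getKey(), kid.getValue(), depth + "  ", out);
        }
    }

    // Render the hand-typed test4 tree starting from its top node.
    static String render() {
        StringBuilder out = new StringBuilder();
        dump(test4(), "p0", null, "", out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render());
    }
}
```

Because the stand-in map only stores parent-to-child links, the "choose the edges that start on the current vertex" step disappears here; with the real graph that filter is still needed.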

We end here.

Regarding the “equal” issue remember that:

        System.out.println("a" == new String("a"));
        System.out.println("a".equals(new String("a")));

This prints false, then true, in that order. This is important when traversing the graph. If you know that the reference is the same (and it is when we call getEdgeTarget) then you can use it directly. If you don’t know (for example, you pass a String that you have constructed yourself or got from some other place), then you need to go through the vertex/node list and do a .equals to get the correct vertex.
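
The "go through the list and do a .equals" step can be isolated in a few lines of plain Java. This is a sketch only: VertexLookup and findVertex are hypothetical names, and a LinkedHashSet stands in for the vertex set that the graph would return.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class VertexLookup {
    // Return the instance stored in the set whose content equals name,
    // or null when no vertex has that content.
    static String findVertex(Set<String> vertices, String name) {
        for (String candidate : vertices) {
            if (candidate.equals(name)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> vertices = new LinkedHashSet<>();
        String stored = new String("p1");   // the instance held by the set
        vertices.add(stored);

        String mine = new String("p1");     // same content, different object
        System.out.println(mine == stored);                         // false
        System.out.println(findVertex(vertices, mine) == stored);   // true
    }
}
```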

A small example with all the above, is here, ready to use.

If you search the web you can find some discussions on whether IDEs for dynamic languages can be as helpful as IDEs for static languages. The issue is that static languages like Java have compile-time information (thus easy to get at IDE-time) with which to provide that fundamental code-completion functionality (among many others). If the IDE knows that a certain parameter is a String, then it is simple: it will present to you all the String methods when you type in the dot. For dynamic languages things get more complex, as there is formally (by definition) no compile-time information. Some people would argue that there are ways around it (which you can already find in existing IDEs; I remember having some sort of code completion, years ago, on SPE, for Python). I will not add anything to that discussion here; this preamble was mainly for putting the reader in context. I am more interested in discussing good IDEs for DSLs.

With DSLs you get, most of the time, added syntax. Worse than that, you might fall into situations where you have changed (not only added to) the initial language syntax; furthermore, those syntax changes might even become valid only at runtime (imagine that a method is added to a class that is supplying DSL methods).

One example comes from Ioke and Prolog, whose operator precedence and associativity rules are changeable (see the previous post). It is not trivial to know if something like 1+2 is even syntactically valid (*). Even if it is syntactically valid, things like associativity rules might change. In languages like Groovy you can add methods (e.g., through categories) to code blocks (from classes that can be dynamically changed). Then there is dynamic dispatching and macros. What is valid in a certain piece of code can be different from what is valid a few lines below. In fact, complete information about what is valid in a certain code block might require code execution. Or, to put it another way, it might be very difficult to have a completely helpful IDE! In this scenario there are 3 considerations that I think are worth making:

1. One should not be discouraged for not having perfect solutions. Maybe it is not possible to determine all that can be expressed in a certain code block, but sometimes good approximations are enough.
2. On this issue, one good example comes from Prolog: In Prolog, syntax can be changed mainly through the use of the :- op directive (and through asserts and retracts). The :- op directive changes operators but is very easy to analyze before compilation/interpretation. So, the way DSLs are normally constructed lends itself very easily to code analysis which can be used by IDEs. This is unfortunately not the case in most real-world languages.
3. It would be cool to have a language where DSL specifications could be automatically used to construct IDEs. The current real-world DSL-able languages (Ruby, Groovy, …) are DSL-enabled through indirect techniques which can be used to build DSLs (dynamic reception, operator overloading, whatever); in fact many of these techniques exist for purposes other than creating DSLs. If there were a declarative and explicit way to create DSLs, that information could be used to inform IDEs on parsing and other issues. An embedded, core way to explicitly specify DSLs.

(*) I suppose some will see this as an argument for the fact that you can do pretty stupid (or at least unintuitive) things with DSLs. Well, you can do stupid things with everything. The question is not if you can or not, but the extent of bad use cases and how bad uses can creep in easily. Another (interesting) discussion, but not for now.

During my “silent months” (for details see this post) I’ve been developing a simple system to study the spread of antimalarial drug resistance. It is a “typical” scientific application with a core (which simulates genetic recombination of individuals reproducing) which is computationally very demanding.

As is common in these scenarios I started by developing a prototype in a high-level, declarative language (in my case Groovy). I was pretty sure that the first solution would be slow as hell, and part of that slowness would be due to using a “scripting” language (although algorithmic complexity is the main cause of slowness, changing the language should at least get running times down by 1 order of magnitude). The initial solution was in fact slow. So I proceeded to do the usual thing: identify the expensive part (easy in my case) and rewrite that part in Java. My intention was to end up with a typical hybrid system: core, computationally intensive code in Java and high-level functions in Groovy, for easy and productive manipulation.

Converting from Groovy to Java is easy, in fact it is too easy: the final Java code was full of Groovyisms: legacy generics code (things like Map<String,List<Integer>>) and code originating from .each constructs, among other things that made the Java code look very strange (from a Java perspective).

Needless to say, there was not much of a speed improvement. In order to improve things I started by making sure that the data structures behind List<> had the required complexity for my most used operations. Not much improvement. I then decided to completely convert things like List<List<Integer>> to the typical Java int[][]. Spaghetti and semantic chaos followed (just think of the not-so-minor differences in semantics between lists of lists and [][]).
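
Two of those differences are easy to show in isolation. The sketch below is only an illustration with made-up values (ListVsArray and its methods are hypothetical names, not code from the simulator): a List<Integer> has two remove overloads that behave very differently, while an int[][] has a fixed shape and comes pre-filled with zeros.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListVsArray {
    // remove(int) treats its argument as an index.
    static List<Integer> removeAtIndex() {
        List<Integer> row = new ArrayList<>(Arrays.asList(10, 20, 1));
        row.remove(1);                     // drops the element at index 1 (the 20)
        return row;                        // [10, 1]
    }

    // remove(Object) treats its argument as a value.
    static List<Integer> removeValue() {
        List<Integer> row = new ArrayList<>(Arrays.asList(10, 20, 1));
        row.remove(Integer.valueOf(1));    // drops the value 1
        return row;                        // [10, 20]
    }

    public static void main(String[] args) {
        int[][] grid = new int[2][3];                    // fixed shape, every cell is 0
        List<List<Integer>> lists = new ArrayList<>();   // empty, rows must be added

        System.out.println(grid[1][2]);        // 0
        System.out.println(lists.size());      // 0
        System.out.println(removeAtIndex());   // [10, 1]
        System.out.println(removeValue());     // [10, 20]
    }
}
```

A line-by-line conversion has to decide, at every call like these, which of the two meanings the original code relied on.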

Being a member of the fundamentalist church of refactoring I decided to do the unthinkable: throw the code away and rewrite it from scratch. I would rewrite the whole code, starting from the core in Java, in an idiomatic Java way, targeting performance. Then, on top of that, I would grow a set of Groovy wrappers in order to easily manipulate the said core. It worked perfectly! Actually I am running that code in the background (on an Asus EEE) as I write this.

The (somewhat elusive) lesson that I took from this is that going from prototype to production code, when the fundamental difference is performance, can be cumbersome if the prototype language is too close to the production language (and Groovy and Java are close enough). The temptation to do a line by line code conversion is too strong for comfort (I actually did rename the computationally intensive .groovy to .java and translated line by line; feel free to call me silly) and can have very upsetting results.

First a personal note: I did not write (or do any “things on the Internet”) for the best part of 2008. Although part of it was due to a busy schedule, most of it was due to illness (being obsessive-compulsive has some strange consequences). I finally decided to tackle my health issue (which is solvable, at least in my case).

Anyway, I’m still working in computational biology, still working with malaria, and I am still working with Groovy. So… Let’s get back to the usual topics…

Before I start, a caveat: “Over-engineering”, as used below, should not be seen as scornful. We all know that traditional OO languages and libraries try to be as general-purpose as possible and deployable in industrial software processes. In that setting, languages and libraries which present themselves in a typical OO setting are, understandably, “over-engineered”.

The so-called scripting languages (for lack of a better word, let’s stick to it) are supposedly more productive than traditional languages (especially in small to medium size projects). Languages like Java are “over-engineered” beasts, seen as general-purpose, “industrial”, heavy-duty. Our beloved scripting languages fit our brain, they are agile, we can be highly productive, write fewer lines of code, accomplish more, be more declarative.

Can we really? Let’s consider a subset of those languages, those like Groovy or Scala which were developed for the JVM (or all languages that were ported to the JVM). One of the pluses of these languages is that they can use the whole JVM ecology of libraries. The problem is that most of those libraries are developed with a Java mentality (i.e., they are over-engineered). An example:

The fantastic JFreeChart library produces high-quality 2D charts of all kinds. It has all the flexibility that we expect from the typical Java library: you can do everything. In the Groovy landscape there is also a Builder for it, groovychart, of which I am a minor author. But, whenever I want to plot a chart, my first impulse is to use the (also great) matplotlib (CPython based). Why? Because plotting a line chart in matplotlib takes 3 lines, which I remember without going to the documentation:

from pylab import *
plot([1,2,3])
show()

Really, it worked at the first attempt.

In groovychart? I am not even sure of the whole process, but it involves starting Swing, preparing the dataset, choosing the chart, … And again, I am one of the authors, I do groovycharts every day, but I still need to go to a template to do something that takes 3 lines in matplotlib. Matplotlib fits my brain, groovychart doesn’t.

While having all the Java libraries at hand is obviously a good thing, there needs to be a “scriptization” of many of those libraries. There is a need for interfaces that “fit the brain”. Groovy + JVM libraries only tackles half of the problem. Even JVM libraries with a Groovy idiom (like groovychart) don’t address the “over-engineering” problem. What is needed, in my view, are wrappers which are not only Groovy-idiomatic but also Groovy-philosophical: they fit the brain (wrappers which allow you to plot a simple line chart in 3 lines).

This can actually be seen in Groovy itself for IO and many data structures: Some core Java libraries are well covered and are already available in a “fit-your-brain” interface. Hopefully we will see more interfaces like this for many existing libraries (and less like groovychart, which are only idiomatic wrappers).

PS – Another way to tackle “over-engineering” problems comes from good IDEs and, in fact, the average modern Java IDE goes to great lengths in reducing “over-burden”.

Preamble: In order to understand this post you should know a little bit (a little is enough, that is how much I know) about ExpandoMetaClass and Categories in Groovy.

DSLs that involve existing classes might be a source of long term sorrow. Let me give an example: Imagine that you want to make a small DSL to handle equations, like

x = new Symbol("x")
(2 * x).differentiate(x) //Result is 2

The problem is that the * operator of Numbers doesn’t know how to handle Symbols, therefore an exception would be raised. The obvious solutions as discussed before on mailing lists and blog posts are:

Categories

Categories would solve the problem, but at the expense of polluting the source with things like

use (Something.Category) {
  //code here
}

Not a disaster, but not pretty either…

Talking about disasters…

Expando over Numbers

The idea here would be to change the behavior of Numbers to be able to handle Symbols. Code would be very clean, no need for uses…

As somebody said on the Groovy mailing list: this is a disaster in the making. The problem is that I change Numbers, then, for another valid reason, you change Numbers, somebody else also changes Numbers… This is chaos. Or at least it would make code from different sources potentially not interoperable, or exhibiting very strange, buggy behavior. This is clearly akin to the “global variable” problem. I believe that in the long term and with big software projects, this approach is a dead end.

Enter Python

Python actually has a workaround (I will not call it a clear, beautiful solution) that might be somewhat useful here. Imagine that you do

1 + x

The default 1 (an instance of the default number class) is not able to handle the symbol. For Python that is OK: it will try to call a “right add” method of x (search for __radd__ in this page). So, the default behavior is not to raise an exception if the left object cannot handle the operator, but to try to call the “right” version on the right object (and to raise only if that fails too).
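
For readers who think in Java, the dispatch rule can be imitated by hand. This is only an analogy under stated assumptions: Term, Num, Symbol, Sum and plus are hypothetical names, and returning null plays the role that Python gives to its "I cannot handle this operand" signal. It is not how Python or Groovy implement operators.

```java
public class ReflectedAdd {
    interface Term {
        // Return the sum, or null when this term cannot handle the right operand.
        default Term add(Term right) { return null; }
        // "Right add": asked only after left.add(this) has given up.
        default Term radd(Term left) { return null; }
    }

    static final class Num implements Term {
        final int value;
        Num(int value) { this.value = value; }
        // A Num only knows how to add another Num.
        @Override public Term add(Term right) {
            return (right instanceof Num) ? new Num(value + ((Num) right).value) : null;
        }
        @Override public String toString() { return Integer.toString(value); }
    }

    static final class Symbol implements Term {
        final String name;
        Symbol(String name) { this.name = name; }
        @Override public Term add(Term right) { return new Sum(this, right); }
        @Override public Term radd(Term left) { return new Sum(left, this); }
        @Override public String toString() { return name; }
    }

    static final class Sum implements Term {
        final Term left;
        final Term right;
        Sum(Term left, Term right) { this.left = left; this.right = right; }
        @Override public String toString() { return "(" + left + " + " + right + ")"; }
    }

    // Try the left operand first, then the right operand's radd, then fail.
    static Term plus(Term left, Term right) {
        Term result = left.add(right);
        if (result == null) {
            result = right.radd(left);
        }
        if (result == null) {
            throw new UnsupportedOperationException("unsupported operands");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(plus(new Num(1), new Num(2)));        // 3
        System.out.println(plus(new Num(1), new Symbol("x")));   // (1 + x)
    }
}
```

The point of the design is that Num never had to be changed to learn about Symbol; only the new class carries the extra knowledge.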

Not perfect, but might be just enough to avoid Expando in anger.

I do believe that people still don’t appreciate the consequences of Expanding core classes and the interop disaster that that can entail.

I tried to drill down the Groovy performance issue that I had with what is in practice a text processing exercise.

The original code was written in Groovy (and then ported to Java, not the other way around), but as I was in a hurry it was written in idiomatic Java (I am too much of a Groovy newbie to be able to write in idiomatic Groovy if I am in a hurry). Ted Naleid left some great suggestions on how to be more groovyish.

Anyway, I took my original code and tried to understand what was going on, here are my findings.

Replacing duck typing with explicit typing took a minute out (from 4m to 3m).

Converting this

iCase.each {
    if (jCase.contains(it)) {
        isDifferent = false
    }
}

to this

if (jCase.contains(iCase[0]) || jCase.contains(iCase[1])) {
    isDifferent = false
}

took 1m10s out (from 3m to 1m50s). This is in an inner loop.

8 seconds were gained by changing this:

for (int j=i+1; j<indivs.size(); j++) {

into

int iSize = indivs.size()
for (int j=i+1; j<iSize; j++) {

As inline comments you can find how much time each line took in the following inner loop:

for (int j=i+1; j<iSize; j++) {
    int jPos = indivPos[indivs[j]] //~20s
    String jCase = lineTok[jPos] //~10s
    if (jCase.equals('NN')) continue //~8s
    boolean isDifferent = true //2s
    if (jCase.contains(iCase[0]) || jCase.contains(iCase[1])) {
        isDifferent = false //7s
    } //whole if is ~ 30s  - 23 condition, 7 assignment
    if (isDifferent) counts[indivs[i] + indivs[j]] += 1 //5 secs
}

The only stunning thing is the time lost at indexing String arrays (and maps, but that I can understand).

This text is being written as I change and try things. I gained 20s from minor changes of which I lost track :) . I am currently at 1m30s (down from the original 4m and comparing with Java’s 4s).

I think that language performance (from a speed point of view) is highly overrated. There are many factors that are more important. Well, on the time front alone, developer time is normally more important: the time spent grokking code is normally more expensive than the running time of an application. Of course, there are many other important points, too many to enumerate (portability, declarativeness, readability, …). If performance were the fundamental variable, we would all be using assembler.

I am currently doing a bit of code to go over tens of millions of lines of text, while comparing separate columns.

I did a little piece of Groovy code to go over all those lines. The performance results were abysmal, so I decided to do a program in Java (copying the Groovy code to a Java file and converting in a very direct way). For 3000 lines of text here are the results (remember, this is to process hundreds of millions):

$ time java Do

real    0m4.427s
user    0m4.384s
sys     0m0.040s

$ time groovy do.groovy

real    2m53.303s
user    2m47.650s
sys     0m0.668s

4 seconds against 2mins 53 seconds. This is not serious as it is possible to write all Groovy intensive parts in Java. But, even so, it is too much.

The code? (Afterwards, some speculation and profiling)

Groovy:

...
while ((line = reader.readLine()) != null) {
    lineTok = line.tokenize()
    if (lineTok.size() == colIndiv.size() + 11) {
        for (int i=0; i<indivs.size()-1; i++) {
            int iPos = indivPos[indivs[i]]
            def iCase = lineTok[iPos]
            if (iCase.equals('NN')) continue
            for (int j=i+1; j<indivs.size(); j++) {
                int jPos = indivPos[indivs[j]]
                def jCase = lineTok[jPos]
                if (jCase.equals('NN')) continue
                def isDifferent = true
                iCase.each {
                    if (jCase.contains(it)) {
                        isDifferent = false
                    }
                }
                if (isDifferent) counts[indivs[i] + indivs[j]] += 1
            }
        }
    }
}

Java:

...
while ((line = reader.readLine()) != null) {
    String[] lineTok = line.split(" ");
    if (lineTok.length == colIndiv.size() + 11) {
        for (int i=0; i<indivs.size()-1; i++) {
            int iPos = indivPos.get(indivs.get(i));
            String iCase = lineTok[iPos];
            if (iCase.equals("NN")) continue;
            for (int j=i+1; j<indivs.size(); j++) {
                int jPos = indivPos.get(indivs.get(j));
                String jCase = lineTok[jPos];
                if (jCase.equals("NN")) continue;
                boolean isDifferent = true;
                if (jCase.indexOf(iCase.charAt(0)) > -1 ||
                    jCase.indexOf(iCase.charAt(1)) > -1 ) {
                    isDifferent = false;
                }
                if (isDifferent) counts.put(
                    indivs.get(i) + indivs.get(j),
                    counts.get(indivs.get(i) + indivs.get(j)) + 1);
            }
        }
    }
}
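
The comparison and counting step of the inner loop can be pulled out and tried on its own. This is a small sketch, not the benchmarked program: PairCounts, isDifferent and countIfDifferent are hypothetical names, the two-letter strings are made-up sample values, and Map.merge is used here simply as a compact way to write "add 1, starting from 0".

```java
import java.util.HashMap;
import java.util.Map;

public class PairCounts {
    // True when the two-character cases share no character.
    // (The "NN" cases are assumed to be skipped by the caller, as in the loop.)
    static boolean isDifferent(String iCase, String jCase) {
        return jCase.indexOf(iCase.charAt(0)) < 0
            && jCase.indexOf(iCase.charAt(1)) < 0;
    }

    // Add 1 to the counter for this pair of individuals when the cases differ.
    static void countIfDifferent(Map<String, Integer> counts, String pairKey,
                                 String iCase, String jCase) {
        if (isDifferent(iCase, jCase)) {
            counts.merge(pairKey, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        countIfDifferent(counts, "ind1ind2", "AA", "TT");   // no shared character: counted
        countIfDifferent(counts, "ind1ind2", "AT", "TT");   // shares T: not counted
        countIfDifferent(counts, "ind1ind2", "CC", "GG");   // counted
        System.out.println(counts.get("ind1ind2"));         // 2
    }
}
```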

I cursorily ran the Java profiler. Though I did not spend much time on it, it seemed that (speculation alert!) Groovy was spending some time on metaclassing/proxying parts. I wonder if the “defs” were making things much slower? Maybe if I had properly typed my loop variables (instead of being lazy and duck typing with def) things would have run smoother. If that is the case, then it is one more reason against duck typing (the others being that explicit types help IDEs and automated code tools, and help with debugging).

There seems to be some competition in the field that can be vaguely defined as “The next Java”(TM).

I don’t know if there will be a “next Java” to start with. Things seem to be shaping up in a way where the JVM is our common interoperability platform and on top of it we have an ecology of JVM-based languages.

I have used Jython quite a lot but have several doubts about it, not only about the current status of Jython (it lags a bit behind CPython); I also dislike Python (when compared with the other languages discussed here). As such I decided to evaluate the others: Scala, Ruby and Groovy.

I have done a couple of small projects in Scala (a prototype DSL for modeling malaria resistance is available here) and JRuby. I am now starting with Groovy, and I think I’ve found my new love. Here I will try to explain why, among Groovy, Scala and JRuby, I have chosen Groovy. To preempt any idea of a religious war, I would like to say I have full respect for Scala and Ruby, which are, with Caml and Prolog, among my favorite languages (for a true crusade and flame ask me for my opinion about Perl or Visual Basic 6 ;) ).

Steven Devijver suggests that Groovy is the language with the most syntactic similarities with Java. I would say that, not only that, but in semantics and everything else, Groovy is the closest language to Java. And that is a good thing. The world (both in programming languages and all the rest) is never revolutionary. Revolutions, on the rare occasions they happen, are either a disaster or not that big a change below the surface. People normally prefer (for good and bad reasons) the path of least short-term pain. Groovy delivers that: almost 0 cost in starting to code coming from a Java background. Most importantly, Groovy does that but still delivers most of the new goodies. This is actually the cornerstone of my argument: path of least pain while delivering the good stuff (in some cases better than the competition, as we will see).

Let me start with the fundamental reasons why I dismiss JRuby (which is, nonetheless, my second option after Groovy). First, I would like to say, very honestly, that the work of the JRuby guys is nothing short of outstanding! But I have 3 problems:

  1. By definition, JRuby is based on Ruby, a language from outside the JVM. That means semantic hurdles and coupling issues between the two worlds (think, e.g., libraries).
  2. Most importantly (but connected with the first point): Typing. I am a bit far away from computing issues these days (I currently work with malaria, so excuse me if I mess up strong/explicit typing and such) but clearly the typing system of Ruby makes life hard for IDEs (think of the IDEs needed to tame those over-engineered Java APIs) and for automated tools around code. Debugging without explicit typing is also a pain in a big program (I actually suffered my first typing-system debug nightmare with Caml, arguably the mother of Scala). Some might say that Scala type inference and Groovy duck typing are also problematic in this respect. While the argument might be correct, both languages have mechanisms to support typical Java explicit/strong typing and as such profit from IDEs and automated analysis tools.
  3. Ugly perlisms. Although I have read somewhere that those might be deprecated in the future.

Ah… Scala… Mats Henricson argues that Scala is the only option because of its elegance regarding multicore computing. I fundamentally disagree with his point: multicore programming is fundamental, but Scala is not really a good solution. Before we get there, let’s talk about other Scala issues.

Type inference. I have some experience with the “mother” of Scala, Caml. Type inference in Caml is really elegant: I don’t remember a single case of it failing and requiring the programmer’s help in discovering the type of a parameter. That is not the case with Scala: several times the compiler seemed to be “lost in translation”. Some might say that this is because of JVM-imposed constraints, but if that is the case then it raises the question of whether it is wise to bring a language with foreign semantics to the JVM, given the ugliness attached to the process.

My biggest peeve? Metaprogramming. I won’t give you my opinion about it because it really doesn’t exist. It is on the Scala wiki in the section “future”. I am sorry, but a 21st century language where metaprogramming is absent can only be described as being in “beta stage”. As a side note, something seems to have been lost in this regard on the way from Lisp to the ML branch of functional programming (no introspection and such), which is a shame (How is Haskell in that respect?).

Ok, multicore computing. This is an area where I have some experience in the JVM: [Shameless plug] I invite you to have a look at my Java Web Start, Jython based, multicore aware evolutionary biology workbench LOSITAN. Furthermore I have written tutorials for the multicore paradigm and bioinformatics:

Bioinformatics, multi-core CPUs and grid computing: Introduction (1/4)


Bioinformatics, multi-core CPUs and grid computing: User perspective (2/4)

Most importantly in this context: Bioinformatics, multi-core CPUs and grid computing: developer perspective (3/4)

Mats argues that Scala Actors and immutable data types provide a simple and elegant solution to the extremely complex problem (I am calling it extremely complex, because I think it really is) of concurrent programming. Immutable data types… Does anyone believe that the hordes of existing Java developers/programmers are ready and willing to make the radical conceptual jump to immutable data types? The change from C++ to Java was minor in terms of semantics, and even the change from C to C++ was much less radical than a change requiring you to “get rid of all variables”. How do you think the majority of programmers will react when you say: “Forget variables”? Moreover, as Scala allows for an imperative style of programming, what do you think most programmers’ idiom will be: imperative or functional? To make things worse, in Scala an immutable is called a “val” and a mutable a “var”. Am I the only one picturing hordes of developers with tight deadlines just swapping L’s for R’s?
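For Java readers, the val/var distinction can be approximated with final and non-final variables. The following is only a rough sketch of the idea, not Scala semantics: the class names ValVersusVar and Point and the method withX are hypothetical, invented for this illustration.

```java
public class ValVersusVar {

    // Hypothetical immutable type: every field is final and there are no setters.
    // "Changing" a value means building a new object.
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        int x() { return x; }
        int y() { return y; }

        // Returns a modified copy and leaves the receiver untouched.
        Point withX(int newX) {
            return new Point(newX, y);
        }
    }

    public static void main(String[] args) {
        final Point original = new Point(1, 2); // roughly a Scala "val"
        Point current = original;               // roughly a Scala "var"

        current = current.withX(5);  // rebinding a "var" is allowed
        // original = current;       // would not compile: a "val" cannot be rebound

        System.out.println(original.x()); // prints 1
        System.out.println(current.x());  // prints 5
    }
}
```

Note how little separates the two declarations: dropping a single keyword turns the first into the second, which is the temptation described above.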

I speak for myself here: in spite of having probably more experience with “immutable” languages (Prolog a lot, Caml a bit) than most developers, when I wrote Scala code, my reasoning was so tainted by “real world” imperative languages that it was really hard to write in a functional dialect. I have the background, enough free time, and the motivation to write functional code, but it was hard to get back in that mindset.

Scala only apparently solves the multi core problem. Give it to a typical developer and he will write imperative code, unless you put a functional zealot behind him (and give the said zealot a strong, resistant whip).

How to address the multicore issue? Clearly we have a problem here. A few ideas:

  • In many applications there is no big need to go multicore. In some cases let’s not try to solve a problem that doesn’t exist in the first place.
  • Many multicore applications can survive very well with simple concurrency management. Not all applications require a PhD in concurrent programming.
  • Scala and the like. For those who can and are willing to go functional, why not? I have nothing against that. My only argument is that it won’t be mainstream.
  • The way of PAIN. Most developers will continue to use old languages and paradigms and SUFFER with them. Only after much suffering will there be motivation to try out new things and, say, endure the pain of learning a new paradigm. That suffering still hasn’t happened; only after this becomes a big problem will there be interest in accepting new solutions.
  • A silver bullet that can be attached to the current programming paradigm. Sometimes it happens. Don’t misunderestimate (silly Bushism intended) the power of a “Black Swan” (a reference to Taleb’s book, where he discusses the impact of unexpected important events).
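As an illustration of the “simple concurrency management” point above, here is a minimal Java sketch, assuming the work splits into independent chunks that share no mutable state. The class SimpleMulticore and the task sumOfSquares are hypothetical stand-ins for real CPU-bound work. Only the standard java.util.concurrent executor API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SimpleMulticore {

    // Hypothetical CPU-bound task: sum the squares of the integers in [from, to).
    static long sumOfSquares(int from, int to) {
        long total = 0;
        for (int i = from; i < to; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Splits [0, n) into independent chunks, runs one task per chunk on a
    // fixed thread pool, and adds up the partial results.
    static long parallelSumOfSquares(int n, int chunks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> partials = new ArrayList<>();
            int chunkSize = (n + chunks - 1) / chunks; // ceiling division
            for (int start = 0; start < n; start += chunkSize) {
                final int from = start;
                final int to = Math.min(n, start + chunkSize);
                Callable<Long> task = () -> sumOfSquares(from, to);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // blocks until that chunk is done
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Both lines should print the same number.
        System.out.println(parallelSumOfSquares(1_000_000, 8));
        System.out.println(sumOfSquares(0, 1_000_000));
    }
}
```

No locks or explicit synchronization appear here because each task works only on its own range. That is the kind of problem where multicore use does not require a PhD in concurrent programming.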

To finalize, I would like to say that I am not sticking with Groovy out of being conservative. Groovy seems to beat the competition in many areas (the biggest example is metaprogramming) and strikes a very good balance between being a “small evolutionary step” and delivering the goodies.

To really finalize, a caveat: my Groovy knowledge is still limited. One of these days you might read a post where I apologize for having written this ;)

Originally posted on Perfect Storm