March 23rd, 2008 tiago
I tried to drill down the Groovy performance issue that I had with what is in practice a text processing exercise.
The original code was written in Groovy (and then ported to Java, not the other way around), but as I was in a hurry it was written in idiomatic Java (I am too much of a Groovy newbie to be able to write in idiomatic Groovy if I am in a hurry). Ted Naleid left some great suggestions on how to be more groovyish.
Anyway, I took my original code and tried to understand what was going on, here are my findings.
Replacing duck typing with explicit typing took a minute out (from 4m to 3m).
Converting this
iCase.each {
if (jCase.contains(it)) {
isDifferent = false
}
}
to this
if (jCase.contains(iCase[0]) || jCase.contains(iCase[1])) {
isDifferent = false
}
took 1m10s (from 3m to 1m50s) - This is in a inner loop part.
8 seconds were gained by changing this:
for (int j=i+1; j<indivs .size(); j++) {
into
int iSize = indivs.size()
for (int j=i+1; j<isize ; j++) {
As inline comments you can find how much time each line took in the
following inner loop:
for (int j=i+1; j<isize ; j++) {
int jPos = indivPos[indivs[j]] //~20s
String jCase = lineTok[jPos] //~10s
if (jCase.equals('NN')) continue //~8s
boolean isDifferent = true //2s
if (jCase.contains(iCase[0]) || jCase.contains(iCase[1])) {
isDifferent = false //7s
} //whole if is ~ 30s - 23 condition, 7 assignment
if (isDifferent) counts[indivs[i] + indivs[j]] += 1 //5 secs
}
The only stunning thing is the time lost at indexing String arrays (and maps, but that I can understand).
This text is being written as I was changing and trying things, I gained 20s from
minor changes of which I lost track
. I am currently at 1m30s (down from the
original 4m and comparing with Java’s 4s).
Share and Enjoy:
These icons link to social bookmarking sites where readers can share and discover new web pages.
Posted in groovy, java | 10 Comments »
March 21st, 2008 tiago
I think that language performance (from a speed point of view) is highly overrated. There are many factors that are more important. Well, on the time front alone, developer time is normally more important: The time spent groking code is normally more expensive than the running time of an application. Of course, there are many other important points, too much to enumerate (portability, declarativeness, readability, …). If performance was the fundamental variable, we would all be using assembler.
I am currently doing a bit of code to go over tens of millions of lines of text, while comparing separate columns.
I did a little piece of Groovy code to go over all those lines. The performance results were abysmal, so I decided to do a program in Java (copying the Groovy code to a Java file and converting in a very direct way). For 3000 lines of text here are the results (remember, this is to process hundreds of millions):
$ time java Do
real 0m4.427s
user 0m4.384s
sys 0m0.040s
$ time groovy do.groovy
real 2m53.303s
user 2m47.650s
sys 0m0.668s
4 seconds against 2mins 53 seconds. This is not serious as it is possible to write all Groovy intensive parts in Java. But, even so, it is too much.
The code? (Afterwards, some speculation and profiling)
Groovy:
...
while ((line = reader.readLine()) != null) {
lineTok = line.tokenize()
if (lineTok.size() == colIndiv.size() + 11) {
for (int i=0; i<indivs .size()-1; i++) {
int iPos = indivPos[indivs[i]]
def iCase = lineTok[iPos]
if (iCase.equals('NN')) continue
for (int j=i+1; j<indivs.size(); j++) {
int jPos = indivPos[indivs[j]]
def jCase = lineTok[jPos]
if (jCase.equals('NN')) continue
def isDifferent = true
iCase.each {
if (jCase.contains(it)) {
isDifferent = false
}
}
if (isDifferent) counts[indivs[i] + indivs[j]] += 1
}
}
}
}
Java:
...
while ((line = reader.readLine()) != null) {
String[] lineTok = line.split(" ");
if (lineTok.length == colIndiv.size() + 11) {
for (int i=0; i<indivs .size()-1; i++) {
int iPos = indivPos.get(indivs.get(i));
String iCase = lineTok[iPos];
if (iCase.equals("NN")) continue;
for (int j=i+1; j<indivs.size(); j++) {
int jPos = indivPos.get(indivs.get(j));
String jCase = lineTok[jPos];
if (jCase.equals("NN")) continue;
boolean isDifferent = true;
if (jCase.indexOf(iCase.charAt(0)) > -1 ||
jCase.indexOf(iCase.charAt(1)) > -1 ) {
isDifferent = false;
}
if (isDifferent) counts.put(
indivs.get(i) + indivs.get(j),
indivs.get(i) + indivs.get(j) + 1);
}
}
}
}
I cursorly run the Java profiler, though I did not spent much time on it, it seemed that (speculation alert!) Groovy was spending sometime on metaclassing/proxying parts. I wonder if the “defs” were making things much slower? Maybe if I had properly typed my loop variables (instead of being lazy and def duck typing) things would have ran smoother. If that is the case, then one more reason against duck typing (others being helping the IDEs and automated code tools and for debugging purposes)
Share and Enjoy:
These icons link to social bookmarking sites where readers can share and discover new web pages.
Posted in groovy, java | 8 Comments »
March 11th, 2008 tiago
Cognitive Consonance is a spin off of my original blog, Perfect Storm. Cognitive Consonance is about programming (for science Perfect Storm is still the place to go) from a declarative and agile perspective (high-level programming, domain specific languages, efficiency). I mainly develop under the JVM in Java and Groovy (though Scala and Ruby are also favorites). As I am the developer of the Biopython population genetics’ module, some Python content will appear.
Time and patience permitting I will be cross posting relevant old content from Perfect Storm. Future posts that are relevant to both blogs will also be cross posted.
Share and Enjoy:
These icons link to social bookmarking sites where readers can share and discover new web pages.
Posted in meta | No Comments »