Posts tagged ‘phylogenetics’

BioJava has a parser for the nexus file format. While Nexus supports quite a lot of information (like DNA sequences), the most complex part to process is the phylogenetic tree descriptions based on the Newick format. Below you can find a small tutorial on how to process these.

BioJava’s parser relies on JGraphT to create a representation of the “tree”. The tree is actually really more an acyclic graph than a tree, though some trees are rooted (and therefore trees in the proper sense). Manipulating the JGraphT weighted graph is the complicated part, not really the BioJava interface. Note that JGraphT objects can be easily rendered using the JGraph library (yeah, it is confusing: there is one lib with graph algorithms called JGraphT and another, for vizualisation, called JGraph).

In this small tutorial, we will only try to write a textual representation of a tree.

Imagine this simple nexus file:

#NEXUS

Begin TREES;
	tree test1 = (1,2);
	tree test4 = (1:0.1,(2:0.2,3:0.3):0.4);
End;

We just want to draw this:

coder@move-on:~/development/biobug/test$ java Test test1.nex test1
Will process file test1.nex tree test1
p0
  1: 1.0
  2: 1.0
coder@move-on:~/development/biobug/test$ java Test test1.nex test4
Will process file test1.nex tree test4
p0
  1: 0.1
  p1: 0.4
    2: 0.2
    3: 0.3

So, tree 1 is composed of nodes (leaves) 1 and 2 and the inner node which was named p1. Tree 2 has distances.

By the way, we will also want to know which trees are in the file.

Lets start!.

So, we start by loading and parsing the file:

import org.biojavax.bio.phylo.io.nexus.*;
 
[...]
        //file is a String with the name of the file to be processed
        NexusFileBuilder builder = new NexusFileBuilder();
        NexusFileFormat.parseFile(builder, new File(file));
        NexusFile nexus = builder.getNexusFile();

Nexus files have several blocks (Taxa, Data, Tree, Set). We are interested in getting the Tree block, lets do a function for that:

    TreesBlock getTreeNode(NexusFile nexus) {
        Iterator it = nexus.blockIterator();
        NexusBlock block;
        while(it.hasNext()) {
                block = (NexusBlock)it.next();
                if (block.getBlockName().equals("TREES")) {
                        return (TreesBlock)block;
                }
            }
            return null;
    }

We get the nexus block iterator and go through it until we find a block whose name is TREES, and return that block.

No that we have the TREES block, lets print the names of all trees:

    void printTrees(NexusFile nexus) {
            TreesBlock node = getTreeNode(nexus);
            Map trees = node.getTrees();
            Set keys = trees.keySet();
            System.out.println("Trees:");
            for (Object obj : keys) {
                System.out.println(obj);
            }
    }

We get a map, where the key is the name of the tree and the value would be the tree as, essentially, a String based (but with some annotated info) representation (not a graph). Now, given a certain name, lets get the graph:

import org.biojava.bio.seq.io.ParseException;
import org.jgrapht.*;
import org.jgrapht.graph.*;
 
[...]
 
    WeightedGraph<string , DefaultWeightedEdge> getTree(NexusFile nexus, String name)
    throws ParseException {
        String topNode;
        TreesBlock node = getTreeNode(nexus);
        WeightedGraph</string><string , DefaultWeightedEdge> graph = node.getTreeAsWeightedJGraphT(name);
        topNode = node.getTopNode();
        System.out.println("The top node is: " + topNode);
        return graph;
    }
</string>

Note that getTreeAsWeightedJGraphT will do some parsing, so ParsingException can be raised. Note also that the top node name can be retrieved (in the case of tree test1, that will be named p1). Some considerations: You can change the rules to create internal nodes; if there are clashes of names inner nodes will be renamed (not leaves!).

Regarding the top node, we call it top node and not root node. While from a data structure perspective the tree has a root, from a phylogenetic perspective the tree might be rooted or not (in which case being root has no meaning, and it is really just a simple weighted acyclic graph). How to know if the tree is rooted? Remember the function to get all trees (getTrees)? The value of the map has a method called getRootType. So, to know if is rooted, you need to use that function. Not the best design… but at least works.

Ok, now we just need to print a tree…

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
    static void dump(WeightedGraph<string , DefaultWeightedEdge> graph,
            String parent, String node, String depth) {
        Set</string><string> verts = graph.vertexSet();
        String vertex = "";
        for (String candidate : verts) {
            if (candidate.equals(node)) {
                vertex = candidate;
                break;
            }
        }
        System.out.print (depth + vertex);
        if (parent != null) {
            System.out.print (": " + graph.getEdgeWeight(graph.getEdge(parent, vertex)));
        }
        System.out.println();
        for(DefaultWeightedEdge e: graph.edgesOf(vertex)){
            if (graph.getEdgeSource(e).equals(node)) {
                dump(graph, vertex, graph.getEdgeTarget(e), "  "+ depth);
            }
        }
    } 
</string>

Ok, this is the complicated part. Note the following:

  • The complexity has to do with processing graphs
  • dump is a recursive function
  • Node is synonym with Vertex
  • Notice the important bit, if you know that there is a node called “bla”, it is not enough to do graph.containsVertex(“bla”). The answer will probably be false. Remember that one thing is reference (which we have here, i.e. ==) and not content equality (.equals). See below, a remainder
  • Finally we go through all edges referencing the current vertex and choose the ones that start on the current one. Again, if the tree is unrooted, the notion of direction does not apply, but it is still good to do a “tree” traversal

We end here.

Regarding the “equal” issue remember that:

        System.out.println("a" == new String("a"));
        System.out.println("a".equals(new String("a")));

Returns false, true. By this order. This is important when traversing the graph. If you know that the reference is equal (and it is when we getEdgeTarget) than one could use it. If you don’t know (like you pass a String that you have constructed yourself or got from some other place), then one needs to go through the vertex/node list and do a .equals to get the correct vertex.

A small example with all the above, is here, ready to use.

Here is my first clojure application: A phylogenetic tree viewer (PhyloXML format). The obligatory screenshot:

Simple Phylogenetic viewer

Simple Phylogenetic viewer

Preamble:

  1. This is newbie code: Handle with care! My main objective is not to make a tree viewer but a tree comparer. So this is no more than a learning step.
  2. You can test it yourself as it is a Java WebStart application, just click here. You don’t need to have a phylogenetic tree file yourself. I supply an example inside.
  3. This makes use of JGraph and Archaeopteryx (the PhyloXML parser)

I do maintain this code on github. I have one project for the viewer and another for general utilities. All the code is still very crude, but you might be interesting in stealing some of the swing code, either as a crude example of how to interact with swing or taking my micro-DSL for menus. If you want to interact with JGraph, this might be a starting point. I don’t want, in any way, suggest that this code is any good.

Some lessons that I’ve learned and that I would like to share:

  1. Some of the clojure.contrib code is a bit green. I tried to use the graph library, but it is very small and specific. I ended up starting doing my own. Mine is even smaller and specific, no claims of generality.
  2. I don’t appreciate some of the core functions of Clojure (I’ve written on this before and will write more in the near future). The great thing about Clojure is that you can import only what you want from the core and extend it yourself. I intend to do just that for my personal use. This is a PLUS point for Clojure: the flexibility that is made available to change many of the decisions of the language implementor (in the great tradition of declarative and homoiconic languages)
  3. While I can change the core for my uses, I think defnk should really be core for everybody! I fact I wander if defn should not become defnk…
  4. I am pretty sure that when *warn-on-reflection* is activated and action taken to correct the warnings, lots of code will increase in performance. With the more important side effect of annotating the code with type info.
  5. I have quite a lot of recursive code that doesn’t use recur. Something to learn and master…
  6. JGraph layout algorithms are not fantastic. I’ve tried with much bigger trees and the result was far from perfect (I also noticed performance problems in my own code).

The biggest hurdle that I’ve found was the construction of user interfaces and how verbose Clojure Java interop can become. Of course one can create functions (and that was done) to create buttons, frames, menus, etc. But the creation of Java container structures (think frame contains menubar which contains menu with menus inside and so on) would benefit from a dialect where, when a certain (container) object was created it’s (Java) namespace would become easily available.

Imagine constructing a Structure like this:

MenuBar[
    Menu(File) [
        Menu(New)
        Menu(Close)
        Separator
        Menu
    ]
    Menu(Edit) [
        Menu(Cut)
        Menu(Paste)
    ]
]

it would be nice to be able to write something like:

((new JMenuBar)
   (add ((new JMenu "File")
       (add (new JMenu "New"))
       (add (new JMenu "Close"))
       (addSeparator)
    )
   (add ((new JMenu "Edit")
       (add (new JMenu "Cut"))
       (add (new JMenu "Paste"))
    )
)

“add” and “addSeparator” are Java methods. All this would be dynamic against the Java object hierarchy (not a hand-written library!). Note that there is no doto special form (or variants) and, most importantly, note that, given a list (a b c d), if a is a Java object b c d are evaluated as methods of a. If b is (i (y x s)), x and s would be evaluated as methods of y, if they failed then as methods of a, if this fails interpreted as normal Clojure. Something like this (rough sketch).
This would be useful, e.g., to construct Swing hierarchies by hand in a expedient way (not suggesting anything more, especially not to do big programs with outside scope).
I am going to try to write some code that does this in the next few days.