BioJava has a parser for the nexus file format. While Nexus supports quite a lot of information (like DNA sequences), the most complex part to process is the phylogenetic tree descriptions based on the Newick format. Below you can find a small tutorial on how to process these.
BioJava’s parser relies on JGraphT to create a representation of the “tree”. The tree is actually really more an acyclic graph than a tree, though some trees are rooted (and therefore trees in the proper sense). Manipulating the JGraphT weighted graph is the complicated part, not really the BioJava interface. Note that JGraphT objects can be easily rendered using the JGraph library (yeah, it is confusing: there is one lib with graph algorithms called JGraphT and another, for vizualisation, called JGraph).
In this small tutorial, we will only try to write a textual representation of a tree.
Imagine this simple nexus file:
#NEXUS
Begin TREES;
tree test1 = (1,2);
tree test4 = (1:0.1,(2:0.2,3:0.3):0.4);
End;
We just want to draw this:
coder@move-on:~/development/biobug/test$ java Test test1.nex test1
Will process file test1.nex tree test1
p0
1: 1.0
2: 1.0
coder@move-on:~/development/biobug/test$ java Test test1.nex test4
Will process file test1.nex tree test4
p0
1: 0.1
p1: 0.4
2: 0.2
3: 0.3
So, tree 1 is composed of nodes (leaves) 1 and 2 and the inner node which was named p1. Tree 2 has distances.
By the way, we will also want to know which trees are in the file.
Lets start!.
So, we start by loading and parsing the file:
import org.biojavax.bio.phylo.io.nexus.*;
[...]
//file is a String with the name of the file to be processed
NexusFileBuilder builder = new NexusFileBuilder();
NexusFileFormat.parseFile(builder, new File(file));
NexusFile nexus = builder.getNexusFile();
Nexus files have several blocks (Taxa, Data, Tree, Set). We are interested in getting the Tree block, lets do a function for that:
TreesBlock getTreeNode(NexusFile nexus) {
Iterator it = nexus.blockIterator();
NexusBlock block;
while(it.hasNext()) {
block = (NexusBlock)it.next();
if (block.getBlockName().equals("TREES")) {
return (TreesBlock)block;
}
}
return null;
}
We get the nexus block iterator and go through it until we find a block whose name is TREES, and return that block.
No that we have the TREES block, lets print the names of all trees:
void printTrees(NexusFile nexus) {
TreesBlock node = getTreeNode(nexus);
Map trees = node.getTrees();
Set keys = trees.keySet();
System.out.println("Trees:");
for (Object obj : keys) {
System.out.println(obj);
}
}
We get a map, where the key is the name of the tree and the value would be the tree as, essentially, a String based (but with some annotated info) representation (not a graph). Now, given a certain name, lets get the graph:
import org.biojava.bio.seq.io.ParseException;
import org.jgrapht.*;
import org.jgrapht.graph.*;
[...]
WeightedGraph<string , DefaultWeightedEdge> getTree(NexusFile nexus, String name)
throws ParseException {
String topNode;
TreesBlock node = getTreeNode(nexus);
WeightedGraph</string><string , DefaultWeightedEdge> graph = node.getTreeAsWeightedJGraphT(name);
topNode = node.getTopNode();
System.out.println("The top node is: " + topNode);
return graph;
}
</string>
Note that getTreeAsWeightedJGraphT will do some parsing, so ParsingException can be raised. Note also that the top node name can be retrieved (in the case of tree test1, that will be named p1). Some considerations: You can change the rules to create internal nodes; if there are clashes of names inner nodes will be renamed (not leaves!).
Regarding the top node, we call it top node and not root node. While from a data structure perspective the tree has a root, from a phylogenetic perspective the tree might be rooted or not (in which case being root has no meaning, and it is really just a simple weighted acyclic graph). How to know if the tree is rooted? Remember the function to get all trees (getTrees)? The value of the map has a method called getRootType. So, to know if is rooted, you need to use that function. Not the best design… but at least works.
Ok, now we just need to print a tree…
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
| static void dump(WeightedGraph<string , DefaultWeightedEdge> graph,
String parent, String node, String depth) {
Set</string><string> verts = graph.vertexSet();
String vertex = "";
for (String candidate : verts) {
if (candidate.equals(node)) {
vertex = candidate;
break;
}
}
System.out.print (depth + vertex);
if (parent != null) {
System.out.print (": " + graph.getEdgeWeight(graph.getEdge(parent, vertex)));
}
System.out.println();
for(DefaultWeightedEdge e: graph.edgesOf(vertex)){
if (graph.getEdgeSource(e).equals(node)) {
dump(graph, vertex, graph.getEdgeTarget(e), " "+ depth);
}
}
}
</string> |
Ok, this is the complicated part. Note the following:
- The complexity has to do with processing graphs
- dump is a recursive function
- Node is synonym with Vertex
- Notice the important bit, if you know that there is a node called “bla”, it is not enough to do graph.containsVertex(“bla”). The answer will probably be false. Remember that one thing is reference (which we have here, i.e. ==) and not content equality (.equals). See below, a remainder
- Finally we go through all edges referencing the current vertex and choose the ones that start on the current one. Again, if the tree is unrooted, the notion of direction does not apply, but it is still good to do a “tree” traversal
We end here.
Regarding the “equal” issue remember that:
System.out.println("a" == new String("a"));
System.out.println("a".equals(new String("a")));
Returns false, true. By this order. This is important when traversing the graph. If you know that the reference is equal (and it is when we getEdgeTarget) than one could use it. If you don’t know (like you pass a String that you have constructed yourself or got from some other place), then one needs to go through the vertex/node list and do a .equals to get the correct vertex.
A small example with all the above, is here, ready to use.