Categories
Mathematics

PhyloFun

In biology, it can be important to know the evolutionary history of a set of species or other types of biological groups, called taxa. One example of the use of these histories is in epidemiology, where the original source of a virus–say, the corona virus–can be found by constructing the evolutionary history of this virus and a few related viruses. Roughly speaking, the source of the disease can then be traced back to the related virus strain that is most closely related to our virus of interest.

To visualize an evolutionary history, it is common to use a phylogenetic tree. This is a drawing of a branching pattern, where the tips represent the current day taxa. Such trees are already centuries old. Indeed, one of the most famous examples is by the hand of Charles Darwin.

Darwin’s Tree (1837)

One issue with these trees is that they are unable to represent some important evolutionary events, such as horizontal gene transfer and hybridization. In these events, hereditary information is not transferred from parent to child, but between unrelated individuals. These events make the evolutionary history into a jumbled mess of interactions, which cannot be represented by an orderly branching tree. This is where phylogenetic networks come in. These are drawings where the branches are allowed to re-converge after they have split off.

Building the trees

Nowadays, phylogenetic trees are constructed on a daily basis, and there are multiple ways to do this. These methods do have a lot in common, though. They generally start with a set of DNA sequences, one for each of the studied taxa/species. These sequences–long strings consisting of the letters ACTG–are then aligned (try this yourself here). This means they are put next to each other so that parts that corresponding parts of the sequences are put together (this relies on the fact that sequences of different species can be quite similar). Finally, this alignment can be used to give a score to a given candidate phylogenetic tree, where the tree that has the highest score will be assumed to represent the true evolutionary history.

A sequence alignment, taken from https://www.bioss.ac.uk/people/dirk/talks/lectureDougArm0203.pdf.

This may sound straightforward, but there is a big problem! It is nearly impossible to find the tree with the highest score, and it is even harder for networks. This means we need to use tricks to find a high scoring tree or network. One of these tricks is to simply start with an arbitrary network, and then modify it so it becomes a bit better. Then, we take the modified network and change it again, to make it even better. By doing this for long enough, the hope is that we find a very good network (maybe not the best one) that also represents the evolutionary history very accurately.

Modifying trees and networks

The most common type of change made to phylogenetic networks when looking for a good network, is called a rearrangement move. This is an operation that moves one endpoint of a line of the network to another location in the network. If the starting point of the line is changed, then this is called a tail move, and if the endpoint is changed, it is called a head move.

A phylogenetic network and two networks obtained by making small changes to this network. A rearrangement move is applied to the orange line. On the left, we apply a tail move, and, on the right, a head move.

To find good networks, the changes to the tree or network in each step cannot be too large nor too small. If the changes are too large, then we could just as well try an arbitrary new network, instead of making changes to the current one. And if the changes are too small, it may take ages to reach a good network, or it may not even be possible at all. This can happen because the changes are not powerful enough to change the whole structure of the network, but only parts of it. It is thus important to ask whether rearrangement moves are able to transform each network into any other network.

Can you see how to change the first network into the other using head moves and tail moves? Can you also do it with only head moves, or with only tail moves?

For trees, this question has been answered positively, and the mathematical proof is quite simple. For networks, however, the question is much harder to answer, and research into this question has surged recently (e.g., see my research). In general, it seems that the question can also be answered positively for networks, but with a few more caveats than for trees. For example, the two networks that resemble the letter A in the figure above can be changed into one another using head moves, but not by only using tail moves.

Although we know that it is possible to change each network into any other network, we may not know how to do this and how many moves we need minimally. This leads to another question: can we efficiently compute the shortest sequence of moves between any pair of networks? Research for this question is academic uncharted territory, we know nearly nothing about this yet. It seems to be a very hard problem, and it will be exciting to see where new research will take us here.

Try it yourself!

If you want to try this problem yourself, you can start by solving the puzzles in my game Phylofun. This game is based on exactly this research, and you are asked to find short sequences of moves between two given networks. This game has also been a valuable research tool, which led to new techniques and questions regarding rearrangement moves.