Answering Camin’s Challenge: The Comparison of Known and Inferred Phylogenies

 

David B. Hull, Ph.D.

DeVry Institute of Technology

Pomona, CA

12/13/98

Unfortunately these pages work best under Microsoft's I.E. browser

Author's Address:

David B. Hull,  DeVry Institute of Technology,  901 Corporate Drive,  Pomona, CA 91768

E-mail:  dhull@admin.pom.devry.edu.

The full text of this paper is available at http://minerva.pom.devry.edu/~dhull/aaas/

 

 

Introduction:

Classification is one of the fundamental concerns in evolutionary biology. Much time and energy has been spent on creating “natural” classifications inferred from a wide variety of observed characters The validity of the methods used to create these classifications are rarely tested, although often argued (e.g Hennig, 1966 and Sneath and Sokal, 1973). Since the actual phylogeny of the organisms being classified is almost always unknown (and probably unknowable); several techniques have been used to validate the inferred phylogeny.

Fossil material, when available, can be fitted into the inferred phylogeny. This provides both an absolute time line, and constraints on the possible phylogenetic reconstructions.  Fossil material, is, however, incomplete in both quantity and the types of traits available. Furthermore, the relationships of the fossil material is often itself the subject of heated debate. The reconstruction of the phylogeny of the genus Homo provides ample examples of this problem. 

Another method of validation of an inferred phylogeny, involves demonstrating the internal consistency and robusticity of the inferred phylogeny using a number of different character sets or analytic approaches.  This approach leads to Tukey’s Dictum - “confirmation comes from repetition” (Cooley and Lohnes, 1971:328).  A more formal definition of Tukey’s dictum has been developed under the term “stability” (Jardine and Sibson, 1971) which can be used to quantify the reliance that may be placed on the inferred results (Hull, 1979). Tests of the assumptions of the various mathematical methods used to develop the phylogeny also help validate the inferred structure. In the last five years more powerful statistical techniques using maximum likelihood estimations have become available (Huelsenbeck and Rannala, 1997). These techniques, however, only maximize the accuracy of the inferred tree based on various models of DNA substitutions. They do not necessarily validate the tree itself.

A third method of validating an inferred phylogeny would be to evolve a set of organisms, tracking the exact ancestor - descendent sequence; and then comparing this known phylogenetic tree to an inferred phylogeny derived from the last descendants or “leafs” of the phylogenetic tree. This has been attempted biologically using a few quickly reproducing organisms (Blok and Gibbs, 1995).  Obviously, the organisms must be simple and very quickly evolving, which places serious limitations on this approach. A variation on this approach would be to create a known, but artificial phylogeny, developed according to the principles of evolution. Joseph Camin created such a group of organisms called “Caminalcules” with 29 “recent species” expressly for use in biological classification experiments. Sokal (1966) demonstrated considerable success in classifying these Caminalcules correctly.

This paper takes Camin’s methodology a step further, using a more rigorous mechanisms for producing an artificial phylogeny. A group of artificial life computer programs loosely termed “Tierran”, after the most significant member of the group (Tierra by Thomas Ray, 1991), can generate “genomes” which “evolve” under Darwinian principles.  While Camin’s Caminalcules were evolved by “evolutionary” mechanisms only he himself knew (Sokal, 1966); the artificial life forms created by a “Tierran” system follow clearly defined rules concerning mutation, selection and competition (see appendix A). Furthermore, an exact descendent - ancestor sequence can be logged for the final descendants or “leafs” of the phylogenetic tree, allowing a complete known phylogeny to be developed. The “leafs” can then be analyzed by commonly available phylogenetic programs. The “evolutionary” trees inferred using these programs can then be compared with the trees derived from the known descendent - ancestor data.

 

Alife and Real Life:

Since this paper hinges on the metaphor between artificial life and real life (or wet life), some discussion of the purpose and effectiveness of the Tierran system is in order. First, a metaphor is not an identity! Ray (1991) explicitly created the Tierran system as a metaphor of living systems incorporated in a computer simulation. His intent was to “synthesize rather than simulate life” (Ray, 1996:111). Ray’s view of life goes beyond carbon based life forms found on Earth. He has created a synthetic “Evolution in a bottle” (Ray, 1996:111), which can be used to examine many other examples of life beyond the single instance usually studied by biologists  (Ray, 1997). Artificial life represents the opportunity in Biology that non-Euclidean geometry provided to the generalization of Cartesian space in geometry (Fontana et. al., 1997:324).  It does not invalidate the former space, but challenges the axia from which the space is constructed. Does “life” have to be carbon based? Clearly, not!  Can the synthesis of new life forms outside the classic realms of biology provide insight into the analytic techniques and assumptions use to examine life process? The answer depends on the strength of the synthesis - its ability to redefine core axia within the system, and the rigor with which it is implemented. The Tierran system is without a doubt rigorous, as the source code and algorithms were explicitly engineered to follow known biological principals and can be directly examined and modified by any investigator. It implements the computational equivalents of the key axia of Darwinian evolution: replication, genetic inheritance, mutation, and selection. To simplify the rest of this paper, the terms that are derived from Biology and those used by artificial life programs will be printed without quotes. However, the fact that these terms are metaphors, not identities should always be kept in mind.

 

Materials and Methods:

This study uses two readily available computer programs - MacTierra ( Fraser, 1997) to create an artificial phylogeny and TREECON (Van de Peer, 1996) to infer a phylogeny from the leafs of this artificial phylogeny.. While many other programs are available to evolve artificial life and infer phylogenies, these two were chosen for their rigor, documentation and flexibility. MacTierra consists of a “soup” of dedicated computer memory which is full of “organisms” (critters) each of which is a set of computer instructions. These organisms are processed by simulated computer CPU’s and can replicate themselves in the soup. The critter's computer instructions can be altered by a number of mechanisms simulating cosmic mutation, copy errors and flawed execution. Lastly, a “Reaper” (as in ‘grim’) module “kills”, or removes from the soup, the critters, which have the largest number of program execution errors. Thus a critter is removed from the soup either because it does not reproduce, or it has a large load of non-executable instructions. The results of each evolutionary step are logged in a “genebank”, from which the descendent - ancestor line can be developed, and also contains the exact “genome”, or set of computer instructions, of each critter. The start of all the MacTierra runs studied in this paper begin with a simple, self-replicating genome named “80aab” whose ancestor is “human” since it is programmed by hand (see Appendix A).

The original Tierra as well as MacTierra are specifically designed to simulate simple biological life forms. Ray (1991) borrowed two aspects from the “world of wet life” not usually found in computer code - a pared down set of machine instructions (32, including operands) and “addressing by template”. The small number of instructions was explicitly meant to compare with “the 20 amino acids used to make proteins in real organisms” (Fraser, 1997:1), and the addressing by template mimics the ligand - receptor method of the ‘world of wet life’.  Both of these mechanisms also help avoid the brittleness associated with standard computer code allowing discrete changes in the “genome” to create viable offspring (Fraser, 1997).

A key step in preparing the results from the MacTierra genebank for analysis by TREECON is the conversion of the “Tierran” computer code into a pseudo-protein sequence. This involved a direct mapping of each number and letter (0 ...9, A...F) used in the hexadecimal notation of the Tierran machine code into the letter notation used by most protein analysis programs ( A ...P, except J => Q, O => R, B =>S) (Nieselt-Struwe, 1996). There is no meaning or sense to this mapping, and so care has been taken not to invoke any of the assumptions commonly used in protein sequence analysis which are associated with protein function. This mapping of hexadecimal machine code to protein sequence notation has proven effective in classifying actual DOS computer viruses using real Intel machine code. (Hull, 1995)

TREECON consists of a suite of programs for the construction of evolutionary trees from a variety of molecular biology data(Van de Peer, 1996). TREECON first calculates a distance matrix for all pairs of leafs in the sample. Distance is estimated from the fraction of different pairwise elements between two sequences. Various corrections are available in TREECON to meet different assumptions about substitution rates in the sequence. These were all ignored and the baseline assumption that the rate of substitution is the same for all sites utilized. The resulting distance matrix is then used by the tree construction module to construct the inferred phylogenetic tree. Three basic methods are provided: cluster analysis, transformed distance, and Neighbor-joining. The effects of each of these methods on the inferred tree will be presented in the results. The resulting inferred phylogenetic tree is unrooted.  Since MacTierra begins with a specific, known ancestor (80aab) all the inferred trees from TREECON have the root forced to 80aab.

The MacTierra output must be partitioned and restructured to provide the known data sets for comparison. The genebank containing all the descendent-ancestor data as a simple lookup table and the descendent-genome data were pulled into an EXCEL spreadsheet (MacTierra genebank dumps are designed for analysis in EXCEL). All leafs were then identified ( a leaf is any descendent which was not found on the ancestor side of the descendent-ancestor table, i.e. a genome without descendents). All leafs were then recursively traced to “80aab” using the descendent-ancestor table as a lookup table. This resulted in the known phylogenetic tree. All the leafs’ genomic names were then copied to a separate file which was used to extract their genomes from the MacTierra genebank file and format an input file for TREECON, as described above.

The inferred phylogeny resulting from the TREECON program and the known phylogeny from the trace out of the MacTierra genebank table are presented in several figures. The inferred phylogeny is mounted on the left side and rooted on the left. The known phylogeny is mounted on the right side and rooted on the right. This facing page layout allows the patterns of the inferred and known phylogenies to be compared leaf to leaf.

 Nota bene - the leaf designations are created sequentially within each MacTierra run; thus there is no relationship between leafs of the same name in different MacTierra runs.

 

Results:

Two runs from MacTierra are analyzed: a short run of 50 million instructions and a longer run of 150 million instructions. The results are presented in figures 1 and 2.

Despite the complete loss of functional information involved in mapping machine code instructions into pseudo-protein sequences, the phylogenetic trees produced by TREECON match the know phylogeny of these artificial life organisms well. The general patterns of the evolutionary process are evident. The large group of “hopeful monsters” (Hull, 1973) which managed to reproduce a few times from the root (80aab) and then became extinct are clearly classified in both figure 1 and figure 2. Higher level groups that are tightly clustered due to a successful ancestor with many descendants can also be found in the inferred phylogenies (e.g. group 85aef and 84axv in figure 1 and  77ama, 85aeb and 78aij in figure 2) There is also some evidence of chaining along ancestor -descendent lines in figure 2.(e.g. the critters below 85aeb) Lastly the general patterns of the phylogenetic ranking of the inferred and known trees generally match. Figure 1 shows 7 steps for the known phylogeny and about 6 to 9 major bifurcations for the inferred one (e.g. 7 actual steps for leaf 84bgh and 9 major steps inferred). Likewise figure 2 shows 14 steps for the known phylogeny and 8 to 17 for the inferred one ( e.g. 12 for the leaf 84cpx and 8 inferred).

While the overall fit between the inferred and know phylogenies is good, specific individual fits can be astoundingly bad. 80aiq and 80aie in figure 1, for example are grouped in the inferred phylogeny with the large number of “hopeful monsters”, yet are actually the result of 6 steps in the known phylogeny. On the other hand, 78adt  is one known step from the root, but is classified 9 steps deep in Fig. 1. Likewise, 81acg in figure 2 represents evolution only 1 step from the known root, but is placed 6 steps deep in the inferred phylogeny.

These results appear valid across the three different clustering methods provided by TREECON: neighbor joining, transformation distance, and clustering with UPGMA (see fig. 3)

 

Discussion:

The results indicate that not only do the methods used in TREECON actually work; but also that the critters created by MacTierra are a good metaphor for biological organisms.

The usefulness found in treating computer machine code as pseudo-protein sequences and using methods from molecular biology to classify these as was done with computer viruses (Hull, 1995) is reaffirmed. Data patterns without functional information are apparently adequate to infer valid phylogenies. This strongly supports the basic concepts of Numerical Taxonomy (Sneath and Sokal, 1973). The results also suggest that any purely mathematical classification procedure can be “fooled” in any given instance. This conclusion was also found in the study on computer viruses, in which encryption of the virus code led to misclassification (Hull, 1995). These misclassifications can often be identified by applying Tukey’s dictum, and lead to refinements in the algorithms of the classification procedures. The numerous alternative distance calculations and clustering techniques available in programs such as TREECON are examples of this refinement for protein sequences.

The development of an artificial test group of critters by MacTierra rivals, if not exceeds, the Caminalcules of Camin. Since the critters are produced by a computer program it is possible to generate them much more quickly than the painstaking copy and modification using pen and paper. Furthermore, the use of specific algorithms in artificial life programs like MacTierra provide an exact and public presentation of the methods chosen.  These methods can be evaluated, compared with Darwinian theory, and modified at will. Different original ancestors can be created. Mutation, copy error and flaw rates can be modified. Even the type and severity of selection can be tailored to examine different scenarios.

Finally, there is the question of the validity of a two-part model. Neither part of this model was created with any knowledge of the other.  TREECON was developed solely with the analysis of nucleic and amino acid sequences in mind. MacTierra was developed from Tierra simply to explore principals of artificial life with a metaphor from the biological world. For their results to literally be fitted end-to-end and match with any reasonable degree of accuracy suggests that both parts of the model must be working with the same essential paradigm. Again Tukey’s dictum comes to mind - “confirmation comes from repetition”

 

Conclusions:

The process of mapping the machine code sequences representing the genome of artificial life forms into pseudo-protein sequences for analysis by programs used in molecular biology was highly successful. Overall phylogenetic structure, major subgrouping, and phylogenetic rankings found in the inferred phylogenetic trees match those in the known ancestor - descendent phylogeny.  The finer details of the classifications are, however, not trust-worthy. Individual mappings can be seriously wrong.  The classification algorithms seem to produce consistent results, which suggest the problems exist  in the application of an algorithm to a “biological” process.

The fact that artificial life forms can be meaningfully classified by these biological programs also supports the contention of the creators of artificial life programs that their programs incorporate the essentials of the biological paradigm.

The combination of artificial life programs developing new examples from the biological paradigm, and tested programs used in the analysis of the biological world represent a powerful tool for exploring Life.

 

Acknowledgments:

Computational resources were provided by the Office of Information Technology, Pomona College. This work relies heavily on the programming skills of Simon Fraser for MacTierra and Yves Van de Peer for TREECON.

 


References:

Blok, J. and Gibbs, A.J. (1995).  Molecular systematics of the flaviviruses and their relatives. In Gibbs, A.J., Calisher, C.H. and Garcia-Arenal, F. (Eds.). Molecular Basis of Virus Evolution. Cambridge University Press. Cambridge, U.K.

Cooley, W.W. and Lohnes, P.R. (1971).  Multivariate Data Analysis. John Wiley and Sons, Inc. New York

Fontana, W., Wagner, G., and Buss, L. (1997). Beyond Digital Naturalism. In Langton, C.G. Artificial Life: An Overview. The M.I.T. Press. Cambridge, MA

Fraser, S. (1997).  MacTierra: Program and Documentation. Vers. 1.8.3

Hennig, W.  (1966). Phylogenetic Systematics. University of Illinois Press,  Urbana, Il

Huelsenbeck, J.P. and Rannala, B. (1997). Phylogenetic Methods Come of Age: Testing Hypotheses in an Evolutionary Context. Science. 276:227-232.

Hull, D.B.  (1979). A Craniometric Study of the Black and White Colobus Illiger 1811 (Primates: Cercopithecoidea). American Journal of Physical Anthropology. 51:2:163 -181

Hull, D.B. (1995). Classifying and Naming Computer Viruses: Part Two - The Stoned Family. Virus Bulletin. Oct. 1995

Hull, D.S. (1973). Darwin and his Critics. Harvard University Press, Cambridge, MA.

Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy.  John Wiley and Sons, Inc. New York

Nieselt-Struwe, K. (1996) Reformat: Part of the Statgeom Package. Vers. 1.0

Ray, T. (1991). Tierra Simulator: Documentation. Vers. 2

Ray, T. (1996). An approach to the synthesis of life. In Boden, M. The Philosophy of Artificial Life. Oxford University Press, Oxford.

Ray, T. (1997). An Evolutionary Approach to Synthetic Biology: Zen and the Art of Creating Life.  In Langton, C.G. Artificial Life: An Overview. The M.I.T. Press. Cambridge, MA

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. W. H. Freeman and Company. San Francisco

Sokal R. R. (1966). Numerical Taxonomy. Scientific American. 215:6:106 - 117

Van de Peer, Y. (1996). TREECON:  Program and Documentation. Vers. 3.1


Appendix A

Creating life forms

 

The two techniques used to create the artificial life forms discussed in this paper are radically different, even though intended to model the same phenomenon.  Camin developed new generations of Caminalcules by tracing the drawing of the ancestor to another sheet of paper and then making the desired evolutionary changes in the offspring. (Sokal, 1966) (see Figure 4). No particular methodology for introducing change has been presented, but clearly incremental  differences in morphology were introduced methodically, as well as the inevitable errors in the copying procedure.

The Tierran critters evolve under a far more rigorous structure. Figure 5 presents a schematic for 80aab, the original ancestor created by hand (Ray, 1996). It is simply a self replicating machine, subject to the forces of random mutation, copying errors and execution errors built into the Tierran environment. The ancestor first examines itself to determine where in memory it begins and ends.  It then calculates its size and allocates a block of memory of this size for its daughter. It next calls a copy procedure, which copies the entire ancestral set of instructions into the daughter’s memory location. The ancestor then resets its registers, effectively mimicking cell division, and enters the daughter into the reaper queue. Having completed the reproduction loop, it proceeds to reproduce again.

The use of templates, clearly marks this computer code as different. Ordinary assembly language instructions such as this would jump, call or return to a specific address. Furthermore, the Tierran environment allows for random bit flips in the Tierran soup, as well as copying and execution errors, which can change the ancestors or daughter’s code. Starting from a single instance of 80aab, the soup quickly develops into a rich mix of different genomes. ( Ray, 1996)

Tierran critters become extinct based on their inability to produce viable daughters. As the soup reaches a given saturation (usually 80% full), the critters with less successful reproductive records are eliminated by the reaper function. The soup often appears to stabilize with a highly successful genome with many closely related offspring; which after some time is overturned by a more successful genome.