Answering Camin’s Challenge:
The Comparison of Known and Inferred Phylogenies
David B. Hull, Ph.D.
DeVry Institute of
Technology
Pomona, CA
12/13/98
Unfortunately
these pages work best under Microsoft's I.E. browser
Author's
Address:
David
B. Hull, DeVry Institute of
Technology, 901 Corporate Drive, Pomona, CA 91768
E-mail: dhull@admin.pom.devry.edu.
The
full text of this paper is available at
http://minerva.pom.devry.edu/~dhull/aaas/
Introduction:
Classification
is one of the fundamental concerns in evolutionary biology. Much time and
energy has been spent on creating “natural” classifications inferred from a
wide variety of observed characters The validity of the methods used to create
these classifications are rarely tested, although often argued (e.g Hennig,
1966 and Sneath and Sokal, 1973). Since the actual phylogeny of the organisms
being classified is almost always unknown (and probably unknowable); several
techniques have been used to validate the inferred phylogeny.
Fossil
material, when available, can be fitted into the inferred phylogeny. This
provides both an absolute time line, and constraints on the possible
phylogenetic reconstructions. Fossil
material, is, however, incomplete in both quantity and the types of traits
available. Furthermore, the relationships of the fossil material is often
itself the subject of heated debate. The reconstruction of the phylogeny of the
genus Homo provides ample examples of
this problem.
Another
method of validation of an inferred phylogeny, involves demonstrating the
internal consistency and robusticity of the inferred phylogeny using a number
of different character sets or analytic approaches. This approach leads to Tukey’s Dictum - “confirmation comes from
repetition” (Cooley and Lohnes, 1971:328).
A more formal definition of Tukey’s dictum has been developed under the
term “stability” (Jardine and Sibson, 1971) which can be used to quantify the
reliance that may be placed on the inferred results (Hull, 1979). Tests of the
assumptions of the various mathematical methods used to develop the phylogeny
also help validate the inferred structure. In the last five years more powerful
statistical techniques using maximum likelihood estimations have become
available (Huelsenbeck and Rannala, 1997). These techniques, however, only
maximize the accuracy of the inferred tree based on various models of DNA
substitutions. They do not necessarily validate the tree itself.
A
third method of validating an inferred phylogeny would be to evolve a set of
organisms, tracking the exact ancestor - descendent sequence; and then
comparing this known phylogenetic tree to an inferred phylogeny derived from
the last descendants or “leafs” of the phylogenetic tree. This has been
attempted biologically using a few quickly reproducing organisms (Blok and
Gibbs, 1995). Obviously, the organisms
must be simple and very quickly evolving, which places serious limitations on
this approach. A variation on this approach would be to create a known, but
artificial phylogeny, developed according to the principles of evolution.
Joseph Camin created such a group of organisms called “Caminalcules” with 29
“recent species” expressly for use in biological classification experiments.
Sokal (1966) demonstrated considerable success in classifying these
Caminalcules correctly.
This
paper takes Camin’s methodology a step further, using a more rigorous
mechanisms for producing an artificial phylogeny. A group of artificial life
computer programs loosely termed “Tierran”, after the most significant member
of the group (Tierra by Thomas Ray, 1991), can generate “genomes” which
“evolve” under Darwinian principles.
While Camin’s Caminalcules were evolved by “evolutionary” mechanisms
only he himself knew (Sokal, 1966); the artificial life forms created by a
“Tierran” system follow clearly defined rules concerning mutation, selection
and competition (see appendix A). Furthermore, an exact descendent - ancestor
sequence can be logged for the final descendants or “leafs” of the phylogenetic
tree, allowing a complete known phylogeny to be developed. The “leafs” can then
be analyzed by commonly available phylogenetic programs. The “evolutionary”
trees inferred using these programs can then be compared with the trees derived
from the known descendent - ancestor data.
Alife
and Real Life:
Since
this paper hinges on the metaphor between artificial life and real life (or wet
life), some discussion of the purpose and effectiveness of the Tierran system
is in order. First, a metaphor is not an identity! Ray (1991) explicitly
created the Tierran system as a metaphor of living systems incorporated in a
computer simulation. His intent was to “synthesize rather than simulate life”
(Ray, 1996:111). Ray’s view of life goes beyond carbon based life forms found
on Earth. He has created a synthetic “Evolution in a bottle” (Ray, 1996:111),
which can be used to examine many other examples of life beyond the single
instance usually studied by biologists
(Ray, 1997). Artificial life represents the opportunity in Biology that
non-Euclidean geometry provided to the generalization of Cartesian space in geometry
(Fontana et. al., 1997:324). It does
not invalidate the former space, but challenges the axia from which the space
is constructed. Does “life” have to be carbon based? Clearly, not! Can the synthesis of new life forms outside
the classic realms of biology provide insight into the analytic techniques and
assumptions use to examine life process? The answer depends on the strength of
the synthesis - its ability to redefine core axia within the system, and the
rigor with which it is implemented. The Tierran system is without a doubt
rigorous, as the source code and algorithms were explicitly engineered to
follow known biological principals and can be directly examined and modified by
any investigator. It implements the computational equivalents of the key axia
of Darwinian evolution: replication, genetic inheritance, mutation, and
selection. To simplify the rest of this paper, the terms that are derived from
Biology and those used by artificial life programs will be printed without
quotes. However, the fact that these terms are metaphors, not identities should always be kept in
mind.
Materials
and Methods:
This study uses two readily available computer
programs - MacTierra ( Fraser, 1997)
to create an artificial phylogeny and TREECON
(Van de Peer, 1996) to infer a phylogeny from the leafs of this artificial
phylogeny.. While many other programs are available to evolve artificial life
and infer phylogenies, these two were chosen for their rigor, documentation and
flexibility. MacTierra consists of a “soup” of dedicated computer memory which
is full of “organisms” (critters) each of which is a set of computer
instructions. These organisms are processed by simulated computer CPU’s and can
replicate themselves in the soup. The critter's computer instructions can be
altered by a number of mechanisms simulating cosmic mutation, copy errors and
flawed execution. Lastly, a “Reaper” (as in ‘grim’) module “kills”, or removes
from the soup, the critters, which have the largest number of program execution
errors. Thus a critter is removed from the soup either because it does not
reproduce, or it has a large load of non-executable instructions. The results
of each evolutionary step are logged in a “genebank”, from which the descendent
- ancestor line can be developed, and also contains the exact “genome”, or set
of computer instructions, of each critter. The start of all the MacTierra runs
studied in this paper begin with a simple, self-replicating genome named
“80aab” whose ancestor is “human” since it is programmed by hand (see Appendix
A).
The
original Tierra as well as MacTierra are specifically designed to simulate
simple biological life forms. Ray (1991) borrowed two aspects from the “world
of wet life” not usually found in computer code - a pared down set of machine
instructions (32, including operands)
and “addressing by template”. The small number of instructions was explicitly
meant to compare with “the 20 amino acids used to make proteins in real
organisms” (Fraser, 1997:1), and the addressing by template mimics the ligand -
receptor method of the ‘world of wet life’.
Both of these mechanisms also help avoid the brittleness associated with
standard computer code allowing discrete changes in the “genome” to create
viable offspring (Fraser, 1997).
A
key step in preparing the results from the MacTierra genebank for analysis by
TREECON is the conversion of the “Tierran” computer code into a pseudo-protein
sequence. This involved a direct mapping of each number and letter (0 ...9,
A...F) used in the hexadecimal notation of the Tierran machine code into the
letter notation used by most protein analysis programs ( A ...P, except J =>
Q, O => R, B =>S) (Nieselt-Struwe, 1996). There is no meaning or sense to this mapping, and so
care has been taken not to invoke any of the assumptions commonly used in
protein sequence analysis which are associated with protein function. This
mapping of hexadecimal machine code to protein sequence notation has proven
effective in classifying actual DOS computer viruses using real Intel machine
code. (Hull, 1995)
TREECON consists of a suite of programs for
the construction of evolutionary trees from a variety of molecular biology
data(Van de Peer, 1996). TREECON first calculates a distance matrix for all
pairs of leafs in the sample. Distance is estimated from the fraction of
different pairwise elements between two sequences. Various corrections are
available in TREECON to meet different assumptions about substitution rates in
the sequence. These were all ignored and the baseline assumption that the rate
of substitution is the same for all sites utilized. The resulting distance
matrix is then used by the tree construction module to construct the inferred
phylogenetic tree. Three basic methods are provided: cluster analysis,
transformed distance, and Neighbor-joining. The effects of each of these
methods on the inferred tree will be presented in the results. The resulting
inferred phylogenetic tree is unrooted.
Since MacTierra begins with a specific, known ancestor (80aab) all the
inferred trees from TREECON have the root forced to 80aab.
The
MacTierra output must be partitioned and restructured to provide the known data
sets for comparison. The genebank containing all the descendent-ancestor data
as a simple lookup table and the descendent-genome data were pulled into an
EXCEL spreadsheet (MacTierra genebank dumps are designed for analysis in
EXCEL). All leafs were then identified ( a leaf is any descendent which was not
found on the ancestor side of the descendent-ancestor table, i.e. a genome without
descendents). All leafs were then recursively traced to “80aab” using the
descendent-ancestor table as a lookup table. This resulted in the known
phylogenetic tree. All the leafs’ genomic names were then copied to a separate
file which was used to extract their genomes from the MacTierra genebank file
and format an input file for TREECON, as described above.
The
inferred phylogeny resulting from the TREECON program and the known phylogeny
from the trace out of the MacTierra genebank table are presented in several
figures. The inferred phylogeny is mounted on the left side and rooted on the
left. The known phylogeny is mounted on the right side and rooted on the right.
This facing page layout allows the patterns of the inferred and known
phylogenies to be compared leaf to leaf.
Nota bene - the leaf designations are created
sequentially within each MacTierra run; thus there is no relationship between
leafs of the same name in different MacTierra runs.
Results:
Two
runs from MacTierra are analyzed: a short run of 50 million instructions and a
longer run of 150 million instructions. The results are presented in figures 1
and 2.
Despite
the complete loss of functional information involved in mapping machine code
instructions into pseudo-protein sequences, the phylogenetic trees produced by
TREECON match the know phylogeny of these artificial life organisms well. The
general patterns of the evolutionary process are evident. The large group of
“hopeful monsters” (Hull, 1973) which managed to reproduce a few times from the
root (80aab) and then became extinct are clearly classified in both figure 1
and figure 2. Higher level groups that are tightly clustered due to a
successful ancestor with many descendants can also be found in the inferred
phylogenies (e.g. group 85aef and 84axv in figure 1 and 77ama, 85aeb and 78aij in figure 2) There is
also some evidence of chaining along ancestor -descendent lines in figure
2.(e.g. the critters below 85aeb) Lastly the general patterns of the
phylogenetic ranking of the inferred and known trees generally match. Figure 1
shows 7 steps for the known phylogeny and about 6 to 9 major bifurcations for
the inferred one (e.g. 7 actual steps for leaf 84bgh and 9 major steps
inferred). Likewise figure 2 shows 14 steps for the known phylogeny and 8 to 17
for the inferred one ( e.g. 12 for the leaf 84cpx and 8 inferred).
While
the overall fit between the inferred and know phylogenies is good, specific
individual fits can be astoundingly bad. 80aiq and 80aie in figure 1, for
example are grouped in the inferred phylogeny with the large number of “hopeful
monsters”, yet are actually the result of 6 steps in the known phylogeny. On
the other hand, 78adt is one known step
from the root, but is classified 9 steps deep in Fig. 1. Likewise, 81acg in
figure 2 represents evolution only 1 step from the known root, but is placed 6
steps deep in the inferred phylogeny.
These
results appear valid across the three different clustering methods provided by
TREECON: neighbor joining, transformation distance, and clustering with UPGMA
(see fig. 3)
Discussion:
The
results indicate that not only do the methods used in TREECON actually work;
but also that the critters created by MacTierra are a good metaphor for
biological organisms.
The
usefulness found in treating computer machine code as pseudo-protein sequences
and using methods from molecular biology to classify these as was done with
computer viruses (Hull, 1995) is reaffirmed. Data patterns without functional
information are apparently adequate to infer valid phylogenies. This strongly
supports the basic concepts of Numerical Taxonomy (Sneath and Sokal, 1973). The
results also suggest that any purely mathematical classification procedure can
be “fooled” in any given instance. This conclusion was also found in the study
on computer viruses, in which encryption of the virus code led to
misclassification (Hull, 1995). These misclassifications can often be
identified by applying Tukey’s dictum, and lead to refinements in the
algorithms of the classification procedures. The numerous alternative distance
calculations and clustering techniques available in programs such as TREECON
are examples of this refinement for protein sequences.
The
development of an artificial test group of critters by MacTierra rivals, if not
exceeds, the Caminalcules of Camin. Since the critters are produced by a
computer program it is possible to generate them much more quickly than the
painstaking copy and modification using pen and paper. Furthermore, the use of
specific algorithms in artificial life programs like MacTierra provide an exact
and public presentation of the methods chosen.
These methods can be evaluated, compared with Darwinian theory, and
modified at will. Different original ancestors can be created. Mutation, copy
error and flaw rates can be modified. Even the type and severity of selection
can be tailored to examine different scenarios.
Finally,
there is the question of the validity of a two-part model. Neither part of this
model was created with any knowledge of the other. TREECON was developed solely with the analysis of nucleic and
amino acid sequences in mind. MacTierra was developed from Tierra simply to
explore principals of artificial life with a metaphor from the biological
world. For their results to literally be fitted end-to-end and match with any
reasonable degree of accuracy suggests that both parts of the model must be
working with the same essential paradigm. Again Tukey’s dictum comes to mind -
“confirmation comes from repetition”
Conclusions:
The
process of mapping the machine code sequences representing the genome of
artificial life forms into pseudo-protein sequences for analysis by programs
used in molecular biology was highly successful. Overall phylogenetic
structure, major subgrouping, and phylogenetic rankings found in the inferred
phylogenetic trees match those in the known ancestor - descendent
phylogeny. The finer details of the
classifications are, however, not trust-worthy. Individual mappings can be
seriously wrong. The classification
algorithms seem to produce consistent results, which suggest the problems
exist in the application of an
algorithm to a “biological” process.
The
fact that artificial life forms can be meaningfully classified by these
biological programs also supports the contention of the creators of artificial
life programs that their programs incorporate the essentials of the biological
paradigm.
The
combination of artificial life programs developing new examples from the
biological paradigm, and tested programs used in the analysis of the biological
world represent a powerful tool for exploring Life.
Acknowledgments:
Computational
resources were provided by the Office of Information Technology, Pomona
College. This work relies heavily on the programming skills of Simon Fraser for
MacTierra and Yves Van de Peer for TREECON.
References:
Blok, J. and Gibbs, A.J. (1995). Molecular systematics of the flaviviruses
and their relatives. In Gibbs, A.J., Calisher, C.H. and Garcia-Arenal, F.
(Eds.). Molecular Basis of Virus
Evolution. Cambridge University
Press. Cambridge, U.K.
Cooley, W.W. and Lohnes, P.R. (1971). Multivariate
Data Analysis. John Wiley and Sons, Inc. New York
Fontana, W., Wagner, G., and Buss, L. (1997). Beyond
Digital Naturalism. In Langton, C.G. Artificial
Life: An Overview. The M.I.T. Press. Cambridge, MA
Fraser, S. (1997).
MacTierra: Program and
Documentation. Vers. 1.8.3
Hennig, W.
(1966). Phylogenetic Systematics.
University of Illinois Press, Urbana,
Il
Huelsenbeck, J.P. and Rannala, B. (1997).
Phylogenetic Methods Come of Age: Testing Hypotheses in an Evolutionary
Context. Science. 276:227-232.
Hull, D.B.
(1979). A Craniometric Study of the Black and White Colobus
Illiger 1811 (Primates: Cercopithecoidea). American
Journal of Physical Anthropology. 51:2:163 -181
Hull, D.B. (1995). Classifying and Naming Computer
Viruses: Part Two - The Stoned Family. Virus
Bulletin. Oct. 1995
Hull, D.S. (1973). Darwin and his Critics. Harvard University Press, Cambridge, MA.
Jardine, N. and Sibson, R. (1971). Mathematical Taxonomy. John Wiley and Sons, Inc. New York
Nieselt-Struwe, K. (1996) Reformat: Part of the Statgeom Package. Vers. 1.0
Ray, T. (1991). Tierra
Simulator: Documentation. Vers. 2
Ray, T. (1996). An approach to the synthesis of
life. In Boden, M. The Philosophy of
Artificial Life. Oxford University Press, Oxford.
Ray, T. (1997). An Evolutionary Approach to
Synthetic Biology: Zen and the Art of Creating Life. In Langton, C.G. Artificial
Life: An Overview. The M.I.T. Press. Cambridge, MA
Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy. W. H. Freeman and
Company. San Francisco
Sokal R. R. (1966). Numerical Taxonomy. Scientific American. 215:6:106 - 117
Van de Peer, Y. (1996). TREECON: Program and
Documentation. Vers. 3.1
Appendix A
The two techniques used to create the artificial
life forms discussed in this paper are radically different, even though
intended to model the same phenomenon.
Camin developed new generations of Caminalcules by tracing the drawing
of the ancestor to another sheet of paper and then making the desired
evolutionary changes in the offspring. (Sokal, 1966) (see Figure 4). No
particular methodology for introducing change has been presented, but clearly
incremental differences in morphology
were introduced methodically, as well as the inevitable errors in the copying
procedure.
The Tierran critters evolve under a far more
rigorous structure. Figure 5 presents a schematic for 80aab, the original
ancestor created by hand (Ray, 1996). It is simply a self replicating machine,
subject to the forces of random mutation, copying errors and execution errors
built into the Tierran environment. The ancestor first examines itself to
determine where in memory it begins and ends.
It then calculates its size and allocates a block of memory of this size
for its daughter. It next calls a copy procedure, which copies the entire
ancestral set of instructions into the daughter’s memory location. The ancestor
then resets its registers, effectively mimicking cell division, and enters the
daughter into the reaper queue. Having completed the reproduction loop, it
proceeds to reproduce again.
The use of templates, clearly marks this computer
code as different. Ordinary assembly language instructions such as this would
jump, call or return to a specific address. Furthermore, the Tierran
environment allows for random bit flips in the Tierran soup, as well as copying
and execution errors, which can change the ancestors or daughter’s code.
Starting from a single instance of 80aab, the soup quickly develops into a rich
mix of different genomes. ( Ray, 1996)
Tierran critters become extinct based on their
inability to produce viable daughters. As the soup reaches a given saturation
(usually 80% full), the critters with less successful reproductive records are
eliminated by the reaper function. The soup often appears to stabilize with a
highly successful genome with many closely related offspring; which after some
time is overturned by a more successful genome.







