HAM Surname DNA Project

Research through Genetics



Instructions for creating TMRCA Phylogenetic charts from DNA information

I have to thank L. David Roper for posting his instructions,

and Dean McGee as well, for making this easy for us!

September, 2005
by Dave Hamm, Novi, MI


TOOLS:

     Whit's Y-Haplogroup Predictor   by Whit Athey

     Y-DNA Comparison Utility by Dean McGee (for the classical view).

     Y-DNA Comparison Utility by Dean McGee for the extended number of markers (i.e., 67) from FTDNA.

     PHYLIP   software to generate phylogenetic tree data.

          -  Instructions   for applying PHYLIP to DNA data, by L. David Roper from the ROPER DNA Project

      MEGA   software for generating Phylogenetic and Network Tree graphs from the PHYLIP data.
  

Instructions:

This is my version of L. David Roper's set of instructions.  I have adapted them for TMRCA, and stepped through the procedure in a bit more detail.
For more detailed instructions, see the documentation that comes with each individual package.

 - Preparation:

  1) Bookmark the Y-Haplogroup Predictor  by Whit Athey and the Y-DNA Comparison utility by Dean McGee (both listed above as the "tools").
  2) Download and install the PHYLIP and MEGA software tools listed above.
  3) Bookmark the Instructions given above from the ROPER DNA Project.

  4) Make sure that your data is in a format acceptable to Dean McGee's Y-DNA Comparison Utility.
       You might as well make the format acceptable to the PHYLIP software too, since it will be used later.

       If you can cut-and-paste directly from the FTDNA web page into Dean McGee's Y-DNA Comparison Utility (and it works), then you can skip this step.

      Many folks can just cut-and-paste the data from the FTDNA page, but others have data from other vendors or from the National Genographic Project.  If your data is not already formatted for you, you MUST arrange it in a text file with the following format:

23651    13    23    17    9    12    12    11    13    13    14    11    29    16    8    9    11    11    24    15    21    29    11    14    14    15
23921    13    25    14    11    11    13    12    12    12    13    14    30    18    9    10    11    11    25    15    18    30    15    16    16    17    11    12    19    23    17    16    17    17    37    37    12    12
 
   - where the first field (as in "23651") is the participant User ID number. This could be a name and haplotype group, for example. But please remember that later, when we use the PHYLIP program, it expects the first field not to exceed 10 characters. So, be sure to keep that first field, before the DYS data begins on the line, to 10 characters or fewer.

  -  You will notice that the above data has 25 marker results for user ID "23651" whereas User ID number "23921" has a 37 marker test. So here, be sure to wrap your line if it exceeds the width of the text editor window.

    I usually call this raw data file something like   SURNAME_raw_data.txt
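
    The format rules in step 4 can be checked before anything is pasted into the utility. Here is a minimal Python sketch, assuming one participant per line, fields separated by spaces or tabs, and whole-number marker values (with "-" allowed where a marker does not exist). The name check_raw_line is my own and is not part of any of the tools listed above.

```python
def check_raw_line(line, max_id_len=10):
    """Split one raw-data line into (user_id, marker_values).

    Raises ValueError when the line breaks one of the assumed rules:
    an ID of at most max_id_len characters, followed by marker values
    that are whole numbers or "-" for a missing marker.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected an ID followed by at least one marker value")
    user_id, values = fields[0], fields[1:]
    if len(user_id) > max_id_len:
        raise ValueError(f"ID '{user_id}' is longer than {max_id_len} characters")
    for value in values:
        if value != "-" and not value.isdigit():
            raise ValueError(f"'{value}' is not a marker value")
    return user_id, values


if __name__ == "__main__":
    # The first few markers of the sample line shown in step 4.
    user_id, values = check_raw_line("23651    13    23    17    9    12")
    print(user_id, len(values), "markers")
```

    Running every line of SURNAME_raw_data.txt through a check like this catches an over-long first field early, before PHYLIP complains about it in step 7.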

 - Procedure:

  5)  Cut-and-paste the data into Dean McGee's Y-DNA Comparison Utility.

      - Visit Dean McGee's Y-DNA Utility on the internet.
      - select:

                 Y-DNA Comparison Utility, FTDNA Mode   for Y-Search mode

                 Y-DNA Comparison Utility, FTDNA Mode   for FTDNA mode

                 If your limit is 67 FTDNA markers, then under the "Generate Tables" box, type in "67" for the "Max alleles."
                 If you need to cover special cases (such as DYS464e), then you will need to modify the "Marker exists" boxes, and include some character in your data (such as minus "-" ) where DYS464e does not exist.  So, in some cases, you may need to pay attention to what is selected for the marker list here as well as what is in your data.

      - select 
                  LEFT column:
                    - For Genetic Distance, use "Infinite allele mutation model"
                    - Probability:  95 %
                    - under TMRCA, select "Generate PHYLIP data"
                    - mutation rate:  FTDNA at 0.004
                    - For "Units" select "Years", and type in "25" years/generation

                  MIDDLE column:
                    - de-select "modal haplotype"
                    - select "Show Mutation Rates"

      - click on "execute"

     If there are any errors (in the "Debug" window), then check your data very closely, or go back to step 4). 
     If you have problems, his instructions can be found at the bottom of the Utility.

     If you have a large file, you may observe that the first line of Dean McGee's Y-DNA Utility will give a "Status" box, and will indicate the number of lines that are being analyzed.  (That line count will automatically be plugged in as the count of the total number of lines in the next step.)  For very large files, you may want to de-select things that you do not need, such as "Generate Fluxus Data" and/or "Show Mutation Rates."

If you are prompted with a popup about system resources, click on "cancel" to enable the program to keep on running.
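
     To give a feel for how the settings in step 5 work together (infinite allele model, 95 % probability, a mutation rate of 0.004, and 25 years per generation), here is a rough Python sketch. It is NOT the code inside Dean McGee's utility. It assumes the infinite-alleles likelihood described in Dr. Bruce Walsh's paper (two men match on a marker after t generations with probability exp(-2 * rate * t)), a flat prior on t, and a simple numerical sum. The function names are my own, and multi-copy markers get no special handling.

```python
import math

def genetic_distance(h1, h2):
    """Infinite allele model: every marker that differs counts as one
    mutation, no matter how large the difference is. Only the markers
    that both haplotypes have are compared."""
    shared = min(len(h1), len(h2))
    mismatches = sum(1 for a, b in zip(h1, h2) if a != b)
    return mismatches, shared

def tmrca_upper_bound(n_markers, n_mismatch, mu=0.004, prob=0.95,
                      years_per_gen=25, max_gen=5000.0, step=0.05):
    """Years within which the common ancestor falls with probability
    `prob`, under the assumptions stated in the text above."""
    if not 0 <= n_mismatch < n_markers:
        raise ValueError("need 0 <= n_mismatch < n_markers")

    def weight(t):
        q = math.exp(-2.0 * mu * t)  # chance that one marker still matches
        return q ** (n_markers - n_mismatch) * (1.0 - q) ** n_mismatch

    times = [i * step for i in range(int(max_gen / step) + 1)]
    weights = [weight(t) for t in times]
    total = sum(weights)
    running = 0.0
    for t, w in zip(times, weights):
        running += w
        if running >= prob * total:
            return t * years_per_gen
    return max_gen * years_per_gen

# The two sample haplotypes from step 4 (first 25 markers of each).
H_23651 = [13, 23, 17, 9, 12, 12, 11, 13, 13, 14, 11, 29, 16,
           8, 9, 11, 11, 24, 15, 21, 29, 11, 14, 14, 15]
H_23921 = [13, 25, 14, 11, 11, 13, 12, 12, 12, 13, 14, 30, 18,
           9, 10, 11, 11, 25, 15, 18, 30, 15, 16, 16, 17]

if __name__ == "__main__":
    mismatches, shared = genetic_distance(H_23651, H_23921)
    print(mismatches, "mismatches on", shared, "markers")
    print("exact 25-marker match, 95% bound in years:",
          tmrca_upper_bound(25, 0))
```

     The numbers in the utility's own TMRCA table are the ones to use for the tree. This sketch is only meant to show why more mismatches, a lower mutation rate, or more years per generation all push the estimate further back in time.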

  6)  When you have a successful run from Dean McGee's Y-DNA Utility,

         - scroll down to the very bottom of the page to the section titled   "Time to Most Recent Common Ancestor (Years)"

         -  Select all of the data in the "PHYLIP compatible TMRCA table."

         - select "copy" from the "Edit" menu item.
         - paste this into a new text file called   infile_SURNAME_95.txt, using notepad.

         - Then, save and exit notepad.
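
         Most of the PHYLIP errors described under step 7 come from a bad paste in this step, so it can save time to check the file first. Here is a minimal Python sketch. It assumes the table has the number of rows on its first line, then one row per line with the name in the first 10 columns and the distances after it, each row holding one more distance than the row before (lower-triangular layout). The name check_phylip_infile is my own and is not part of PHYLIP.

```python
def check_phylip_infile(text):
    """Return a list of problems found in a pasted PHYLIP distance table.

    An empty list means the table passed these (assumed) checks.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]
    try:
        declared = int(lines[0].split()[0])
    except ValueError:
        return ["first line should hold the number of rows"]

    problems = []
    rows = lines[1:]
    if declared != len(rows):
        problems.append(f"first line says {declared} rows, found {len(rows)}")

    previous = None
    for n, row in enumerate(rows, start=1):
        # PHYLIP treats the first 10 characters of a row as the name.
        name, values = row[:10], row[10:].split()
        if not name.strip():
            problems.append(f"row {n}: missing name")
        for value in values:
            try:
                float(value)
            except ValueError:
                problems.append(f"row {n}: '{value}' is not a number")
        if previous is not None and len(values) != previous + 1:
            problems.append(f"row {n}: expected {previous + 1} distances, "
                            f"found {len(values)}")
        previous = len(values)
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "    3",
        "23651".ljust(10),
        "23921".ljust(10) + "1200",
        "24001".ljust(10) + "900 1500",
    ])
    print(check_phylip_infile(sample) or "looks fine")
```

         The IDs and distances in the sample are made up for illustration only.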

 - Convert the data to phylogenetic tree format:

  7)  Copy the "infile_SURNAME_95.txt"  to the PHYLIP executable area.

        DISCUSSION:

              The "distance.html" documentation that accompanies the PHYLIP package explains:

              FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the branch lengths unconstrained.
              KITSCH and the UPGMA option of NEIGHBOR, by contrast, assume that an "evolutionary clock" is valid,
              according to which the true branch lengths from the root of the tree to each tip are the same: the expected
              amount of evolution in any lineage is proportional to elapsed time.

              So, basically, for TMRCA, we would want to use either the:

                   -  "neighbor" program with the "UPGMA" option, or
                   -  "kitsch" program with the default "Fitch-Margoliash" option.
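
              To show what the "evolutionary clock" assumption means in practice, here is a small Python sketch of the UPGMA idea: keep joining the two closest groups, and place each join at half the distance between them, so every tip ends up the same distance from the root. This is a textbook illustration written for this page, not PHYLIP's code, and the function name upgma and the three-person distance table are made up for the example.

```python
def upgma(names, dist):
    """Build a tree from a full, symmetric distance matrix.

    Returns the tree as a Newick string, the same kind of text that
    PHYLIP writes to its "outtree" file.
    """
    n = len(names)
    # Each cluster: (Newick label, number of tips, height above the tips).
    clusters = {i: (names[i], 1, 0.0) for i in range(n)}
    # Distances between active clusters, keyed as (larger id, smaller id).
    d = {(i, j): float(dist[i][j]) for i in range(n) for j in range(i)}

    def get(a, b):
        return d[(max(a, b), min(a, b))]

    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)      # the closest pair of clusters
        height = d[(a, b)] / 2.0      # the join sits halfway between them
        (la, sa, ha), (lb, sb, hb) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is
        # the average over all tips in the two merged clusters.
        merged = {c: (sa * get(a, c) + sb * get(b, c)) / (sa + sb)
                  for c in clusters}
        d = {pair: v for pair, v in d.items()
             if a not in pair and b not in pair}
        for c, v in merged.items():
            d[(next_id, c)] = v
        label = f"({la}:{height - ha:g},{lb}:{height - hb:g})"
        clusters[next_id] = (label, sa + sb, height)
        next_id += 1

    label, _, _ = next(iter(clusters.values()))
    return label + ";"


if __name__ == "__main__":
    names = ["A", "B", "C"]
    dist = [[0, 2, 6],
            [2, 0, 6],
            [6, 6, 0]]
    print(upgma(names, dist))
```

              In the printed tree, the branch lengths from the root down to A, B, and C all add up to the same total, which is exactly the clock assumption. For real data, use "kitsch" or "neighbor" as described in the procedure.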

           PROCEDURE:

        - Run the "kitsch" program within the PHYLIP package.
             You could alternatively run the "neighbor" program, but L. David Roper indicates that "kitsch" is more accurate.
             For the "neighbor" program, L. David Roper suggests that you use the
                       UPGMA option (Unweighted Pair Group Method with Arithmetic Mean)

             For the "kitsch" program, take the  "Fitch-Margoliash" method.

        -  When prompted for an input file name, indicate the input file to be:    infile_SURNAME_95.txt
        -  When prompted for an output file name,
                   -  type "f"  to write to a new File
                   -  When prompted for a new file name, indicate:     outfile_SURNAME_95.txt

        -  For the "kitsch" program, leave the "D" option set to the "Fitch-Margoliash" method (the default).
           For the "neighbor" program, select the "UPGMA" method.

        -  Select option  "L"  for "Lower triangular data matrix"

        -  Select option  "J"  for "Randomize input order of species"

                   - when prompted for a "Random number seed (must be odd)," enter:    9
                   - when prompted for "Number of times to jumble"  enter:
                                                                                  for small sets of data (around 20), use:           99
                                                                                  for large sets of data, use:                                11

                    The number of jumbles will affect the length of time it takes to execute large sets of data.
                     A lower number for large data sets will mean less computer time required.
                     For example, 99 jumbles on 150 lines of data could take over 6 hours to execute on a 3 GHz computer.
                     For example, 11 jumbles on 150 lines of data could take over 1 hour to execute on a 3 GHz computer.
                     A larger number of jumbles for smaller sets of data will increase the accuracy.
 
        -  Type "y" to start the calculations.

        - when prompted for an "outtree" file name, 
                     - type "f" to write to a new file
                     - enter:             outtree_SURNAME_95.txt

            NOTE:
                  If you get the error similar to:
                                                   diagonal element of row 3 of distance matrix is not zero.
                                                   Is it a distance matrix?
                 
                  then, you probably either forgot to remove the "modal" column from the input data,
                      or forgot to enter the number of lines in the data.

                  Exit the kitsch program and go back to step 6).

            NOTE:
                  If you get the error similar to:
                                                   end-of-line or end-of-file in middle of species name for species xxx

                then, your line count is probably wrong.
                Exit the kitsch program and go back to step 6).

         If the data has not been sorted into groups, this can take a while to run for large sets of data.

 - View the "tree" formatted output:

   8)   Rename the output file "outtree_SURNAME_95.txt"  to "outtree_SURNAME_95.tre"
   9)   If you have MEGA installed, then you can simply double-click on the file name.
          Otherwise, start MEGA and select your   *.tre  file.
  10)  Copy the tree image out of the MEGA software package by selecting
          - "Image" from the main menu items
          - then "Copy to clipboard"

           Paste this into your favorite paint package and save it off in JPEG format.
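
   If you repeat this for several surnames, the rename in step 8 can be scripted. Here is a minimal Python sketch. It makes a copy with the ".tre" ending (so the original "outtree" file is kept) and first does a rough check that the text looks like a tree: matching parentheses and a semicolon at the end. The function name make_tre is my own, and the check is only a quick sanity test, not a full parser.

```python
from pathlib import Path
import shutil

def make_tre(outtree_path):
    """Copy a PHYLIP outtree text file to the same name ending in .tre.

    Raises ValueError if the file does not look like a tree file.
    Returns the path of the new .tre file.
    """
    src = Path(outtree_path)
    text = src.read_text().strip()
    if text.count("(") != text.count(")") or not text.endswith(";"):
        raise ValueError(f"{src.name} does not look like a tree file")
    dst = src.with_suffix(".tre")
    shutil.copyfile(src, dst)
    return dst


if __name__ == "__main__":
    # File name as used in steps 7 and 8.
    print(make_tre("outtree_SURNAME_95.txt"))
```

   If the check fails, the "kitsch" run probably did not finish, and it is worth going back to step 7 before opening anything in MEGA.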

 - Dave Hamm   September, 2005 (updated October, 2006)

Examples:

COCKERHAM DNA TMRCA phylogenetic tree

COCKERHAM DNA network tree

References: 

    Calculating Time to Most Recent Common Ancestor (TMRCA)  by Dr. Bruce Walsh, University of Arizona, FTDNA's Advisory Board

    Time to Most Recent Common Ancestor    (PDF file) by Dr. Bruce Walsh, University of Arizona.  2001. Published by the Genetics Society of America.  The math behind the theory, for the more curious.

    Haplogroups of the World - Doug McDonald's Map (PDF file) of the distribution of Y-DNA and mtDNA

   Instructions   for applying PHYLIP to DNA data, by L. David Roper from the ROPER DNA Project

