Instructions for creating TMRCA Phylogenetic charts from DNA information

I have to thank L. David Roper for posting his instructions, and Dean McGee as well for making this easy for us.

created:  September, 2005
last updated:  Feb, 2013

by Dave Hamm,  Franklin,  OH


     Whit's Y-Haplogroup Predictor   by Whit Athey

     Y-DNA Comparison Utility by Dean McGee 

     Y-DNA Comparison Utility, FTDNA Mode   Dean McGee's Y-DNA Utility option for FTDNA mode

     Y-DNA Comparison Utility, 111 FTDNA Mode   Dean McGee's Y-DNA Utility option for FTDNA mode, 111 markers

     PHYLIP   software to generate phylogenetic tree data.

          -  Instructions   for applying PHYLIP to DNA data, by L. David Roper from the ROPER DNA Project

      MEGA   software for generating Phylogenetic and Network Tree graphs from the PHYLIP data.


This is my version of L. David Roper's set of instructions.  I have adapted them for TMRCA and stepped through the process in a bit more detail.
For more detailed instructions, see the documentation that comes with each individual package.

 - Preparation:

  1) Bookmark the Y-Haplogroup Predictor  by Whit Athey and the Y-DNA Comparison utility by Dean McGee (both listed above as the "tools").
  2) Download and install the PHYLIP and MEGA software tools listed above.
  3) Bookmark the Instructions given above from the ROPER DNA Project.

  4) Make sure that your data is in a format acceptable to Dean McGee's Y-DNA Comparison Utility.
       You will also want it in a format acceptable to the PHYLIP software, which will be explained later.

       If you can cut-and-paste directly from the FTDNA web page into Dean McGee's Y-DNA Comparison Utility (and it works), then you can skip this step.

      Many folks can just cut-and-paste the data from the FTDNA page, but others also have data from other vendors or from the National Genographic Project.  If your data is not already formatted for you, you MUST arrange it in a text file following the FTDNA notation:

42370 WmNC    13     22     15     10     13-14     11     14     11     13     11     29     14     8-9     8     11     23     16     20     27     12-14-15-16     11     10     19-21     14     14     16     20     35-36     11     10     11     8     15-16     9     11     10     8     9     9     12     22-25     14     10     12     12     14     8     12     25     20     13     13     11     12     11     11     12     11     32     12     8     17     12     24     27     19     11     11     12     13     11     9     11     11     10     12     12     31     11     13     21     16     11     10     24     15     18     12     26     17     13     15     25     12     23     18     12     14     18     9     12     11
205092 PTN    13     24     14     11     11-14     12     12     13     13     13     29     16     9-10     11     11     25     16     19     30     15-15-16-17     11     12     19-23     16     15     18     15     37-37     12     12    11     9     15-16     8     10     10     8     10     10     12     23-23     18     10     12     12     16     8     12     22     21     14     12     11     13     11     12     13     12     35     15     9     16     12     28     27     19     12     11     13     12     10     9     12     12     10     11     11     30     12     13     24     13     10     10     20     15     20     13     24     18     13     15     24     12     23     18     10     14     17     9     12     11

   - where the first field (as in "42370 WmNC") is the participant User ID number and an abbreviated description. This could be a name and haplotype group, for example. But please remember that later, when we use the PHYLIP program, it expects the first field not to exceed 10 characters. So, be sure not to exceed 10 characters in that first field before the DYS data begins on the line.
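The 10-character limit on the first field can be checked before anything is pasted into the Utility. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the label is separated from the first marker value by a tab or by at least two spaces, as in the sample lines above.

```python
import re

MAX_LABEL = 10  # PHYLIP expects the first field to be at most 10 characters


def split_label(line):
    """Split a haplotype line into (label, marker_values).

    Assumption: the label is separated from the first marker value by a tab
    or by a run of two or more spaces, as in the sample lines.
    """
    parts = re.split(r"\t+|\s{2,}", line.strip())
    return parts[0], parts[1:]


def check_labels(lines):
    """Return (line_number, label) for every label longer than MAX_LABEL."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        label, _ = split_label(line)
        if len(label) > MAX_LABEL:
            problems.append((number, label))
    return problems


sample = [
    "42370 WmNC    13     22     15     10     13-14     11",
    "205092 PTN    13     24     14     11     11-14     12",
    "205092 PTN-Ohio    13     24     14     11     11-14     12",
]
print(check_labels(sample))
```

Only the third sample line is reported, because its label is longer than 10 characters.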

The newer version of McGee's Utility will only accept the FTDNA format with the "minus" ("-") notation for multiple alleles. This is a new format from FTDNA; McGee's older version had no "-" separator.

You can place this format into a notepad file, and later cut-and-paste into McGee's Utility.
Two simple things to remember for the cut-and-paste:

1) You must save the file in notepad and exit before opening it again for cut-and-paste.
2) McGee's Utility expects each line to start with a kit number or ID. So, after cut-and-paste, check that each line in McGee's Utility begins with a kit ID.

  -  Some of your data may have a mixture of 37 marker results, 67 marker results, and 111 marker results. So here, be sure to wrap your line if it exceeds the width of the text editor window.


a) The first field should contain 10 characters, and can be padded with spaces. If you are dealing with data from FTDNA, remove the column for the SNP haplotype group.

b) McGee's new Y-DNA 111 marker version of the Utility expects to see the "new" data format from FTDNA. That format now places the palindromes (YCAIIa/b, CDYa/b, DYS464a/b/c/d, etc) into one column, with the values separated by the minus sign ("-"). This means the new data format from FTDNA will contain minus signs instead of the old space (" ") character between some of the data. For the data from FTDNA that is separated by the minus sign, McGee's Utility no longer accepts the space (or tab) character. So, be sure to include the minus sign characters if you are using the new Y-DNA 111 marker mode.
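If you have older data with one value per column, the multi-copy markers can be joined with minus signs by a short script. This is only a sketch: the function name and the GROUPS_12 table are hypothetical, and the table covers just the first 12 FTDNA markers, where DYS385a/b sit in the 5th and 6th columns. For longer panels you would have to add the starting column and copy count of each multi-copy marker yourself.

```python
def join_multicopy(values, groups):
    """Join multi-copy marker values with hyphens.

    values: allele values in the old layout, one value per column.
    groups: {start_index: copy_count}, giving where each multi-copy marker
            starts (0-based, in the old layout) and how many columns it spans.
    """
    joined = []
    i = 0
    while i < len(values):
        width = groups.get(i, 1)  # single-copy markers span one column
        joined.append("-".join(values[i:i + width]))
        i += width
    return joined


# Hypothetical layout table for the first 12 FTDNA markers only:
# DYS385a/b start at index 4 and span 2 columns.
GROUPS_12 = {4: 2}

old = "13 22 15 10 13 14 11 14 11 13 11 29".split()
print(" ".join(join_multicopy(old, GROUPS_12)))
```

The label field (kit number and description) should be split off before calling the function and put back in front of the joined values afterwards.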

   Cut-and-paste this text data into a text document ("filename.txt") for later use.
   I suggest calling this raw data file something like   SURNAME_to_Paste_Into_McGee.txt

 - Procedure:

  5)  Next,

      - Visit Dean McGee's Y-DNA Utility on the internet.
      - select:

                 Y-DNA Comparison Utility, YSearch Mode   for Y-Search mode

                 Y-DNA Comparison Utility, FTDNA Mode   for FTDNA mode

                 New: BETA Y-DNA Comparison Utility: 111 Allele  for the new version that accepts the FTDNA hyphens ("-").

                 If you need to cover special cases (such as DYS464e), then you will need to modify the "Marker exists" boxes, and include some character in your data (such as minus "-" ) where DYS464e does not exist.  So, in some cases, you may need to pay attention to what is selected for the marker list here as well as what is in your data. In general, his utility will accept a "minus" for data that does not exist.

      - select 
                    - LEFT column                                                                                      MIDDLE column

                  For Genetic Distance, use "Hybrid" mutation model                          - de-select  "modal haplotype"

                  Probability:       95 %                                                                                  -  select   "Show Mutation Rates"
                  under TMRCA, select "Generate PHYLIP data"
                  mutation rate:   FTDNA  at 0.004 
                  For "Units" select "Years"     - and type in "25"  years/generation

       Cut-and-paste the data into Dean McGee's Y-DNA Comparison Utility.

       If you have pasted correctly, each line should start on the left side of the input box with the kit number or name.
      - click on "execute"

     If there are any errors (in the "Debug" window), then check your data very closely, or go back to step 4). 
     If you have problems, his instructions can be found at the bottom of the Utility.

     If you are interested in experimenting with individual marker mutation rates, the new 111 marker version lets you "Execute Setup" for that purpose. (A normal "Execute" will generate the default setup for you, as the "Show Setup" button is enabled by default.)

     If you have a large file, you may observe that the first line of Dean McGee's Y-DNA Utility will give a "Status" box, and will indicate the number of lines that are being analyzed.  (That line count will automatically be plugged in as the count of the total number of lines in the next step.)  For very large files, you may want to de-select things that you do not need, such as "Generate Fluxus Data" and/or "Show Mutation Rates."

If you are prompted with a popup about system resources, click on "continue" to enable the program to keep on running.

  6)  When you have a successful run from Dean McGee's Y-DNA Utility,

         - scroll down to the very bottom of the page to the section titled   "Time to Most Recent Common Ancestor (Years)"

         -  Select all of the data in the "PHYLIP compatible TMRCA table."

         - select "copy" from the "Edit" menu item.
         - paste this into a new text file called   infile_SURNAME_95.txt, using notepad.

         - Then, save and exit notepad.

 - Convert the data to phylogenetic tree format:

  7)  Copy the "infile_SURNAME_95.txt"  to your PHYLIP executable area.
        (You should have the PHYLIP package saved to a folder on your computer disk.)


              The "distance.html" documentation that accompanies the PHYLIP package explains:

              FITCH and the Neighbor-Joining option of NEIGHBOR fit a tree which has the branch lengths unconstrained.
              KITSCH and the UPGMA option of NEIGHBOR, by contrast, assume that an "evolutionary clock" is valid,
              according to which the true branch lengths from the root of the tree to each tip are the same: the expected
              amount of evolution in any lineage is proportional to elapsed time.

              So, basically, for TMRCA, we would want to use either the:

                   -  "neighbor" program with the "UPGMA" option, or
                   -  "kitsch" program with the default "Fitch-Margoliash" option.

              ( I use the "kitsch" program.)
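To make the "evolutionary clock" idea more concrete, here is a small sketch of the UPGMA method in Python. It is an illustration only, not the code used by PHYLIP's "neighbor" program, and it is not the Fitch-Margoliash method used by "kitsch". The function name and the three-sample distance table are made up for the example.

```python
def upgma(names, dist):
    """Build a UPGMA tree and return it as a Newick string.

    names: list of tip labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of tips.
    """
    # Each cluster is keyed by its Newick text and stores (size, height).
    clusters = {name: (1, 0.0) for name in names}
    d = dict(dist)

    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        pair = min(d, key=d.get)
        a, b = sorted(pair)
        size_a, height_a = clusters[a]
        size_b, height_b = clusters[b]
        height = d[pair] / 2.0  # the new node sits halfway between the two

        merged = "({}:{:.1f},{}:{:.1f})".format(
            a, height - height_a, b, height - height_b)

        # Distance from the merged cluster to each other cluster is the
        # size-weighted average of the distances from its two members.
        new_d = {}
        for other in clusters:
            if other in (a, b):
                continue
            da = d[frozenset((a, other))]
            db = d[frozenset((b, other))]
            new_d[frozenset((merged, other))] = (
                (size_a * da + size_b * db) / (size_a + size_b))
        # Keep the distances that do not involve either merged cluster.
        for p, value in d.items():
            if a not in p and b not in p:
                new_d[p] = value

        d = new_d
        del clusters[a], clusters[b]
        clusters[merged] = (size_a + size_b, height)

    return next(iter(clusters)) + ";"


# Made-up distances between three samples.
names = ["A", "B", "C"]
dist = {
    frozenset(("A", "B")): 100.0,
    frozenset(("A", "C")): 300.0,
    frozenset(("B", "C")): 300.0,
}
print(upgma(names, dist))
```

In the resulting tree, the path from the root to A (100 + 50), to B (100 + 50), and to C (150) all have the same length. That equal root-to-tip length is the clock assumption described in the PHYLIP documentation quoted above.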


        - Run the "kitsch" program within the PHYLIP package.
             You could have alternatively run the "neighbor" program, but L. David Roper indicates that "kitsch" is more accurate.
             For the "neighbor" program, L. David Roper suggests that you use the
                       UPGMA option (Unweighted Pair Group Method with Arithmetic Mean)

             For the "kitsch" program, take the  "Fitch-Margoliash" method.

        -  When prompted for an input file name, indicate the input file to be:    infile_SURNAME_95.txt
        -  When prompted for an output file name,
                   -  type "f"  to write to a new File
                   -  When prompted for a new file name, indicate:     outfile_SURNAME_95.txt

        -  For the "kitsch" program, take the "D" option for the "Fitch-Margoliash" Method
           For the "neighbor" program, select the "UPGMA" Method.

        -  Select option  "L"  for "Lower triangular data matrix"

        -  Select option  "J"  for "Randomize input order of species"

                   - when prompted for a "Random number seed (must be odd)," enter:    9
                   - when prompted for "Number of times to jumble"  enter:
                                                                                  for small sets of data (around 20), use:           99
                                                                                  for large sets of data, use:                                11

                    The number of jumbles will affect the length of time it takes to execute large sets of data.
                     A lower number for large data sets will mean less computer time required.
                     For example, 99 jumbles on 150 lines of data could take over 6 hours to execute on a 3 GHz computer.

                     For example, 11 jumbles on 150 lines of data could take over 1 hour to execute on a 3 GHz computer.
                    A larger number of jumbles for smaller sets of data will increase the accuracy.
        -  Type "y" to start the calculations.

        - when prompted for an "outtree" file name, 
                     - type "f" to write to a new file
                     - enter:             outtree_SURNAME_95.txt

                  If you get the error similar to:
                                                   diagonal element of row 3 of distance matrix is not zero.
                                                   Is it a distance matrix?
                  then, you probably either forgot to remove the "modal" column from the input data,
                      or forgot to enter the number of lines in the data.

                  Exit the kitsch program and go back to Step (6).

                  If you get the error similar to:
                                                   end-of-line or end-of-file in middle of species name for species xxx

                then, your line count is probably wrong.
                Exit the kitsch program and go back to Step (6).
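Both errors can often be caught before running kitsch by checking the input file. The sketch below makes some assumptions: the file follows the usual PHYLIP distance-matrix layout (row count on the first line, then one row per line, each starting with a name field 10 characters wide), and no row is wrapped across several lines. The function name is made up for the example.

```python
def check_phylip_infile(text):
    """Return a list of problems found in a PHYLIP distance-matrix file.

    Assumes the row count is on the first line and that every following
    line is one row starting with a name field 10 characters wide.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]
    try:
        declared = int(lines[0].split()[0])
    except ValueError:
        return ["first line does not start with a row count"]

    problems = []
    rows = lines[1:]
    if declared != len(rows):
        problems.append(
            "first line says {} rows, file has {}".format(declared, len(rows)))

    for number, row in enumerate(rows, start=1):
        # Everything after the 10-character name field should be numeric.
        for value in row[10:].split():
            try:
                float(value)
            except ValueError:
                problems.append(
                    "row {}: '{}' is not a number".format(number, value))
    return problems


good = "3\n42370 WmNC\n205092 PTN  350\nKit3        400  250\n"
bad = "4\n42370 WmNC\n205092 PTN-Oh  350\nKit3        400  250\n"
print(check_phylip_infile(good))
print(check_phylip_infile(bad))
```

The second sample reports both a wrong line count and a name that runs past 10 characters, since the overflow ("-Oh") lands where a number is expected.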

         If the data has not been sorted into groups, this can take a while to run for large sets of data.

 - View the "tree" formatted output:

   8)   Rename the output file "outtree_SURNAME_95.txt"  to "outtree_SURNAME_95.tre"
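If you do this often, the rename can be scripted. The following is a sketch with made-up function names. It assumes the outtree file holds a tree in Newick format (parentheses, closed by a semicolon), which is what PHYLIP writes, and it makes a copy rather than renaming so that the original output is kept. The check is rough and would be fooled by labels that contain parentheses.

```python
import shutil
from pathlib import Path


def looks_like_newick(text):
    """Rough check: balanced parentheses and a terminating semicolon."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" appeared before its "("
    return depth == 0 and text.strip().endswith(";")


def make_tre_copy(outtree_path):
    """Copy an outtree text file to the same name with a .tre extension."""
    source = Path(outtree_path)
    if not looks_like_newick(source.read_text()):
        raise ValueError("{} does not look like a Newick tree".format(source))
    target = source.with_suffix(".tre")
    shutil.copyfile(source, target)
    return target


print(looks_like_newick("((A:50.0,B:50.0):100.0,C:150.0);"))
```

Calling make_tre_copy("outtree_SURNAME_95.txt") from the PHYLIP folder would then leave outtree_SURNAME_95.tre next to the original file.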

   9)   If you have MEGA installed, then you can simply double click on the file name.
          Otherwise, start MEGA and select your   *.tre  file.

  10)  In order to copy your phylogenetic tree from MEGA into a paint package:
           Cut-and-paste from the MEGA software package by selecting:

          - image from the main menu items
          - select "Copy to clipboard"

           Paste this into your favorite paint package and save it off in JPEG format. I typically use mspaint using the window tool.

      Documentation for the PHYLIP package can be found here:   http://evolution.genetics.washington.edu/phylip/phylip.html


COCKERHAM DNA TMRCA phylogenetic tree

COCKERHAM DNA network tree


