FT2PHY


by Dave Hamm, Franklin, Ohio  - copyright 2009

                                        

 Converts Family Tree DNA (FTDNA) "repeat" data into "ATGC" format for use in other genetic software utilities.

QUICK START:


From a Windows command line window (Start/run/cmd):


   ft2phy <name of input file>


As in:


   ft2phy     HAM_to_paste_into_McGee.txt


Here, the text file "HAM_to_paste_into_McGee.txt" contains the data that you would cut and paste into Dean McGee's Y-DNA Comparison Utility.

This program differs from the "ft2dna" program in that it accepts a full line of repeat values from a file that is compatible with Dean McGee's Y-DNA Comparison Utility.

The ft2phy program does NOT have a graphical interface (GUI), so you need to run it from a command line window.



Download FT2PHY   version 1.0 zip package   (compiled with GCC for Windows XP)

Package CONTENTS:

This archive should contain the files:

ft2phy.exe
FT2PHY_instructions.html                         (this file)
FT2PHY_revision_history.html
HAM_to_paste_into_McGee_Group01.txt    (example input for ft2phy)
HAM_to_paste_into_McGee_Group02.txt    (example input for ft2phy)
infile_DNAPARS_HAM_Group01.txt            (output from ft2phy, compiled into one file for PHYLIP packages)
infile_DNAPARS_HAM_Group02.txt            (output from ft2phy, compiled into one file for PHYLIP packages)


OVERVIEW:


  Overview for use with PHYLIP "DNAPARS," "CONSENSE," PHYML, Tree Puzzle, LAMARC, etc.


FT2PHY is a program that converts the individual DYS repeat values given by FTDNA into "ATGC" format for use with genetics programs, or with any other program that requires ATGC data in what is known as the "PHYLIP" format.
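
To illustrate the idea behind the conversion, here is a minimal Python sketch. It is not the ft2phy source code. The function name, the flanking sequences, and the GATA motif are hypothetical placeholders chosen only for illustration; the actual sequences that ft2phy uses for each DYS marker are not listed in this document.

```python
# Hypothetical illustration only: the flanks and the motif below are NOT
# the values that ft2phy uses for any real DYS marker.
def repeats_to_atgc(repeat_count, motif="GATA", left_flank="ACT", right_flank="TCC"):
    """Expand a repeat count into an ATGC string: flank + motif * n + flank."""
    if repeat_count < 0:
        raise ValueError("repeat count must not be negative")
    return left_flank + motif * repeat_count + right_flank


if __name__ == "__main__":
    # A marker value of 3 becomes three copies of the motif between the flanks.
    print(repeats_to_atgc(3))
```

Under these assumptions, two kits that differ by one repeat at a marker end up with ATGC strings that differ by one copy of the motif.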


FT2PHY was written primarily for use with LAMARC, but with some editing, the output can be used with any of the genetics programs that require the "PHYLIP" format.


That basic format is usually of the form:


a) The first line usually consists of the number of taxa and the size (number of characters) of each line of the data.
b) The lines containing data have the first ten characters reserved for the  name identifier.
c) The data has been converted into ATGC format.
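
As a rough sketch of points a) through c), the following Python function builds PHYLIP-style text from a list of (name, sequence) pairs. The function name and the sample sequences are made up for illustration. Padding shorter sequences with "?" is an assumption based on the example output shown later in this document.

```python
def to_phylip(records, pad_char="?"):
    """Build PHYLIP-style text from (name, sequence) pairs.

    Assumptions: names are cut or padded to 10 characters, and shorter
    sequences are padded with '?' so that every line has the same length.
    """
    width = max(len(seq) for _, seq in records)
    # a) the first line holds the number of taxa and the number of characters
    lines = [f" {len(records)}  {width}"]
    for name, seq in records:
        # b) the first ten characters are reserved for the name identifier
        # c) the rest of the line is the ATGC data
        lines.append(f"{name[:10]:<10} {seq.ljust(width, pad_char)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [("40777_WmVA", "GATAGATAGATA"), ("N54540_Rob", "GATAGATA")]
    print(to_phylip(sample), end="")
```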



The PHYLIP package contains software to generate phylogenetic tree data.

The PHYLIP web site is located at:


   http://evolution.genetics.washington.edu/phylip.html


LAMARC performs Bayesian or likelihood analysis across multiple genomic regions and can estimate parameters such as recombination rates and migration rates.

The LAMARC web site:

 
  
  http://evolution.gs.washington.edu/lamarc/index.html


MEGA software is used for generating phylogenetic and network tree diagrams from the PHYLIP data.


    http://www.megasoftware.net/

At this time, there is no graphical interface (GUI) for this program.


Installing FT2PHY:



Because ft2phy is run from a command line window, it can be convenient to create a Desktop shortcut that opens a command line window in the FT2PHY folder.


-----------------------------------------------------
To Create an icon for the Windows Desktop:
-----------------------------------------------------

Find the MS-DOS prompt:
From your desktop find the "Start Button" on the lower left corner.
Click on:        Start > Programs > Accessories
The MS-DOS Prompt is in the Accessories Menu.
Right-click the MS-DOS Prompt icon, drag it to your Desktop, and create a shortcut there.

- Right click on the newly created Desktop icon and select "Rename."  Rename it to FT2PHY.
- Right click on the newly created Desktop icon again and select "Properties."
   Under the "Shortcut" tab, there is a box for "Start in:"  _____
   Type the full path to the folder that contains the FT2PHY executable in this "Start in:" box.

     example:     K:\Data\Genealogy\DNA\FT2PHY1.0

- Click on "Apply."

If you wish to change the command line options for font color or background colors, select the "Colors" tab.
(For example, I use a gray background with a blue font.)


When you are done, click on "Apply" then "OK"


RUNNING THE FT2PHY PROGRAM:



-------------------------------------------------------
If you have not created a Desktop icon:
-------------------------------------------------------


To start a "command line window" from Microsoft Windows:

 - click on "START"

 - click on "RUN"

 - type in "cmd" (without the quotes)

 - click on "OK"


 A command line window should appear.

Then use "cd" to change to the directory where the ft2phy program is located. For example:


  cd /D E:\Data\DNA


  When done, you can return to your default directory with:


  cd /D C:


To convert your data, run the program from the command line and provide the name of your data file as an argument:


   ft2phy <name of your data file>


As in:


   ft2phy     HAM_to_paste_into_McGee.txt 


As a reminder, for PHYLIP compatible programs, you will need to copy all 37 DYS marker data files into one file.


Output



ft2phy will produce output into a directory called "ft2phy_files." That directory should contain 37 ATGC data files, one file for each DYS translated (example:   ft2phy_CDYa.phy, ft2phy_CDYb.phy, ft2phy_DYS19.phy, etc.)

It was designed this way because LAMARC is currently my favorite program, and LAMARC's GUI data converter can import these files.


However, most genetic software programs that use PHYLIP compatible input data require the data to be placed in one file. So, for most programs, you will need to create one input file by combining the data from each of the 37 files with a text editor (e.g., Notepad).
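
This document does not spell out the exact layout of the combined file. One common approach is to join the sequences of each kit across the marker files into one long line per kit. The Python sketch below does that, and it is an assumption rather than a description of how the example "infile_DNAPARS" files were built. The function names are made up, and only kits that appear in every marker file are kept.

```python
def parse_phylip(text):
    """Parse sequential PHYLIP text into an ordered {name: sequence} dict."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(x) for x in lines[0].split()[:2])
    records, i = {}, 1
    for _ in range(n_taxa):
        name, seq = lines[i][:10], lines[i][10:].replace(" ", "")
        i += 1
        while len(seq) < n_chars:  # sequence wrapped over several lines
            seq += lines[i].replace(" ", "")
            i += 1
        records[name] = seq
    return records


def merge_markers(texts):
    """Join the sequences of each kit across several per-marker files."""
    parsed = [parse_phylip(t) for t in texts]
    # Keep only kits present in every marker file, in first-file order.
    names = [n for n in parsed[0] if all(n in p for p in parsed)]
    merged = [(n, "".join(p[n] for p in parsed)) for n in names]
    width = len(merged[0][1]) if merged else 0
    out = [f" {len(merged)}  {width}"]
    out += [f"{n:<10} {s}" for n, s in merged]
    return "\n".join(out) + "\n"
```

To use it, read the contents of each file in the "ft2phy_files" directory into a string, pass the list of strings to merge_markers, and write the result to a new file.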

These individual DYS files will be in ATGC PHYLIP format.
 
For example:


 29  253
40777_WmVA ACTACTGAGTTTCTGTTATAGTGTTTTTTAATATATATATAGTATTATATATAGTGTTATATATATATA
GTGTTTTAGATAGATAGATAGGTAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGTGACACTCT
CCTTAACCCAGATGGACTCCTTGTCCTCACTACATGGCCATGGCCCGAAGTATTACTCCTGGTGCCCCAGCCACTATTTC
CAGGTGCAGAGATTGACCAT????
68140_WmVA ACTACTGAGTTTCTGTTATAGTGTTTTTTAATATATATATAGTATTATATATAGTGTTATATATATATA
GTGTTTTAGATAGATAGATAGGTAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATATAGTGACACTCT
CCTTAACCCAGATGGACTCCTTGTCCTCACTACATGGCCATGGCCCGAAGTATTACTCCTGGTGCCCCAGCCACTATTTC
CAGGTGCAGAGATTGACCAT????

 (...etc.)

ft2phy will produce quite a bit of verbose output, which I have used for a few quality checks. Please ignore the verbose output, as the data is stored in files. I will remove the verbose output at a later date.


ft2phy will create a Genetic Distance table in the style of Dean McGee's Y-DNA Comparison Utility. However, the calculated Genetic Distance is not quite correct, and no attempt has been made to improve it. Please use Dean McGee's Utility for a more accurate Genetic Distance table.


ft2phy will try to produce output for each kit that it finds in the data file. However, if a Project member has only tested for 12 markers, that particular kit will ONLY be found in the data files for the first 12 markers. (For example, it will be absent from the files with 25 or 37 markers tested.)


Most genetic programs will have a problem interpreting data that does not have a consistent number of kits. That is, if you have 12 marker kits mixed in with your 37 marker data file, then most genetic software programs will complain about it.


Therefore, it is important to edit your data so that every kit in your input file to the "ft2phy" program has the same number of markers tested.
For best results, only use kits that have been tested for 37 markers, or only use kits that have been tested for 25 markers, and so on. The exception is that you can mix 37 marker data with 67 marker data, because "ft2phy" will ignore the input data beyond 37 markers.
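
This kind of cleanup can also be scripted. The following Python sketch keeps only the lines that carry at least 37 repeat values. It assumes that each line starts with two columns (kit number and name, with no spaces inside the name) followed by the repeat values, as in the sample input shown in this document. The function name is made up for illustration.

```python
def keep_full_kits(lines, markers=37):
    """Return the input lines that have at least `markers` repeat values.

    Assumes two leading columns (kit number and name) before the values.
    Blank lines and kits with fewer markers are dropped.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 + markers:
            kept.append(line.rstrip("\n"))
    return kept


# Example use (the output file name is a placeholder):
# with open("HAM_to_paste_into_McGee.txt") as src:
#     kept = keep_full_kits(src)
# with open("HAM_37_markers_only.txt", "w") as dst:
#     dst.write("\n".join(kept) + "\n")
```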




For data entry into the PHYLIP "DNAPARS" program, the format should look like this:


9  280
40777_WmVA GGCTCATTGTAGTCCACACTTGACAAAAAGTGAGACCCTGTCTGAAAGACAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAGGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAGAGAGAGAGAAAGAAAGAGAGAAAGAAAGAAAGAAAAAAGAAAGGAAGAAAAAGAGAGATATGAGTTGAAATTCC
68140_WmVA GGCTCATTGTAGTCCACACTTGACAAAAAGTGAGACCCTGTCTGAAAGACAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAGGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAGAGAGAGAGAAAGAAAGAGAGAAAGAAAGAAAGAAAAAAGAAAGGAAGAAAAAGAGAGATATGAGTTGAAATTCC
N54540_Rob GGCTCATTGTAGTCCACACTTGACAAAAAGTGAGACCCTGTCTGAAAGACAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAGGAAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAGAGAGAGAGAAAGAAAGAGAGAAAGAAAGAAAGAAAAAAGAAAGGAAGAAAAAGAGAGATATGAGTTGAAATTCC????????

 (... etc.)

----------------------------------------------------------------------------------


GENERAL Procedure:



Convert the FTDNA data into the format for use with Dean McGee's Y-DNA Utility, and place this into a text file.



The FTDNA data for a surname project should be given in this format:


40777 WmVA    13    22    14    10    13    14    11    14    11    14    11    30    14    8    9    8    11    23    16    20    27    12    14    15    16    11    10    19    21    14    14    16    20    36    36    11    10    11    8    15    16    9    11    10    8    9    9     12    22    25    14    10    12    12    14    8    12    25    20    13    13    11    12    11    11    12    11
68140 WmVA    13    22    14    10    13    14    11    14    11    14    11    30    14    8    9    8    11    23    16    20    27    12    14    15    16    11    10    19    21    14    14    16    20    36    36    11    10    11    8    15    16    9    11    10    8    9    9     12    22    25    14    10    12    12    14    8    12    25    20    13    13    11    12    11    11    12    11
N54540 Rob    13    22    14    10    13    14    11    14    11    14    11    30    14    8    9    8    11    23    16    20    27    12    14    15    16    11    10    19    21    14    14    16    20    34    36    11    10

where:


  - the first two columns are the kit number and a short name for the kit;
  - the remaining columns (the "repeats" part of the FTDNA data) are the familiar DYS repeat values.

The program should accept 67 marker data, but will only process up to 37 markers.
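
As a small sketch of how one line of this input could be read, the Python function below splits a line into a 10-character label and a list of repeat values. The label rule (kit number, underscore, name, cut to 10 characters) is inferred from the examples in this document, such as "40777_WmVA", and may not match every case that ft2phy handles. The function name is made up for illustration.

```python
def parse_kit_line(line, max_markers=37):
    """Split one input line into (label, repeats).

    Assumes the line holds a kit number, a name without spaces, and then
    whole-number repeat values. Values beyond `max_markers` are ignored.
    """
    fields = line.split()
    kit, name = fields[0], fields[1]
    repeats = [int(value) for value in fields[2:2 + max_markers]]
    label = f"{kit}_{name}"[:10]
    return label, repeats


if __name__ == "__main__":
    print(parse_kit_line("N54540 Rob    13    22    14    10"))
```

Note that int() will raise an error on a value that is not a whole number, so a line with unusual entries would need extra handling.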

----------------------------------------------------------------------------------







LAMARC


LAMARC can accept files for each of the 37 markers. That is to say, it can accept 37 different files in order to get the information for each individual kit. The LAMARC GUI data converter will accept PHYLIP format files for this. So, if you are using LAMARC, this program should save an enormous amount of time just in data entry.


If you have a large Project, note that ft2phy currently accepts at most 600 lines of data. Ideally, the data file should contain the same number of markers for each kit listed.


Finally, I have not included the data format from other vendors who may test different DYS values than does Family Tree DNA.


The steps to make the output useful involve:


 - If you have stored your data for use with Dean McGee's Y-DNA Comparison Utility, then:


     Run FT2PHY on your McGee input file, that is, the text file (such as "HAM_to_paste_into_McGee.txt") that holds the data you would cut and paste into Dean McGee's Utility.
     This file should only include kits that have been tested for 37 markers or more.


    Example:

 
          ft2phy     HAM_to_paste_into_McGee.txt 

   The data should then be ready for input to LAMARC's GUI data converter. Within the converter, you will need to edit the region names and the data type (DNA), and you will most likely want to merge the data into one population. Then save your data from within the LAMARC GUI converter, and use that saved file to run LAMARC.


   (The "ft2phy" program will not generate an XML file that is compatible with LAMARC.)


 - Lamarc Quick Start:

    There is a pretty good document that comes with LAMARC that walks you through a run. It's helpful to read that.
    You can find it from the LAMARC HTML documentation index under "How to design a LAMARC analysis."

  Here are a few tips:

  - Run the LAMARC GUI converter in order to convert the data into a data format compatible with LAMARC.
  - You need to load at least the mutating markers. A single marker will not yield anything meaningful from LAMARC.
  - Name the "Regions" for each marker loaded by double clicking on each Region.
  - Set the data type to "DNA" in each of the segments by double clicking on each segment box.
  - Merge the data into one "Population" by double clicking on the first population in the upper left hand corner.

  Adding all of the "ft2phy" data files is easily done by using the <shift> key while selecting the "DYSxxx.phy"  file(s) to load from within the file selector. (You can load *ALL* DYS values at the same time.)

Once you get some good data loaded, you will want to save your data file.
From the lam_conv menu, select "File" then "Write Lamarc File."
(It saves the Lamarc compatible data file in XML format.)

You can then exit the converter and run the LAMARC program against the LAMARC XML data file. You can load the data file by running:

 >   lamarc   your_data_file.xml

If you double click on the Lamarc icon, it will prompt you for the input data file.
An alternate way to load the data file from within Lamarc is to use the "i" option to set the name of your input and output files.

I believe that by using the ">" option you can save your settings for each run, as instructed on the Lamarc screen.

After bringing the Lamarc program up, start the run by typing a period "." (without the quotes), as described on the screen. LAMARC should produce the output files named under the "i" option, as displayed on the initial screen.

The options are explained in the LAMARC documentation.

Warning:   For a small group of 9 kits with 37 marker data, Lamarc can take about 2 days to run.



     PHYLIP   software to generate phylogenetic tree data.

     MEGA   software for generating Phylogenetic and Network Tree graphs from the PHYLIP data.


Where to find LAMARC

For example, the data can be used with the Lamarc program if you describe the data as "DNA" in "Phylip" format in the Lamarc GUI data converter. See:


   http://evolution.gs.washington.edu/lamarc/index.html


LAMARC performs Bayesian or likelihood analysis across multiple genomic regions and can estimate parameters such as recombination rates and migration rates. The Lamarc web site also has the packages Migrate, Coalesce, Fluctuate, and Recombine. Recombine needs mapping information to be reliable, but you can run the ATGC conversion data as "DNA" within the Lamarc program, for example by using these options within the "gui_lam_conv" utility:


 - PHYLIP format

 - DNA data

 - Genomic Region:           the DYS #

 - Population:                  Your DNA Project Group#


 LAMARC will accept different size lines, but sometimes you may want the longest line at the end of the data, and the shorter lines at the beginning.

- Dave Hamm   

Franklin, Ohio

HAM Surname DNA Project Coordinator

email: odoniv@earthlink.net

URL: http://home.earthlink.net/~odoniv/HamCountry/HAMCountry.html



 

