PHYLIP (Phylogeny Inference Package) Version 3.5c

 

by Joseph Felsenstein

 

March, 1993

 

 

 

 

COPYRIGHT NOTICE

(c) Copyright 1986-1993 by Joseph Felsenstein and the University of Washington.

Permission is granted to copy this document provided that no fee is charged for

it and that this copyright notice is not removed.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CONTENTS OF THIS DOCUMENT

Copyright notice

Contents of this document

General description of PHYLIP

Contents of this package

What the programs do

Overview of the input and output formats

Input File Format

The Options Menu

The Output File

The Tree File

The Options and How to Invoke Them

Options Information in the Input File

Common Options in the Menu

The U (User Tree) option

The G (Global) option

The J (Jumble) option

The O (Outgroup) option

The T (Threshold) option

The M (multiple data sets) option

The option to write out the trees into a tree file

The (0) terminal type option

Common Options Requiring Information in the Input File

The Weights option

The Algorithm for Constructing Trees

Local Rearrangements

Global Rearrangements

Multiple Jumbles

Strategy for Finding the Best Tree

A Warning on Interpreting Results

Relative Speed of Different Programs and Machines

Relative speed of the different programs

Speed with different numbers of species

Relative speed of different machines

Published benchmarks

Endorsements

General Comments on Adapting the Package to Different Computer Systems

Compiling the programs

Using "make"

Getting PHYLIP onto your microcomputer

Microsoft Quick C and Microsoft C

Turbo C++ for PCDOS

Waterloo C/386

Think C for Macintosh

Unix

VMS VAX systems

Cray

IBM mainframes running CMS

Other Computer Systems

 

 

Frequently Asked Questions

"If I copied PHYLIP from a friend without you knowing, ...?"

"How do I make a citation to the PHYLIP package ...?"

"How do I bootstrap? Why has DNABOOT disappeared?"

"How do I specify a multi-species outgroup? ..."

"How do I force certain groups to remain monophyletic ...?"

"How can I reroot one of the trees written out by PHYLIP?"

"Why doesn't NEIGHBOR read my DNA sequences correctly?"

"What do I do about deletions and insertions in my sequences?"

"Why don't your parsimony programs print out branch lengths?"

"Why can't your programs handle unordered multistate characters?"

"Where can I get a printed version of the PHYLIP documents"

"Why have I been dropped from your newsletter mailing list?"

"How many copies of PHYLIP have been distributed?"

Additional Frequently Asked Questions, or:

"Why didn't it occur to you to..."

write these programs in Pascal?"

forget about all those inferior systems and just develop PHYLIP for Unix?"

write these programs in PROLOG (or Ada, or Modula-2, or Simula, or ...)?"

include in the package a program to do the Distance Wagner method ... ?

include in the package ordination methods and more clustering algorithms?"

include in the package a program to do nucleotide sequence alignment ...?"

send me the programs over the electronic network I use, BUTTERFLYNET?"

let me log in to your computer in Seattle and copy the files ....?"

send me a listing of your program?"

write a magnetic tape in our computer center's favorite format ....?"

give us a version of these in FORTRAN?"

New Features in Recent Versions

Coming Attractions, Future Plans

References for the Documentation Files

Credits

Other phylogeny programs available elsewhere

PAUP

BIOSYS-1

MacClade

Hennig86

ClaDOS

TreeAlign

Clustal

fastDNAml

VOSTORG

MEGA

Evomony

COMPONENT

Turbotree

Molevol

CLINCH

COMPROB

MARKOV

PHYSYS

SINCAIDEN

MUST

GDE

TreeTool

 

 

 

How You Can Help Me

Listserver Bulletins

In case of trouble

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PHYLIP - Phylogeny Inference Package (version 3.5)

This is a FREE package of programs for inferring phylogenies and carrying

out certain related tasks. At present it contains 31 programs, which carry out

different algorithms on different kinds of data. The programs in the package

are:

---------- Programs for molecular sequence data ----------

PROTPARS Protein parsimony DNAPARS Parsimony method for DNA

DNAMOVE Interactive DNA parsimony DNAPENNY Branch and bound for DNA

DNACOMP Compatibility for DNA DNAINVAR Phylogenetic invariants

DNAML Maximum likelihood method DNAMLK DNA ML with molecular clock

DNADIST Distances from sequences PROTDIST Distances from proteins

RESTML ML for restriction sites SEQBOOT Bootstraps sequence data sets

COALLIKE Coalescent likelihoods from sampled phylogeny estimates

----------- Programs for distance matrix data ------------

FITCH Fitch-Margoliash and least-squares methods

KITSCH Fitch-Margoliash and least squares methods with evolutionary clock

NEIGHBOR Neighbor-joining and UPGMA methods

-------- Programs for gene frequencies and continuous characters -------

CONTML Maximum likelihood method GENDIST Computes genetic distances

CONTRAST Computes contrasts and correlations for comparative method studies

------------- Programs for 0-1 discrete state data -----------

MIX Wagner, Camin-Sokal, and mixed parsimony criteria

MOVE Interactive Wagner, C-S, mixed parsimony program

PENNY Finds all most parsimonious trees by branch-and-bound

DOLLOP, DOLMOVE, DOLPENNY same as preceding four programs, but for

the Dollo and polymorphism parsimony criteria

CLIQUE Compatibility method FACTOR recode multistate characters

---------- Programs for plotting trees and consensus trees -------

DRAWGRAM Draws cladograms and phenograms on screens, plotters and printers

DRAWTREE Draws unrooted phylogenies on screens, plotters and printers

CONSENSE Majority-rule and strict consensus trees

RETREE Reroots, changes names and branch lengths, and flips trees

There is also an Unsupported Division containing two programs, makeinf and

ProtML, which were contributed by others and are maintained by their authors.

The package includes extensive documentation files that provide the information

necessary to use and modify the programs.

The programs are written in a very standard subset of C, a language that is

available on most computers (including microcomputers). The programs require no

modifications to run on most machines: for example they work without

modification with Microsoft C, Turbo C, Think C, and on the C compilers

available on Unix and VAX VMS systems. C source code is distributed in the

regular version of PHYLIP. To use it, you must have a C compiler. A Pascal

version can also be supplied on request. Executables are available for PCDOS,

386 PCDOS, 386 Windows, and Macintoshes as described below.

NETWORK DISTRIBUTION: The package is available by "anonymous ftp" over

electronic networks (including the PCDOS, 386 PCDOS, 386 Windows, and Macintosh

executables) from evolution.genetics.washington.edu (128.95.12.41). Contact me

by electronic mail for details or start by fetching file pub/phylip/Read.Me. I

can also send the source code and documentation files (but not executables)

over Bitnet/EARN and other networks.

DISKETTE DISTRIBUTION: The package is also distributed in a variety of

microcomputer diskette formats. You should send FORMATTED diskettes, which I

will return with the package written on them. See below for how many diskettes

to send. The source code of the programs on the electronic network or magnetic

tape versions may of course also be moved to microcomputers and compiled there.

 

 

PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS, 386 Windows,

386 PCDOS, and Macintosh systems are available from me. Specify the "386

Windows executable version", "386 PCDOS executable version", "PCDOS executable

version" or "Macintosh executable version" and send the number of diskettes

indicated below. Source code sent will be in C unless you specify Pascal.

HOW MANY DISKETTES TO SEND: The following table shows for different formats how

many diskettes to send, and how many extra diskettes to send for the executable

version:

Diskette size Density For source code For executables send

and documentation in addition

3.5 inch PCDOS 1.44 Mb 1 3

5.25 inch PCDOS 1.2 Mb 1 3

3.5 inch PCDOS 720 Kb 2 4

5.25 inch PCDOS 360 Kb 8 5

Macintosh 800K 2 3

Macintosh High density 1 1

Some other formats are also available. You MUST tell me EXACTLY which of these

formats you need. The diskettes MUST be formatted by you before being sent to

me. Sending an extra diskette may be helpful.

TAPE DISTRIBUTION: The programs can also be distributed on an industry-standard

1-inch magnetic tape provided by you. Contact me for details.

POLICIES: The package is distributed free. It will be written on the diskettes

or tape, which will be mailed back. They can be sent to:

Joe Felsenstein

Electronic mail addresses: Department of Genetics SK-50

Internet: joe@genetics.washington.edu University of Washington

Bitnet/EARN: felsenst@uwavm Seattle, Washington 98195, U.S.A.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

CONTENTS OF THIS PACKAGE

The source code and documentation of the package consists of 89 files,

plus 4 more for the programs in the Unsupported Division. In the electronic

mail version some of these files may be split into parts, so there may be more.

The package is organized into three major parts, the source code, the

documentation, and the unsupported programs. The documentation is organized

hierarchically, with groups of documentation files for different kinds of data

each preceded by a documentation file for the group as well. The "unsupported

division" of PHYLIP contains programs contributed by others (and not supported

by us) that we feel may of use to you.

Files Contents

---- --------

1 README -- describes the contents of the package

2 main.doc -- this general documentation file

The Source code

3 Makefile -- the "Makefile" to be used by C's that have "make"

4 Makefile.qc -- the Makefile for Microsoft C and Quick C

5 Makefile.tc -- the Makefile for Borland Turbo C and Borland C

6 phylip.h -- the PHYLIP "header file"

7 compile.com -- a VMS command file to compile all of PHYLIP

8 vaxfix.c -- procedures needed to fix VMS printf(" %hd ")

9 protpars.c -- parsimony for protein sequence data

10 dnapars.c -- DNA parsimony program

11 dnamove.c -- interactive DNA parsimony

12 dnapenny.c -- branch and bound method for DNA

13 dnacomp.c -- DNA compatibility program

14 dnainvar.c -- computation of Lake's and Cavender's invariants

15 dnaml.c -- DNA maximum likelihood program, part 1

16 dnaml2.c -- DNA maximum likelihood program, part 2

17 dnamlk.c -- DNA maximum likelihood with molecular clock

18 dnamlk2.c -- DNA maximum likelihood with clock, part 2

19 dnadist.c -- computes distance matrix from sequences

20 protdist.c -- computes distance matrix from sequences

21 restml.c -- maximum likelihood for restriction sites

22 restml2.c -- maximum likelihood for restriction sites, part 2

23 seqboot.c -- makes multiple data sets by bootstrap resampling

24 coallike.c -- coalescent likelihoods from sampled phylogenies

25 fitch.c -- Fitch-Margoliash and least-squares methods

26 kitsch.c -- F-M, L-S methods with evolutionary clock

27 neighbor.c -- neighbor-joining and UPGMA methods

28 contml.c -- maximum likelihood program

29 gendist.c -- computes genetic distances

30 contrast.c -- contrasts etc. for comparative method studies

31 mix.c -- Wagner, Camin-Sokal parsimony and mixtures, part 1

32 mix2.c -- Wagner, Camin-Sokal parsimony and mixtures, part 2

33 move.c -- interactive Wagner, Camin-Sokal and mixed parsimony

34 penny.c -- finds all most parsimonious trees

35 dollop.c -- Dollo and polymorphism parsimony methods

36 dolmove.c -- interactive Dollo and polymorphism parsimony

37 dolpenny.c -- branch and bound for Dollo, polymorphism

38 clique.c -- compatibility program

39 factor.c -- recode multistate to binary characters

40 drawgraphics.h -- header file for drawgraphics.c

41 drawgraphics.c -- routines used in both drawgram.c and drawtree.c

42 interface.h -- header for Mac interface

43 interface.c -- Mac routines used in Mac interface

44 drawgram.c -- makes plots of cladograms, phenograms

45 drawtree.c -- makes plots of unrooted phylogenies

46 font1 -- digitized font (simple sans-serif Roman)

 

 

47 font2 -- digitized font (medium quality sans-serif Roman)

48 font3 -- digitized font (high quality serifed Roman)

49 font4 -- digitized font (medium quality sans-serif Italic)

50 font5 -- digitized font (high quality serifed Italic)

51 font6 -- digitized font (Russian Cyrillic)

52 consense.c -- majority-rule and strict consensus trees

53 retree.c -- reroots, rearranges and changes lengths on trees

The Documentation

54 sequence.doc -- documentation for molecular sequence programs

55 protpars.doc -- documentation for protpars.c

56 dnapars.doc -- documentation for dnapars.c

57 dnamove.doc -- documentation for dnamove.c

58 dnapenny.doc -- documentation for dnapenny.c

59 dnacomp.doc -- documentation for dnacomp.c

60 dnainvar.doc -- documentation for dnainvar.c

61 dnaml.doc -- documentation for dnaml.c and dnaml2.c

62 dnamlk.doc -- documentation for dnamlk.c and dnamlk2.c

63 dnadist.doc -- documentation for dnadist.c

64 protdist.doc -- documentation for protdist.c

65 restml.doc -- documentation for restml.c and restml2.c

66 seqboot.doc -- documentation for seqboot.c

67 coallike.doc -- documentation for coallike.c

68 distance.doc -- documentation for distance matrix programs

69 fitch.doc -- documentation for fitch.c

70 kitsch.doc -- documentation for kitsch.c

71 neighbor.doc -- documentation for neighbor.c

72 contchar.doc -- documentation for gene frequency

and continuous character programs

73 contml.doc -- documentation for contml.c

74 gendist.doc -- documentation for gendist.c

75 contrast.doc -- documentation for contrast.c

76 discrete.doc -- documentation for discrete character programs

77 mix.doc -- documentation for mix.c

78 move.doc -- documentation for move.c

79 penny.doc -- documentation for penny.c

80 dollop.doc -- documentation for dollop.c

81 dolmove.doc -- documentation for dolmove.c

82 dolpenny.doc -- documentation for dolpenny.c

83 clique.doc -- documentation for clique.c

84 factor.doc -- documentation for factor.c

85 draw.doc -- documentation for tree plotting programs

86 drawgram.doc -- documentation for drawgram.c

87 drawtree.doc -- documentation for drawtree.c

88 consense.doc -- documentation for consense.c

89 retree.doc -- documentation for retree.c

The Unsupported Division

90 makeinf.doc -- documentation for makeinf (by Arend Sidow)

91 makeinf.c -- C source for makeinf

92 protml.doc -- documentation for ProtML (by Adachi and Hasegawa)

93 protml.pas -- Pascal source for ProtML

 

 

WHAT THE PROGRAMS DO

Here is a short description of each of the programs. For more detailed

discussion you should definitely read the documentation file for the individual

program and the documentation file for the group of programs it is in.

PROTPARS. Estimates phylogenies from protein sequences (input using the

standard one-letter code for amino acids) using the parsimony method, in

 

 

a variant which counts only those nucleotide changes that change the amino

acid, on the assumption that silent changes are more easily accomplished.

DNAPARS. Estimates phylogenies by the parsimony method using nucleic acid

sequences. Allows use the full IUB ambiguity codes, and estimates

ancestral nucleotide states. Gaps treated as a fifth nucleotide state.

DNAMOVE. Interactive construction of phylogenies from nucleic acid sequences,

with their evaluation by parsimony and compatibility and the display of

reconstructed ancestral bases. This can be used to find parsimony or

compatibility estimates by hand.

DNAPENNY. Finds all most parsimonious phylogenies for nucleic acid sequences

by branch-and-bound search. This may not be practical (depending on the

data) for more than 10 or 11 species.

DNACOMP. Estimates phylogenies from nucleic acid sequence data using the

compatibility criterion, which searches for the largest number of sites

which could have all states (nucleotides) uniquely evolved on the same

tree. Compatibility is particularly appropriate when sites vary greatly in

their rates of evolution, but we do not know in advance which are the less

reliable ones.

DNAINVAR. For nucleic acid sequence data on four species, computes Lake's and

Cavender's phylogenetic invariants, which test alternative tree topologies.

The program also tabulates the frequencies of occurrence of the different

nucleotide patterns. Lake's invariants are the method which he calls

"evolutionary parsimony".

DNAML. Estimates phylogenies from nucleotide sequences by maximum

likelihood. The model employed allows for unequal expected frequencies of

the four nucleotides, for unequal rates of transitions and transversions,

and for different (prespecified) rates of change in different categories of

sites, with the program inferring which sites have which rates.

DNAMLK. Same as DNAML but assumes a molecular clock. The use of the

two programs together permits a likelihood ratio test of the

molecular clock hypothesis to be made.

DNADIST. Computes four different distances between species from nucleic acid

sequences. The distances can then be used in the distance matrix programs.

The distances are the Jukes-Cantor formula, one based on Kimura's 2-

parameter method, Jin and Nei's distance which allows for rate variation

from site to site, and a maximum likelihood method using the model employed

in DNAML. The latter method of computing distances can be very slow.

PROTDIST. Computes a distance measure for protein sequences, using maximum

likelihood estimates based on the Dayhoff PAM matrix, Kimura's 1983

approximation to it, or a model based on the genetic code plus a

constraint on changing to a different category of amino acid. The

distances can then be used in the distance matrix programs.

RESTML. Estimation of phylogenies by maximum likelihood using restriction

sites data (not restriction fragments but presence/absence of individual

sites). It employs the Jukes-Cantor symmetrical model of nucleotide

change, which does not allow for differences of rate between transitions

and transversions. This program is VERY slow.

SEQBOOT. Reads in a data set, and produces multiple data sets from

it by bootstrap resampling. Since most programs in the current version of

the package allow processing of multiple data sets, this can be used

 

 

together with the consensus tree program CONSENSE to do bootstrap (or

delete-half-jackknife) analyses with most of the methods in this package.

This program also allows the Archie/Faith technique of permutation of

species within characters, as well as block bootstrap resampling.

COALLIKE. May be used, after using SEQBOOT and DNAMLK, to take a treefile

that they produce, and make an estimate of the likelihood curve for

the parameter 4Nu (4 times the product of effective population size and

mutation rate) when the sequences are a sample from a population

and the tree is assumed to be produced by the "coalescent" process.

FITCH. Estimates phylogenies from distance matrix data under the "additive

tree model" according to which the distances are expected to equal the sums

of branch lengths between the species. Uses the Fitch-Margoliash criterion

and some related least squares criteria. Does not assume an evolutionary

clock. This program will be useful with distances computed from DNA

sequences, with DNA hybridization measurements, and with genetic distances

computed from gene frequencies.

KITSCH. Estimates phylogenies from distance matrix data under the

"ultrametric" model which is the same as the additive tree model except

that an evolutionary clock is assumed. The Fitch-Margoliash criterion and

other least squares criteria are assumed. This program will be useful with

distances computes from DNA sequences, with DNA hybridization measurements,

and with genetic distances computed from gene frequencies.

NEIGHBOR. An implementation by Mary Kuhner and John Yamato of Saitou and

Nei's "Neighbor Joining Method," and of the UPGMA (Average Linkage

clustering) method. Neighbor Joining is a distance matrix method producing

an unrooted tree without the assumption of a clock. UPGMA does assume a

clock. The branch lengths are not optimized by the least squares criterion

but the methods are very fast and thus can handle much larger data sets.

CONTML. Estimates phylogenies from gene frequency data by maximum likelihood

under a model in which all divergence is due to genetic drift in the

absence of new mutations. Does not assume a molecular clock. An

alternative method of analyzing this data is to compute Nei's genetic

distance and use one of the distance matrix programs.

GENDIST. Computes one of three different genetic distance formulas from gene

frequency data. The formulas are Nei's genetic distance, the Cavalli-

Sforza chord measure, and the genetic distance of Reynolds et. al. The

former is appropriate for data in which new mutations occur in an infinite

isoalleles neutral mutation model, the latter two for a model without

mutation and with pure genetic drift. The distances are written to a file

in a format appropriate for input to the distance matrix programs.

CONTRAST. Reads a tree from a tree file, and a data set with continuous

characters data, and produces the independent contrasts for those

characters, for use in any multivariate statistics package. Will also

produce covariances, regressions and correlations between characters for

those contrasts.

MIX. Estimates phylogenies by some parsimony methods for discrete character

data with two states (0 and 1). Allows use of the Wagner parsimony method,

the Camin-Sokal parsimony method, or arbitrary mixtures of these. Also

reconstructs ancestral states and allows weighting of characters.

MOVE. Interactive construction of phylogenies from discrete character data

with two states (0 and 1). Evaluates parsimony and compatibility criteria

for those phylogenies and displays reconstructed states throughout the

 

 

tree. This can be used to find parsimony or compatibility estimates by

hand.

PENNY. Finds all most parsimonious phylogenies for discrete-character data

with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria

using the branch-and-bound method of exact search. May be impractical

(depending on the data) for more than 10-11 species.

DOLLOP. Estimates phylogenies by the Dollo or polymorphism parsimony criteria

for discrete character data with two states (0 and 1). Also reconstructs

ancestral states and allows weighting of characters. Dollo parsimony is

particularly appropriate for restriction sites data; with ancestor states

specified as unknown it may be appropriate for restriction fragments data.

DOLMOVE. Interactive construction of phylogenies from discrete character data

with two states (0 and 1) using the Dollo or polymorphism parsimony

criteria. Evaluates parsimony and compatibility criteria for those

phylogenies and displays reconstructed states throughout the tree. This

can be used to find parsimony or compatibility estimates by hand.

DOLPENNY. Finds all most parsimonious phylogenies for discrete-character data

with two states, for the Dollo or polymorphism parsimony criteria using the

branch-and-bound method of exact search. May be impractical (depending on

the data) for more than 10-11 species.

CLIQUE. Finds the largest clique of mutually compatible characters, and the

phylogeny which they recommend, for discrete character data with two

states. The largest clique (or all cliques within a given size range of

the largest one) are found by a very fast branch and bound search method.

The method does not allow for missing data. For such cases the T

(Threshold) option of MIX may be a useful alternative. Compatibility

methods are particular useful when some characters are of poor quality and

the rest of good quality, but when it is not known in advance which ones

are which.

FACTOR. Takes discrete multistate data with character state trees and

produces the corresponding data set with two states (0 and 1). Written by

Christopher Meacham.

DRAWGRAM. Plots rooted phylogenies, cladograms, and phenograms in a

wide variety of user-controllable formats. The program is

interactive and allows previewing of the tree on PC graphics screens,

and Tektronix or DEC graphics terminals. Final output can be on

a laser printer (such as the Apple Laserwriter or HP Laserjet),

on graphics screens or terminals, on pen plotters (Hewlett-Packard or

Houston Instruments) or on dot matrix printers capable of graphics

(Epson, Okidata, Imagewriter, or Toshiba).

DRAWTREE. Similar to DRAWGRAM but plots unrooted phylogenies.

CONSENSE. Computes consensus trees by the majority-rule consensus tree

method, which also allows one to easily find the strict consensus tree.

Does NOT compute the Adams consensus tree. Trees are input in a tree file

in standard nested-parenthesis notation, which is produced by many of the

tree estimation programs in the package when the Y option is invoked.

This program can be used as the final step in doing bootstrap analyses for

many of the methods in the package.

RETREE. Reads in a tree (with branch lengths if necessary) and allows

you to reroot the tree, to flip branches, to change species names and

branch lengths, and then write the result out. Can be used to convert

 

 

between rooted and unrooted trees.

Programs in the Unsupported Division

The Unsupported Division of PHYLIP consists of two programs contributed by

others that may be useful to you and have kindly been contributed by their

authors. Those authors retain full copyright to their programs and

documentation files. They are provided in the PHYLIP source code distribution

but have not been provided as executables in the executables distribution. All

questions about these programs should be directed to their authors, whose

electronic mail addresses and regular mail addresses are given in their

documentation files.

MAKEINF. This program by Arend Sidow can be used to translate the output files

from Jotun Hein's popular multiple-sequence alignment program into PHYLIP input

files. It also allows you to selectively analyze different codon positions and

different organisms. The output from other alignment programs can rather

easily be edited into a form that it will read.

PROTML. This large Pascal program from Jun Adachi and Masami Hasegawa carries

out maximum likelihood estimation of phylogenies from protein sequence data.

It is quite analogous to DNAML, but uses instead of a model for DNA evolution

the PAM matrix model of Margaret Dayhoff. Because of the larger number of

states (20 instead of 4) it is necessarily slower than DNAML by a large factor.

However the authors have adopted a different, and faster, rearrangement

strategy to search among tree topologies for the best one. ProtML does not yet

incorporate the Categories feature of DNAML and DNAMLK which allows different

rates of evolution at different sites, without the user specifying in advance

which site has which rate of evolution. For support, contact them at the

Internet addresses hasegawa@ism.ac.jp and adachi@sunmh.ism.ac.jp at the

Institute of Statistical Mathematics, Tokyo, Japan.

 

 

OVERVIEW OF THE INPUT AND OUTPUT FORMATS

When you run most of these programs, a menu will appear offering you

choices of the various options available for that program. The data that the

program reads should be in an input file called (in most cases) "infile". If

there is no such file the programs will ask you for the name of the input file.

Below we describe the input file format, and then the menu.

Input File Format

----- ---- ------

I have tried to adhere to a rather stereotyped input and output format.

For the parsimony, compatibility and maximum likelihood programs, excluding the

distance matrix methods, the simplest version of the input file looks something

like this:

6 13

Archaeopt CGATGCTTAC CGC

HesperorniCGTTACTCGT TGT

BaluchitheTAATGTTAAT TGT

B. virginiTAATGTTCGT TGT

BrontosaurCAAAACCCAT CAT

B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the

number of characters, in free format, separated by blanks (not by

commas). The information for each species follows, starting with a

 

 

ten-character species name (which can include punctuation marks and blanks),

and continuing with the characters for that species. In the

discrete-character, DNA and protein sequence programs the characters are each a

single letter or digit, sometimes separated by blanks. In

the continuous-characters programs they are real numbers with decimal points,

separated by blanks:

Latimeria 2.03 3.457 100.2 0.0 -3.7

The conventions about continuing the data beyond one line per species are

different between the molecular sequence programs and the others. The

molecular sequence programs can take the data in "aligned" or "interleaved"

format, with some lines giving the first part of each of the sequences, then

lines giving the next part of each, and so on. Thus the sequences might look

like this:

6 39

Archaeopt CGATGCTTAC CGCCGATGCT

HesperorniCGTTACTCGT TGTCGTTACT

BaluchitheTAATGTTAAT TGTTAATGTT

B. virginiTAATGTTCGT TGTTAATGTT

BrontosaurCAAAACCCAT CATCAAAACC

B.subtilisGGCAGCCAAT CACGGCAGCC

TACCGCCGAT GCTTACCGC

CGTTGTCGTT ACTCGTTGT

AATTGTTAAT GTTAATTGT

CGTTGTTAAT GTTCGTTGT

CATCATCAAA ACCCATCAT

AATCACGGCA GCCAATCAC

Note that in these sequences we have a blank every ten sites to make them

easier to read: any such blanks are allowed. The blank line which separates

the two groups of lines (the ones containing sites 1-20 and ones containing

sites 21-39) may or may not be present, but if it is, it should be a line of

zero length and not contain any extra blank characters (this is because of a

limitation of the current versions of the programs). It is important that the

number of sites in each group be the same for all species (i.e., it will not be

possible to run the programs successfully if the first species line contains 20

bases, but the first line for the second species contains 21 bases).

Alternatively, an option can be selected to take the data in "sequential"

format, with all of the data for the first species, then all of the characters

for the next species, and so on. This is also the way that the discrete

characters programs and the gene frequencies and quantitative characters

programs want to read the data. They do not allow the "interleaved" format.

In the sequential format, the character data can run on to a new line at

any time (except in a species name or in the case of continuous character and

distance matrix programs where you cannot go to a new line in the middle of a

real number). Thus it is legal to have:

Archaeopt 001100

1101

or even:

 

 

 

 

 

 

 

Archaeopt

0011001101

though note that the FULL ten characters of the species name MUST then be

present: in the above case there must be a blank after the "t". In all cases

it is possible to put internal blanks between any of the character values, so

that

Archaeopt 0011001101 0111011100

is allowed.

If you make an error in the input file, the programs will often detect that

they have been fed an illegal character or illegal numerical value and issue an

error message such as "BAD CHARACTER STATE:", often printing out the bad value,

and sometimes the number of the species and character in which it occurred.

The program will then stop shortly after. One of the things which can lead to

a bad value is the omission of something earlier in the file, or the insertion

of something superfluous, which cause the reading of the file to get out of

synchronization. The program then starts reading things it didn't expect, and

concludes that they are in error. So if you see this error message, you may

also want to look for the earlier problem that may have led to this.

The other major variation on the input data format is the options

information. Many options are selected using the menu, but a few are selected

by including extra information in the input file. Some options are described

below.

 

The Options Menu

--- ------- ----

The menu is straightforward. It typically looks like this (this one is

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

for DNAPARS):

 

DNA parsimony algorithm, version 3.5c

Setting for this run:

U Search for best tree? Yes

J Randomize input order of sequences? No. Use input order

O Outgroup root? No, use as outgroup species 1

T Use Threshold parsimony? No, use ordinary parsimony

M Analyze multiple data sets? No

I Input sequences interleaved? Yes

0 Terminal type (IBM PC, VT52, ANSI)? ANSI

1 Print out the data at start of run No

2 Print indications of progress of run Yes

3 Print out tree Yes

4 Print out steps in each site No

5 Print sequences at all nodes of tree No

6 Write out trees onto tree file? Yes

Are these settings correct? (type Y or the letter for one to change)

If you want to accept the default settings (they are shown in the above case)

you can simply type "Y" followed by a carriage-return (Enter) character. If

you want to change any of the options, you should type the letter shown to the

left of its entry in the menu. For example, to set a threshold type "T".

Lower-case letters will also work. For many of the options the program will

ask for supplementary information, such as the value of the threshold.

Note the "Terminal type" entry, which you will find on all menus. It

allows you to specify which type of terminal your screen is. The options are

an IBM PC screen, an ANSI standard terminal (such as a DEC VT100), a DEC VT52-

compatible terminal, such as a Zenith Z29, or no terminal type. Choosing "0"

toggles among these four options in cyclical order, changing each time the "0"

option is chosen. If one of them is right for your terminal the screen will be

cleared before the menu is displayed. If none works the "none" option should

probably be chosen. Keep in mind that VT-52 compatible terminals can freeze up

if they receive the screen-clearing commands for the ANSI standard terminal!

If this is a problem it may be helpful to recompile the program, setting the

constants near its beginning so that the program starts up with the VT52 option

set.

The other numbered options control which information the program will

display on your screen or on the output files. The option to "Print

indications of progress of run" will show information such as the names of the

species as they are successively added to the tree, and the progress of global

rearrangements. You will usually want to see these as reassurance that the

program is running and to help you estimate how long it will take. But if you

are running the program "in background" as can be done on multitasking and

multiuser systems such as Unix, and do not have the program running in its own

window, you may want to turn this option off so that it does not disturb your

use of the computer while the program is running.

 

The Output File

--- ------ ----

Most of the programs write their output onto a file called (usually)

"outfile", and a representation of the trees found onto a file called

"treefile".

 

 

 

The exact contents of the output file vary from program to program and

also depend on which menu options you have selected. For many programs, if you

select all possible output information, the output will consist of (1) the name

of the program and its version number, (2) the input information printed out,

(3) a series of phylogenies, some with associated information indicating how

much change there was in each character or on each part of the tree. A typical

rooted tree looks like this:

+-------------------Gibbon

+----------------------------2

! ! +------------------Orang

! +------4

! ! +---------Gorilla

+-----3 +--6

! ! ! +---------Chimp

! ! +----5

--1 ! +-----Human

! !

! +-----------------------------------------------Mouse

!

+------------------------------------------------Bovine

The interpretation of the tree is fairly straightforward: it "grows" from left

to right. The numbers at the forks are arbitrary and are used (if present)

merely to identify the forks. In some of the programs asterisks ("*") are used

instead of numbers. For many of the programs the tree produced is unrooted.

It is printed out in nearly the same form, but with a warning message:

remember: this is an unrooted tree!

The warning message ("remember: ...") indicates that this is an unrooted tree

(mathematicians still call this a tree, though some systematists unfortunately

use the term "network". This conflicts with standard mathematical usage, which

reserves the name "network" for a completely different kind of graph). The

root of this tree could be anywhere, say on the line leading immediately to

Mouse. As an exercise, see if you can tell whether the following tree is or is

not a different one from the above:

+-----------------------------------------------Mouse

!

+---------4 +------------------Orang

! ! +------3

! ! ! ! +---------Chimp

---6 +----------------------------1 ! +----2

! ! +--5 +-----Human

! ! !

! ! +---------Gorilla

! !

! +-------------------Gibbon

!

+-------------------------------------------Bovine

remember: this is an unrooted tree!

(it is NOT different). It is IMPORTANT also to realize that the lengths of the

segments of the printed tree may not be significant: some may actually

represent branches of zero length, in the sense that there is no evidence that

the branches are nonzero in length. Some of the diagrams of trees attempt to

print branches approximately proportional to estimated branch lengths, while in

others the lengths are purely conventional and are presented just to make the

topology visible. You will have to look closely at the documentation that

 

 

accompanies each program to see what it presents and what is known about the

lengths of the branches on the tree. The above tree attempts to represent

branch lengths approximately in the diagram. But even in those cases, some of

the smaller branches are likely to be artificially lengthened to make the tree

topology clearer. Here is what a tree from DNAPARS looks like, when no attempt

is made to make the lengths of branches in the diagram proportional to

estimated branch lengths:

+--Human

+--5

+--4 +--Chimp

! !

+--3 +-----Gorilla

! !

+--2 +--------Orang

! !

+--1 +-----------Gibbon

! !

--6 +--------------Mouse

!

+-----------------Bovine

remember: this is an unrooted tree!

Some of the parsimony programs in the package can print out a table of the

number of steps that different characters (or sites) require on the tree. This

table may not be obvious at first. A typical example looks like this:

steps in each site:

0 1 2 3 4 5 6 7 8 9

*-----------------------------------------

0! 2 2 2 2 1 1 2 2 1

10! 1 2 3 1 1 1 1 1 1 2

20! 1 2 2 1 2 2 1 1 1 2

30! 1 2 1 1 1 2 1 3 1 1

40! 1

The numbers across the top and down the side indicate which site is being

referred to. Thus site 23 is column "3" of row "20" and has 2 steps in this

case.

 

The Tree File

--- ---- ----

In output from most programs, a representation of the tree is also written

into the tree file (usually named "treefile"). The tree is specified by the

nested pairs of parentheses, enclosing names and separated by commas. If there

are any blanks in the names, these must be replaced by the underscore character

"_". Trailing blanks in the name may be omitted. The pattern of the

parentheses indicates the pattern of the tree by having each pair of

parentheses enclose all the members of a monophyletic group. The tree file for

the above tree would have its first line look like this:

((Mouse,Bovine),((Orang,(Gorilla,(Chimp,Human))),Gibbon));

In the above tree the first fork separates the lineage leading to Mouse and

Bovine from the lineage leading to the rest. Within the latter group there is

a fork separating Gibbon from the rest, and so on. The entire tree is enclosed

in an outermost pair of parentheses. The tree ends with a semicolon. In some

programs such as DNAML, FITCH, and CONTML, the tree will be completely unrooted

 

 

and specified by a bottommost fork with a three-way split, with three

"monophyletic" groups separated by two commas:

(A,(B,(C,D)),(E,F));

The three "monophyletic" groups here are A, (B,C,D), and (E,F). The single

three-way split corresponds to one of the interior nodes of the unrooted tree

(it can be any interior node). The remaining forks are encountered as you move

out from that first node, and each then appears as a two-way split. You should

check the documentation files for the particular programs you are using to see

in which of these forms you can expect the user tree to be in. Note that many

of the programs that estimate an unrooted tree produce trees in the treefile in

rooted form! This is done for reasons of arbitrary internal bookkeeping. The

placement of the root is arbitrary.

For programs estimating branch lengths, these are given in the trees in

the tree file as real numbers following a colon, and placed immediately after

the group descended from that branch. Here is a typical tree with branch

lengths:

((cat:47.14069,(weasel:18.87953,((dog:25.46154,(raccoon:19.19959,

bear:6.80041):0.84600):3.87382,(sea_lion:11.99700,

seal:12.00300):7.52973):2.09461):20.59201):25.0,monkey:75.85931);

Note that the tree may continue to a new line at any time except in the middle

of a name or the middle of a branch length, although in trees written to the

tree file this will only be done after a comma.

These representations of trees are a subset of the standard adopted on

June 24, 1986 at the annual meetings of the Society for the Study of Evolution

at an meeting (the final session in a local lobster restaurant) of an informal

committee consisting of Wayne Maddison (MacClade), David Swofford (PAUP), F.

James Rohlf (NTSYS-PC), Chris Meacham (COMPROB and plotting programs), James

Archie (character coding program), William H.E. Day, and me. This standard is

a generalization of PHYLIP's format, itself based on a well-known

representation of trees in terms of parenthesis patterns which has been around

for almost a century. The standard is now employed by most phylogeny computer

programs but unfortunately has yet to be decribed in a formal published

description.

 

 

THE OPTIONS AND HOW TO INVOKE THEM

Most of the programs allow various options that alter the amount of

information the program is provided or what it is to do with the information.

Most options are selected in the menu. However a few are specified in the

input file, or require part of their specification to be in the input file.

 

Options Information in the Input File

------- ----------- -- --- ----- ----

In such cases, the program is notified that an option has been invoked by

the presence of one or more letters after the last number on the first line of

the input file. These letters may or may not be separated from each other by

blanks, though it is usually necessary to separate them from the number by a

blank. They can be in any order. Thus to invoke options A and W, the input

file starts with the line:

 

 

 

 

12 20 WA

or:

12 20 A W

The options are described individually in the other documents of this package.

For the options that require information to be in the input file, additional

information must be provided. For all but one of these, this information is

provided by placing a line after the first line of the file, but before the

beginning of the species data. The first character of that line should match

the option letter. These auxiliary information lines can be in any order.

Thus if options A and W are both invoked, both of the following formats (and

two others as well) are legal:

12 20 AW 12 20 A W

A 0001111000 Weights 00112221A0

Weights 00112221A0 A 0001111000

(then the species information) (then the species information)

One of the options requires special discussion. Many of the programs have in

their menu the option U, which signals that one or more user-defined trees is

to be provided for evaluation. This "user tree" is supplied in the input file

(not the tree file), but AFTER the species data, rather than before it. It does

not require any indication to be placed in the first line of the input file, as

do the options that place information before the species data. After the data,

there is a line containing the number of user-defined trees being defined.

Each user-defined tree starts on a new line. It is in the same form as the

trees in the tree files mentioned above, namely the New Hampshire standard.

Here is an example with one user-defined tree:

6 13

Archaeopt 0011001110000

Hesperorni0001101101101

Baluchithe1111011011101

B. virgini1111011101101

Brontosaur0110100111011

B.subtilis0000000011010

1

((B.subtilis,Baluchithe),((Brontosaur,B._virgini),

(Hesperorni,Archaeopt)));

 

In using the user tree option, check the pattern of parentheses carefully.

The programs do not always detect whether the tree makes sense, and if it does

not there will probably be a crash (hopefully, but not inevitably, with an

error message indicating the nature of the problem).

 

Common Options in the Menu

------ ------- -- --- ----

Seven options from the menu, the U (User tree), G (Global), J (Jumble), O

(Outgroup), T (Threshold), M (multiple data sets), and the tree output options,

are used so widely that it is best to discuss them in this document.

(1) The U (User tree) option. This option toggles between the default

setting, which allows the program to search for the best tree, and the User

tree setting, which reads a tree or trees ("user trees") from the input file

and evaluates them. The user trees must follow the other information in the

data set, and be preceded by a line specifying the number to user trees that

are to be evaluated. Each user tree then is given in standard form, each

starting on a new line. The form that the user trees must take is described in

 

 

some detail below, under the description of the program output of tree files.

In some cases a program may require that the trees fed in be rooted trees, even

though the program cannot infer the placement of the root. In those cases you

can place the root anywhere. Program RETREE can be used to convert between

rooted and unrooted trees.

(2) The G (Global) option. In the programs which construct trees (except

for NEIGHBOR, the "...PENNY" programs and CLIQUE, and of course the "...MOVE"

programs where you construct the trees yourself), after all species have been

added to the tree a rearrangements phase ensues. In most of these programs the

rearrangements are automatically global, which in this case means that subtrees

will be removed from the tree and put back on in all possible ways so as to

have a better chance of finding a better tree. Since this can be time

consuming (it roughly triples the time taken for a run) it is left as an option

in some of the programs, specifically CONTML, FITCH, and DNAML. In these

programs the G menu option toggles between the default of local rearrangement

and global rearrangement. The rearrangements are explained more below.

(3) The J (Jumble) option. In most of the tree construction programs

(except for the "...PENNY" programs and CLIQUE), the exact details of the

search of different trees depend on the order of input of species. In these

programs J option enables you to tell the program to use a random number

generator to choose the input order of species. This option is toggled on and

off by selecting option J in the menu. The program will then prompt you for a

"seed" for the random number generator. The seed should be an integer between

1 and 32767, and should of form 4n+1, which means that it must give a remainder

of 1 when divided by 4. This can be judged by looking at the last two digits

of the number. Each different seed leads to a different sequence of addition

of species. By simply changing the random number seed and re-running the

programs one can look for other, and better trees. If the seed entered is not

odd, the program will not proceed, but will prompt for another seed.

The Jumble option also causes the program to ask you how many times you

want to restart the process. If you answer 10, the program will try ten

different orders of species in constructing the trees, and the results printed

out will reflect this entire search process (that is, the best trees found

among all 10 runs will be printed out, not the best trees from each individual

run).

(4) The O (Outgroup) option. This specifies which species is to be used

to root the tree by having it become the outgroup. This option is toggled on

and off by choosing O in the menu. When it is on, the program will then prompt

for the number of the outgroup (the species being taken in the numerical order

that they occur in the input file). Responding by typing "6" and then a

carriage-return (Enter) character indicates that the sixth species in the data

is the outgroup. Outgroup-rooting will not be attempted if the data have

already established a root for the tree from some other consideration, and may

not be if it is a user-defined tree, despite your invoking the option. Thus

programs such as DOLLOP that produce only rooted trees do not allow the

Outgroup option. It is also not available in KITSCH, DNAMLK, or CLIQUE. When

it is used, the tree as printed out is still listed as being an unrooted tree,

though the outgroup is connected to the bottommost node so that it is easy to

visually convert the tree into rooted form.

(5) The T (Threshold) option. This sets a threshold such that if the

number of steps counted in a character is higher than the threshold, it will be

taken to be the threshold value rather than the actual number of steps. The

default is a threshold so high that it will never be surpassed. The T menu

option toggles on and off asking the user to supply a threshold. The use of

thresholds to obtain methods intermediate between parsimony and compatibility

methods is described in my 1981b paper. When the T option is in force, the

 

 

program will prompt for the numerical threshold value. This will be a positive

real number greater than 1. In programs MIX, MOVE, PENNY, PROTPARS, DNAPARS,

DNAMOVE, and DNAPENNY, do not use threshold values less than or equal to 1.0,

as they have no meaning and lead to a tree which depends only on considerations

such as the input order of species and not at all on the character state data!

In programs DOLLOP, DOLMOVE, and DOLPENNY the threshold should never be 0.0 or

less, for the same reason. The T option is an important and underutilized one:

it is, for example, the only way in this package (except for program DNACOMP)

to do a compatibility analysis when there are missing data. It is a method of

de-weighting characters that evolve rapidly. I wish more people were aware of

its properties.

(6) The M (Multiple data sets) option. In menu programs there is an M

menu option which allows one to toggle on the multiple data sets option. The

program will ask you how many data sets it should expect. The data sets have

the same format as the first data set. Here is a (very small) input file with

two five-species data sets:

5 6

Alpha CCACCA

Beta CCAAAA

Gamma CAACCA

Delta AACAAC

Epsilon AACCCA

5 6

Alpha CACACA

Beta CCAACC

Gamma CAACAC

Delta GCCTGG

Epsilon TGCAAT

The main use of this option will be to allow all of the methods in these

programs to be bootstrapped. Using the program SEQBOOT one can take any DNA,

protein, restriction sites, or binary character data set and make multiple data

sets by bootstrapping. Trees can be produced for all of these using the M

option. They will be written on the tree output file if that option is left in

force. Then the program CONSENSE can be used with that tree file as its input

file. The result is a majority rule consensus tree which can be used to make

confidence intervals. The present version of the package allows, with the use

of SEQBOOT and CONSENSE and the M option, bootstrapping of many of the methods

in the package.

(7) The option to write out the trees into a tree file. This specifies

that you want the program to write out the tree not only on its usual output,

but also onto a file in nested-parenthesis notation (as described above). This

option is sufficiently useful that it is turned on by default in all programs

that allow it. You can optionally turn it off if you wish, by typing the

appropriate number from the menu (it varies from program to program). This

option is useful for creating tree files that can be directly read into the

plotting programs, the consensus tree program, and can be incorporated into the

input file to specify user-defined trees in many of the other programs.

(8) The (0) terminal type option. The program will default to one

particular assumption about your terminal (except in the case of Macintoshes,

the default will be an ANSI compatible terminal). You can alternatively select

it to be either an IBM PC, a DEC VT52, or nothing. This affects the ability of

the programs to clear the screen when they display their menus, and the

graphics characters used to display trees in the programs DNAMOVE, MOVE,

DOLMOVE, and RETREE. If you are running a PCDOS system any have the ANSI.SYS

driver installed in your CONFIG.SYS file, you may find that the screen clears

correctly even with the default setting of ANSI.

 

 

Common Options Requiring Information in the Input File

------ ------- --------- ----------- -- --- ----- ----

There are a number of options (Ancestor, Factors, Categories and Weights)

that are specified in the input file. Some of them must also be selected in

the menu. Of these, the Ancestor and Factors options are specific to the

Discrete Characters programs and are described in their group document. The

Categories option is specific to some of the molecular sequence programs and is

described in their group document. The Weights option is used throughout the

package and is best introduced here.

This allows us to specify weights on the individual characters. Weights

are invoked by placing a W on the first line of the file. The weights are then

specified by a line or lines which start with W and then have enough characters

or blanks to complete the full length of a species name. Then they have a

single character (0-9 or A-Z) for each character. Thus they look like the data

for a species:

Weights 0001111001112

or:

W 1110000ZZZZZ1

The weights cause a character to be counted as if it were n characters, where n

is the weight. The values 0-9 give weights 0 through 9, and the values A-Z

give weights 10 through 35. By use of the weights we can give overwhelming

weight to some characters, and drop others from the analysis. In the molecular

sequence programs only two values of the weights, 0 or 1 are allowed.

Weights can be used to analyze different subsets of characters (by

weighting the rest as zero). Alternatively, in the discrete characters

programs they can be used to force a certain group to appear on the phylogeny

(in effect confining consideration to only phylogenies containing that group).

This is done by adding an imaginary character that has 1's for the members of

the group, and 0's for all the other species. That imaginary character is then

given the highest weight possible: the result will be that any phylogeny that

does not contain that group will be penalized by such a heavy amount that it

will not (except in the most unusual circumstances) be considered. Of course,

the new character brings extra steps to the tree, but the number of these can

be calculated in advance and subtracted out of the total when reporting the

results. This use of weights is an important one, and one sadly ignored by

many users who could profit from it. In the case of molecular sequences we

cannot use weights this way, so that to force a given group to appear we have

to add a large extra segment of sites to the molecule, with (say) A's for that

group and C's for every other species.

 

 

THE ALGORITHM FOR CONSTRUCTING TREES

All of the programs except FACTOR, DNADIST, GENDIST, DNAINVAR, SEQBOOT,

CONTRAST, RETREE, COALLIKE and the plotting and consensus tree programs act to

construct an estimate of a phylogeny. MOVE, DOLMOVE, and DNAMOVE let you

construct it yourself by hand. All of the rest but NEIGHBOR, the "...PENNY"

programs and CLIQUE make use of a common approach involving additions and

rearrangements. They are trying to minimize or maximize some quantity over the

space of all possible evolutionary trees. Each program contains a part that,

given the topology of the tree, evaluates the quantity that is being minimized

or maximized. The straightforward approach would be to evaluate all possible

tree topologies one after another and pick the one which, according to the

 

 

criterion being used, is best. This would not be possible for more than a

small number of species, since the number of possible tree topologies is

enormous. A review of the literature on the counting of evolutionary trees

will be found one of my papers (Felsenstein, 1978a).

Since we cannot search all topologies, these programs are not guaranteed

to always find the best tree, although they seem to do quite well in practice.

The strategy they employ is as follows: the species are taken in the order in

which they appear in the input file. The first two (in some programs the first

three) are taken and a tree constructed containing only those. There is only

one possible topology for this tree. Then the next species is taken, and we

consider where it might be added to the tree. If the initial tree is (say) a

rooted tree with two species and we want the resulting three-species tree to be

a bifurcating tree, there are only three places where we could add the third

species. Each of these is tried, and each time the resulting tree is evaluated

according to the criterion. The best one is chosen to be the basis for further

operations. Now we consider adding the fourth species, again at each of the

five possible places that would result in a bifurcating tree. Again, the best

of these is accepted.

 

Local Rearrangements

----- --------------

The process continues in this manner, with one important exception. After

each species is added, and before the next is added, a number of rearrangements

of the tree are tried, in an effort to improve it. The algorithms move through

the tree, making all possible local rearrangements of the tree. A local

rearrangement involves an internal segment of the tree in the following manner.

Each internal segment of the tree is of this form (where T1, T2, and T3 are

subtrees -- parts of the tree that can contain further forks and tips):

T1 T2 T3

\ / /

\ / /

\ / /

\/ /

* /

* /

* /

* /

*

!

!

the segment we are discussing being indicated by the asterisks. A local

rearrangement consists of switching the subtrees T1 and T3 or T2 and T3, so as

to obtain one of the following:

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

T3 T2 T1 T1 T3 T2

\ / / \ / /

\ / / \ / /

\ / / \ / /

\ / / \ / /

\ / \ /

\ / \ /

\ / \ /

\ / \ /

! !

! !

! !

Each time a local rearrangement is successful in finding a better tree, the new

arrangement is accepted. The phase of local rearrangements does not end until

the program can traverse the entire tree, attempting local rearrangements,

without finding any that improve the tree.

This strategy of adding species and making local rearrangements will look

at about (n-1) times (2n-3) different topologies, though if rearrangements are

frequently successful the number may be larger. I have been describing the

strategy when rooted trees are being considered. For unrooted trees there is a

precisely similar strategy, though the first tree constructed may be a three-

species tree and the rearrangements may not start until after the addition of

the fifth species.

Though we are not guaranteed to have found the best tree topology, we are

guaranteed that no nearby topology (i. e. none accessible by a single local

rearrangement) is better. In this sense we have reached a local optimum of our

criterion. Note that the whole process is dependent on the order in which the

species are present in the input file. We can try to find a different and

better solution by reordering the species in the input file and running the

program again (or, more easily, by using the J option). If none of these

attempts finds a better solution, then we have some indication that we may have

found the best topology, though we can never be certain of this.

Note also that a new topology is never accepted unless it is better than

the previous one, so that the rearrangement process can never fall into an

endless loop. This is also the way ties in our criterion are resolved, namely

by sticking with the tree found first. However, the tree construction programs

other than CLIQUE, CONTML, FITCH, and DNAML do keep a record of all trees found

that are tied with the best one found. This gives you some immediate idea of

which parts of the tree can be altered without affecting the quality of the

result.

 

Global Rearrangements

------ --------------

A feature of most of the programs, such as PROTPARS, DNAPARS, DNACOMP,

DNAML, DNAMLK, RESTML, KITSCH, FITCH, CONTML, MIX, and DOLLOP, is "global"

optimization of the tree. In four of these (CONTML, FITCH, DNAML and DNAMLK)

this is an option, 'G'. In the others it automatically applies. When it is

present there is an additional stage to the search for the best tree. Each

possible subtree is removed from the tree from the tree and added back in all

possible places. This process continues until all subtrees can be removed and

added again without any improvement in the tree. The purpose of this extra

rearrangement is to make it less likely that one or more a species gets "stuck"

in a suboptimal region of the space of all possible trees. The use of global

optimization results in approximately a tripling (3x) of the run-time, which is

why I have left it as an option in some of the slower programs.

 

 

The programs doing global optimization print out a dot "." after each

group is removed and re-added to the tree, to give the user some sign that the

rearrangements are proceeding. A new line of dots is started whenever a new

round of global rearrangements is started following an improvement in the tree.

On the line before the dots are printed there is printed a bar of the form

"!--------------!" to show how many dots to expect. The dots will not be

printed out at a uniform rate, but the later dots, which represent removal of

larger groups from the tree and trying them consequently in fewer places, will

print out more quickly. With some compilers each row of dots is not printed

out until it is complete.

It should be noted that PENNY, DOLPENNY, DNAPENNY and CLIQUE use a more

sophisticated strategy of "depth-first search" with a "branch and bound" search

method that guarantees that all of the best trees will be found. In the case

of PENNY, DOLPENNY and DNAPENNY there can be a considerable sacrifice of

computer time if the number of species is greater than about ten: it is a

matter for you to consider whether it is worth it for you to guarantee finding

all the most parsimonious trees, and that depends on how much free computer

time you have! CLIQUE finds all largest cliques, and does so without undue

burning of computer time.

 

Multiple Jumbles

-------- -------

As just mentioned, for most of these programs the search depends on the

order in which the species are entered into the tree. Using the J (Jumble)

option you can supply a random number seed which will allow the program to put

the species in in a random order. A new feature (with version 3.5) is to allow

this to be done multiple times. If you tell the program to do it 10 times, it

will go through the tree-building process 10 times, each with a different

random order of adding species. It will keep a record of the trees tied for

best over the whole process. In other words, it does not just record the best

trees from each of the 10 runs, but records the best ones overall. Of course

this is slow, taking 10 times longer than a single run. But it does give us a

much greater chance of finding all of the most parsimonious trees. In the

terminology of Maddison (1991) it can find different "islands" of trees. The

present algorithms do not guarantee us to find all trees in a given "island"

from a single run, so multiple runs also help explore those "islands" that are

found.

 

 

STRATEGY FOR FINDING THE BEST TREE

In practice, it is advisable to use the Jumble option to evaluate many

different orderings of the input species. When the programs which have global

branch-swapping as default (such as DNAPARS) are used or when the G option is

employed in other programs IT IS ADVISABLE TO USE THE JUMBLE OPTION AND SPECIFY

THAT IT BE DONE MANY TIMES (AS MANY AS TEN) to use different orderings of the

input species). When the G (Global rearrangement) option is not being used I

have also found it useful to do multiple Jumbles.

People who want a magic "black box" program whose results they do not have

to question (or think about) often are upset that these programs give results

that are dependent on the order in which the species are entered in the data.

To me this property is an advantage, for it permits you to try different

searches for better trees, simply by varying the input order of species. If

you do not use the multiple Jumble option, but do multiple individual runs

instead, you can easily decide which to pay most attention to -- the one or

ones that are best according to the criterion employed (for example, with

 

 

parsimony, the one out of the runs that results in the tree with the fewest

changes).

In practice, in a single run, it usually seems best to put species that

are likely to be sources of confusion in the topology last, as by the time they

are added the arrangement of the earlier species will have stabilized into a

good configuration, and then the last few species will by fitted into that

topology. There will be less chance this way of a poor initial topology that

would affect all subsequent parts of the search. However, a variety of

arrangements of the input order of species should be tried, as can be done if

the J option is used, and no species should be kept in a fixed place in the

order of input. Note that the results of the "...PENNY" programs and CLIQUE

are not sensitive to the input order of species, and NEIGHBOR is only slightly

sensistive to it, so that multiple Jumbling is not possible with those

programs. Note also that with global search, which is standard in many

programs and in others is an option, each group (including each individual

species) will be removed and re-added in all possible positions, so that a

species causing confusion will have more chance of moving to a new location

than it would without global rearrangement.

 

 

A WARNING ON INTERPRETING RESULTS

Probably the most important thing to keep in mind while running any of the

parsimony or compatibility programs is not to overinterpret the result. Many

users treat the set of most parsimonious trees as if it were a confidence

interval. If a group appears in all of the most parsimonious trees then they

treat it as well established. Unfortunately THE CONFIDENCE INTERVAL ON

PHYLOGENIES APPEARS TO BE MUCH LARGER THAN THE SET OF ALL MOST PARSIMONIOUS

TREES (Felsenstein, 1985b). Likewise, variation of result among different

methods will not be a good indicator of the size of the confidence interval.

Consider a simple data set in which, out of 100 binary characters, 51 recommend

the rooted tree ((A,B),C) and 49 the tree (A,(B,C)). Many different methods

will all give the same result on such a data set: they will estimate the tree

as ((A,B),C). Nevertheless it is clear that the 51:49 margin by which this

tree is favored is not significantly different from 50:50. So CONSISTENCY

AMONG DIFFERENT METHODS IS A POOR GUIDE TO STATISTICAL SIGNIFICANCE.

 

 

RELATIVE SPEED OF DIFFERENT PROGRAMS AND MACHINES

Relative speed of the different programs

-------- ----- -- --- --------- --------

C compilers differ in efficiency of the code they generate, and some deal

with some features of the language better than with others. Thus a program

which is unusually fast on one computer may be unusually slow on another.

Nevertheless, as a rough guide to relative execution speeds, I have tested the

programs on three data sets, each of which has 10 species and 20 characters.

The first is an imaginary one in which all characters are compatible - ("The

Willi Hennig Memorial Data Set" as J. S. Farris once called it). The second is

the binary recoded form of the fossil horses data set of Camin and Sokal

(1965). The third data set has data that is completely random: 10 species and

20 characters with a 50% chance that each character state is 0 or 1 (or A or

G). The data sets range from a completely compatible one in which there is no

homoplasy (paralellism or convergence), through the horses data set, which

requires 29 steps where the possible minimum number would be 20, to the random

data set, which requires 49 steps. We can thus see how this increasing

messiness of the data affects running times.

 

 

Here are the nucleotide sequence versions of the three data sets:

10 20

A CACACACAAAAAAAAAAACA

B CACACAACAAAAAAAAAACA

C CACAACAAAAAAAAAAAACA

D CAACAAAACAAAAAAAAACA

E CAACAAAAACAAAAAAAACA

F ACAAAAAAAACACACAAAAC

G ACAAAAAAAACACAACAAAC

H ACAAAAAAAACAACAAAAAC

I ACAAAAAAAAACAAAACAAC

J ACAAAAAAAAACAAAAACAC

10 20

MesohippusAAAAAAAAAAAAAAAAAAAA

HypohippusAAACCCCCCCAAAAAAAAAC

ArchaeohipCAAAAAAAAAAAAAAAACAC

ParahippusCAAACAACAACAAAAAAAAC

MerychippuCCAACCACCACCCCACACCC

M. secunduCCAACCACCACCCACACCCC

Nannipus CCAACCACAACCCCACACCC

NeohippariCCAACCCCCCCCCCACACCC

Calippus CCAACCACAACCCACACCCC

PliohippusCCCACCCCCCCCCACACCCC

10 20

A CACACAACCAAACAAACCAC

B AAACCACACACACAAACCCA

C ACAAAACCAAACCACCCACA

D AAAAACACAACACACCAAAC

E AAACAACCACACACAACCAA

F CCCAAACACCCCCAAAAAAC

G ACACCCCCACACCCACCAAC

H AAAACAACAACCACCCCACC

I ACACAACAACACAAACAACC

J CCAAAAACACCCAACCCAAC

Here are the timings of many of the version 3.5 programs on these three

data sets as run after being compiled by Microsoft Quick C on an 16 MHz 80386SX

computer under PCDOS 5.0. An 80387 math co-processor was present and was used

by the compiled code.

 

Hennigian Data Horses Data Random Data

PROTPARS 82.83 86.23 148.03

DNAPARS 5.98 5.66 11.54

DNAPENNY 46.03 23.51 5305.97

DNACOMP 7.14 6.43 11.86

DNAINVAR 0.61 0.66 0.61

DNAML 1928.99 2069.32 2611.48

DNAMLK 2247.12 6094.81 4993.00

DNADIST 3.57 4.50 5.38

RESTML 6818.34 13422.15 28418.34

FITCH 35.92 48.61 38.17

KITSCH 12.42 12.36 13.18

NEIGHBOR 2.20 2.14 2.903

CONTML 56.85 57.56 59.15

GENDIST 1.00 1.00 1.00

MIX 13.62 14.60 25.92

 

 

PENNY 8.41 21.31 3851.1

DOLLOP 26.69 26.86 46.30

DOLPENNY 12.25 56.57 23934.22

CLIQUE 0.77 0.71 0.77

FACTOR 0.39 0.44 0.44

In all cases the programs were run under the default options, except as

specified here. The data sets used for the discrete characters programs have

0's and 1's instead of A's and C's. For CONTML the 0's and 1's were made into

0.0's and 1.0's and considered as 20 2-allele loci. For the distance programs

10 x 10 distance matrices were computed from the three data sets. Nor does it

make much sense to benchmark MOVE, DOLMOVE, or DNAMOVE, although when there are

many characters and many species the response time after each alteration of the

tree should be proportional to the product of the number of species and the

number of characters. For DNAML and DNAMLK the frequencies of the four bases

were set to be equal rather than determined empirically as is the default. For

RESTML the number of enzymes was set to 1.

Several patterns will be apparent from this. The algorithms (MIX, DOLLOP,

CONTML, FITCH, KITSCH, PROTPARS, DNAPARS, DNACOMP, and DNAML, DNAMLK, RESTML)

that use the above-described addition strategy have run times that do not

depend strongly on the messiness of the data. The only exception to this is

that if a data set such as the Random data requires one extra round of global

rearrangements it takes longer. The programs differ greatly in run time: the

likelihood programs RESTML, DNAML and CONTML are quite a bit slower than the

others. The protein sequence parsimony program, which has to do a considerable

amount of bookkeeping to keep track of which amino acids can mutate to each

other, is also relatively slow.

Another class of algorithms includes PENNY, DOLPENNY, DNAPENNY and CLIQUE.

These are branch-and-bound methods: in principle they should have execution

times that rise exponentially with the number of species and/or characters, and

they might be much more sensitive to messy data. This is apparent with PENNY,

DOLPENNY, and DNAPENNY, which go from being reasonably fast with clean data to

very slow with messy data. DOLPENNY is paritcularly slow on messy data -- this

is because this algorithm cannot make use of some of the lower-bound

calculations that are possible with DNAPENNY and PENNY. CLIQUE is very fast on

all data sets. Although in theory it should bog down if the number of cliques

in the data is very large, that does not happen with random data, which in fact

has few cliques and those small ones. Apparently the "worst-case" data sets

are much rarer for CLIQUE than for the other branch-and-bound methods.

NEIGHBOR is quite fast compared to FITCH and KITSCH, and should make it

possible to run much larger cases, although the results are expected to be a

bit rougher than with those programs.

 

Speed with different numbers of species

----- ---- --------- ------- -- -------

How will the speed depend on the number of species and the number of

characters? For the sequential-addition algorithms, the speed should be

proportional to the cube of the number of species, and to the number of

characters. Thus a case that has, instead of 10 species and 20 characters, 20

species and 50 characters would take 2 x 2 x 2 x 2.5 = 20 times as long. This

implies that cases with more than 20 species will be slow, and cases with more

than 40 species VERY slow. This places a premium on working on small

subproblems rather than just dumping a whole large data set into the programs.

An exception to these rules will be some of the DNA programs that use an

aliasing device to save execution time. In these programs execution time will

 

 

not necessarily increase proportional to the number of sites, as sites that

show the same pattern of nucleotides will be detected as identical and the

calculations for them will be done only once, which does not lead to more

execution time. This is particularly likely to happen with few species and

many sites, or with data sets that have small amounts of evolutionary

divergence.

For programs FITCH and KITSCH, the distance matrix is square, so that when

we double the number of species we also double the number of "characters", so

that running times will go up as the fourth power of the number of species

rather than the third power. Thus a 20-species case with FITCH is expected to

run sixteen times more slowly than a 10-species case.

For programs like PENNY and CLIQUE the run times will rise faster than the

cube of the number of species (in fact, they can rise faster than any power

since these algorithms are not guaranteed to work in polynomial time). In

practice, PENNY will frequently bog down above 11 species, while CLIQUE easily

deals with larger numbers.

For NEIGHBOR the speed should vary only as the square of the number of

species, so a case twice as large will take only four times as long. This will

make it an attractive alternative to FITCH and KITSCH for large data sets.

If you are unsure of how long a program will take, try it first on a few

species, then work your way up until you get a feel for the speed and for what

size programs you can afford to run.

Execution time is not the most important criterion for a program,

particularly as computer time gets much cheaper than your time or a

programmer's time. With workstations on which background jobs can be run all

night, execution speed is not overwhelmingly relevant. Some of us have been

conditioned by an earlier era of computing to consider execution speed

paramount. But ease of use, ease of adaptation to your computer system, and

ease of modification are much more important in practice, and in these respects

I think these programs are adequate. Only if you are engaged in 1960's style

mainframe computing is minimization of execution time paramount.

Nevertheless it would have been nice to have made the programs faster.

The present speeds are a compromise between speed and effectiveness: by making

them slower and trying more rearrangements in the trees, or by enumerating all

possible trees, I could have made the programs more likely to find the best

tree. By trying fewer rearrangements I could have speeded them up, but at the

cost of finding worse trees. I could also have speeded them up by writing

critical sections in assembly language, but this would have sacrificed ease of

distribution to new computer systems. There are also some options included in

these programs that make it harder to adopt some of the economies of

bookkeeping that make other programs faster. However to some extent I have

simply made the decision not to spend time trying to speed up program

bookkeeping when there were new likelihood and statistical methods to be

developed.

 

Relative speed of different machines

It is interesting to compare different machines using DNAPARS as the

standard task. One can rate a machine on the DNAPARS benchmark by summing the

times for all three of the data sets. Here are relative total timings over all

three data sets (done with various versions of DNAPARS) for some machines,

taking Microsoft Quick C running under PCDOS on a 16 MHz 80386 clone as the

standard. Pascal benchmarks from version 3.4 of the program are also included

-- they are compared only with each other and their times are in parentheses.

 

 

This use of two separate standards is necessary not because of different

languages but because different versions of the package are being compared.

Thus, the "Time" is the ratio of the Total to that for the 386SX, for the

appropriate standard, so that the Time for the Macintosh Classic for DNAPARS

3.4 on Think Pascal 3 is compared to the Time for the 386/SX running DNAPARS

3.4 on Turbo Pascal 6.0, but the Time for the Macintosh Classic running version

3.5 on Think C is compared to the Time for the 386SX running version 3.5 on

Quick C. The Speed is the reciprocal of the Time.

Machine DOS Compiler Total Time Speed

------- --- -------- ----- ---- -----

Toshiba T1100+ PCDOS Turbo Pascal 3.01A (269) 7.912 0.126

Apple Mac Plus MacOS Lightspeed Pascal 2 (175.84) 5.172 0.193

Toshiba T1100+ PCDOS Turbo Pascal 5.0 (162) 4.765 0.210

Macintosh Classic MacOS Think Pascal 3 (160) 4.706 0.212

Macintosh Classic MacOS Think C 43.0 3.58 0.279

IBM PS2/60 PCDOS Turbo Pascal 5.0 (58.76) 1.728 0.579

80286 (12 Mhz) PCDOS Turbo Pascal 5.0 (47.09) 1.385 0.722

Apple Mac IIcx MacOS Think Pascal 3 (42) 1.235 0.810

Apple Mac SE/30 MacOS Think Pascal 3 (42) 1.235 0.810

Apple Mac IIcx MacOS Lightspeed Pascal 2 (39.84) 1.172 0.853

Apple Mac IIcx MacOS Lightspeed Pascal 2# (39.69) 1.167 0.857

Zenith Z386 (16MHz) PCDOS Turbo Pascal 5.0 (38.27) 1.155 0.866

Macintosh SE/30 MacOS Think C 13.6 1.132 0.883

80386SX (16 MHz) PCDOS Turbo Pascal 6.0 (34) 1.0 1.0

80386SX (16 MHz) PCDOS Microsoft Quick C 12.01 1.0 1.0

Sequent-S81 DYNIX Silicon Valley Pascal (13.0) 0.382 2.615

VAX 11/785 Unix Berkeley Pascal (11.9) 0.35 2.857

80486-33 PCDOS Turbo Pascal 6.0 (11.46) 0.337 2.967

Sun 3/60 SunOS Sun C 3.93 0.327 3.056

NeXT Cube (68030) Mach Gnu C 2.608 0.217 4.605

Sequent S-81 DYNIX Sequent Symmetry C 2.604 0.217 4.612

VAXstation 3500 Unix Berkeley Pascal (7.3) 0.215 4.658

Sequent S-81 DYNIX Berkeley Pascal (5.6) 0.1647 6.07

Unisys 7000/40 Unix Berkeley Pascal (5.24) 0.1541 6.49

VAX 8600 VMS DEC VAX Pascal (3.96) 0.1165 8.59

Sun SPARC IPX SunOS Gnu C version 2.1 1.28 0.1066 9.383

VAX 6000-530 VMS DEC C 0.858 0.0714 13.998

VAXstation 4000 VMS DEC C 0.809 0.0674 14.845

IBM RS/6000 540 AIX XLP Pascal (2.276) 0.0669 14.94

NeXTstation(040/25) Mach Gnu C 0.75 0.0624 16.013

Sun SPARC IPX SunOS Sun C 0.68 0.0566 17.662

486DX (33 MHz) Linux Gnu C # 0.63 0.0525 19.063

Sun SPARCstation-1+ Unix Sun Pascal (1.7) 0.05 20.00

DECstation 5000/200 Unix DEC Ultrix C 0.45 0.0375 26.69

Sun SPARC 1+ SunOS Sun C 0.40 0.0333 30.025

DECstation 3100 Unix DEC Ultrix RISC Pascal (0.77) 0.0226 44.16

IBM 3090-300E AIX Metaware High C 0.27 0.0225 44.48

DECstation 5000/125 Unix DEC Ultrix RISC C 0.267 0.0222 44.98

DECstation 5000/200 Unix DEC Ultrix RISC C 0.256 0.0222 44.98

Sun SPARC 4/50 SunOS Sun C 0.249 0.02073 48.23

DEC 3000/400 AXP Unix DEC C 0.224 0.01865 53.62

DECstation 5000/240 Unix DEC Ultrix RISC C 0.1889 0.01573 63.58

SGI Iris R4000 Unix SGI C 0.184 0.1532 65.27

IBM 3090-300E VM Pascal VS (0.464) 0.0136 73.28

DECstation 5000/200 Unix DEC Ultrix RISC Pascal (0.39) 0.0114 87.18

The Toshiba T1100+ should be exactly as fast as an 8 MHz PC clone. For a

couple of the machines I am not sure that this benchmark is representative of

timings on non-numerical programs in PHYLIP. This is particularly the case for

 

 

the DEC 3000/400 AXP (the DEC "Alpha") which is probably quite a bit faster

than indicated here. The numerical programs benchmark below gives it a fairer

test. The IBM RS/6000 is probably up to ten times faster than shown here: it

may have been ill-served by its Pascal compiler.

Note that parallel machines like the Sequent are not really as slow as

indicated by the data here, as these runs did nothing to take advantage of

their parallelism.

For a picture of speeds for a more numerically intensive program, here are

benchmarks using DNAML, with the 16 MHz 386SX with math co-processor active as

the standard. Numbers are total run times (total user time in the case of

Unix) over all three data sets.

Operating

Machine System Compiler Seconds Time Speed

------- --------- -------- ------- ---- -----

386SX 16 Mhz PCDOS Turbo Pascal 6 (7826) 1.0 1.0

386SX 16 Mhz PCDOS Quick C 6549.79 1.0 1.0

Compudyne 486DX/33 Linux Gnu C 1599.9 0.2441 4.096

SUN Sparcstation 1+ SunOS Sun C 1402.8 0.2142 4.669

Everex STEP 386/20 PCDOS Turbo Pascal 5.5 (1440.8) 0.1841 5.432

486DX/33 PCDOS Turbo C++ 1107.2 0.1690 5.916

Compudyne 486DX/33 PCDOS Waterloo C/386 1045.78 0.1597 6.263

Sun SPARCstation IPX SunOS Gnu C 960.2 0.1466 6.821

NeXTstation(68040/25) Mach Gnu C 916.6 0.1399 7.146

486DX/33 PCDOS Waterloo C/386 861.0 0.1314 7.607

Sun SPARCstation IPX SunOS Sun C 787.7 0.1203 8.315

486DX/33 PCDOS Gnu C 650.9 0.0994 10.063

VAX 6000-530 VMS DEC C 637.0 0.0973 10.282

DECstation 5000/200 Unix DEC Ultrix RISC C 423.3 0.0646 15.473

IBM 3090-300E AIX Metaware High C 201.8 0.0308 32.46

Convex C240/1024 Unix C 101.6 0.01551 64.47

DEC 3000/400 AXP Unix DEC C 98.29 0.01501 66.64

You are invited to send me figures for your machine for inclusion in

future tables. Use the data sets above and compute the total times for DNAPARS

and for DNAML for the three data sets (setting the frequencies of the four

bases to 0.25 each for the DNAML runs). Be sure to tell me the name and

version of your compiler, and the version of PHYLIP you tested.

 

Published Benchmarks

--------- ----------

Some of you may have seen the "benchmark" published by Luckow and Pimentel

(1985). PHYLIP's WAGNER (an immediate ancestor of MIX) did not do well in it,

either in terms of the quality of result or execution speed. I do not believe

that this was a fair benchmark. WAGNER was run only with one order of input

species, not ten as recommended here. Had it been, perhaps the shortest tree

would have been found more often. No credit was given to PHYLIP in that

article for its free distribution, availability on microcomputers, availability

in source code form, or portability to new computers. Pimentel's laboratory

commissioned the development of a competing package, PHYSYS, which is a

commercial product, and that involvement was not stated in the article.

The benchmarks by Fink (1986) are fairer, although there are some

impressions given by that article which do not apply to the present version.

In particular, I have since added to many of the programs the ability to save

multiple equally-parsimonious trees, and have changed the outputs so that

 

 

reconstruction of states in the hypothetical ancestral nodes is much easier,

thus answering Fink's major criticisms. I have since eliminated the Metropolis

annealing method algorithms which he criticized. I disagree with Fink's view

OF PHYLIP that one should "be wary of published results from an analysis using

it", as I do not think that a tree slightly longer than the most parsimonious

one should be rejected out of hand. Nor do I agree that "it is really too slow

to use as a teaching tool", as in teaching one uses small data sets and speed

is not of the essence. Rather, simplicity of user interface is paramount, and

there PHYLIP does very well (so is ability to run on a variety of computers, in

which respect PHYLIP is also superior). In fact, it is widely used as a

teaching tool.

Nevertheless MIX is undoubtably not as fast or as sophisticated as PAUP or

Hennig86. The present version of PHYLIP is closer to its competitors in

quality of result than was the version Fink reviewed.

Platnick's (1987) benchmarks concentrated, as did the other benchmarkers

(all of them members of the same school of systematists) on parsimony as the

only phylogeny criterion worthy of attention. He concluded that PHYLIP could

be used effectively, especially if up to ten different input orders of species

were used. Again, as with the other benchmarks, no credit was given for

diversity of methods, portability, price, or availability of source code.

Platnick's second benchmark paper (1989) concentrates on Hennig86 and

Paup, and concludes that PHYLIP has not kept up with those programs in its

features. Again, the review is entirely concerned with parsimony, and only the

barest mention is made of ... (you can complete this sentence).

Sanderson's (1990) benchmark paper breaks with the method of the others by

specifying 36 features of the packages rated and giving separate ratings in

each. Like the other benchmark papers it concentrates almost exclusively on

parsimony as applied to morphological characters, but does at least give some

credit where credit is due.

My own, obviously biased, feeling is that there is a discrepancy between

the benchmarkers' projections of how satisfied users of PHYLIP will be, and how

satisfied they actually are. And that this discrepancy is in PHYLIP's favor.

 

 

ENDORSEMENTS

Here are some comments about PHYLIP. Explanatory material in square

brackets is my own:

 

From the pages of Cladistics:

"Under no circumstances can we recommend PHYLIP/WAG [their name for the

Wagner parsimony option of MIX]."

Luckow, M. and R. A. Pimentel (1985)

"PHYLIP has not proven very effective in implementing parsimony (Luckow and

Pimentel, 1985)."

J. Carpenter (1987a)

"... PHYLIP. This is the computer program where every newsletter concerning

it is mostly bug-catching, some of which have been put there by previous

corrections. As Platnick (1987) documents, through dint of much labor

useful results may be attained with this program, but I would suggest an

easier way: FORMAT b:"

 

 

J. Carpenter (1987b)

"PHYLIP is bug-infested and both less effective and orders of magnitude

slower than other programs ...."

"T. N. Nayenizgani" [J. S. Farris] (1990)

"Hennig86 [by J. S. Farris] provides such substantial improvements over

previously available programs (for both mainframes and microcomputers) that

it should now become the tool of choice for practising systematists."

N. Platnick (1989)

 

and in the pages of other journals:

"The availability, within PHYLIP of distance, compatibility, maximum

likelihood, and generalized 'invariants' algorithms (Cavender and

Felsenstein, 1987) sets it apart from other packages .... One of the

strengths of PHYLIP is its documentation ...."

Michael J. Sanderson (1990)

(Sanderson also criticizes PHYLIP for slowness and inflexibility of its

parsimony algorithms, and compliments other packages on their strengths).

 

"This package of programs has gradually become a basic necessity to anyone

working seriously on various aspects of phylogenetic inference .... The

package includes more programs than any other known phylogeny package. But

it is not just a collection of cladistic and related programs. The package

has great value added to the whole, and for this it is unique and of extreme

importance .... its various strengths are in the great array of methods

provided ...."

Bernard R. Baum (1989)

 

(see also above under Benchmarks for W. Fink's critical remarks (1986) on

version 2.8 of PHYLIP).

 

 

GENERAL COMMENTS ON ADAPTING THE PACKAGE TO DIFFERENT COMPUTER SYSTEMS

In the sections following you will find instructions on how to adapt the

programs to different computers and compilers. The programs should compile

without alteration on most versions of C. They use the "malloc" library or

"calloc" function to allocate memory so that the upper limits on how many

species or how many sites or characters they can run is set by the system

memory available to that memory-allocation function.

In the document file for each program, I have supplied a small input

example, and the output it produces, to help you check whether the programs are

running properly.

Most of the programs read their data from a file called "infile" and write

their output to a file called "outfile" and a tree file to a file "treefile".

If "infile" does not exist the program will prompt you for its name.

 

 

 

 

 

 

 

 

 

Compiling the programs

--------- --- --------

Many machines that have C compilers, particularly Unix systems, have a

utility called "make" available that considerably simplifies the process of

compiling these programs. I will first discuss how to compile these programs

with "make" and then, after a digression on how to move PHYLIP to a

microcomputer, discuss for different individual systems how to compile the

programs. As we shall see below, for some DOS and Macintosh compilers one

cannot simply use "make" and the standard Makefile.

 

Using "make"

----- ------

If your machine has "make" you can place all the programs for the package,

together with the file "Makefile" and the header files "phylip.h", and

"drawgraphics.h", in one directory. The Makefile and header files are

constructed to detect, for many varieties of C, which it is dealing with, and

inform the programs accordingly so that they can (by using "#ifdef") adapt to

the idiosyncracies of the compiler.

To compile all the programs just type: make all

To compile just one program, such as DNAML, type: make dnaml

After a time the compiler will finish compiling. The names of the

executables will be the same as the names of the C programs, but without the

".c" suffix. Thus dnaml.c compiles to make an executable called "dnaml". If

object modules ending in ".o" are found in the directory after compilation they

can be removed if you need space.

 

Getting PHYLIP onto your microcomputer

------- ------ ---- ---- -------------

C is widely available on microcomputers, and in any case we also

distribute executable versions for PCDOS, 386 PCDOS, and Macintosh systems.

Your institution may have an Internet connection, and if so there is probably a

PCDOS system or a Macintosh somewhere connected directly to it. Using that

machine you could download the executables and put them directly into diskette

for transfer to your own machine. You can also get the source code,

documentation, and executables by sending me the appropriate number of

diskettes (see the general information at the start of this document).

If you cannot do this, you may be able to transfer the entire package, in

the form of self-extracting archives (which is one of the ways we distribute it

for microcomputers) to your system using a terminal program with file transfer

capabilities. Some users are sufficiently terrified of this prospect that they

prefer to mail us diskettes and wait for several weeks. But if your

institution has an Internet connection it is much faster to do it that way. If

you have a serial port to which a modem can be hooked, you can get a terminal

program and do the transfers yourself. For most microcomputer systems,

public-domain or shareware terminal programs are available, such as the

widely-distributed KERMIT and MODEM families of programs. Most university

computer centers have communications programs (KERMIT or XMODEM) to "talk" to

KERMIT, MODEM, or PC-TALK and transfer files to and from it.

Thus, if you cannot get from me a disk format readable by your machine,

you can:

 

 

 

(1) Get an account on your mainframe and learn to use its facilities for

"anonymous ftp" (transfer of files over Internet) or electronic mail.

(2a) If you are on Internet (Or NSFNET) use the "anonymous ftp" method to

receive the self-extracting archive files (start by downloading and

reading the file "pub/phylip/Read.Me" from my system whose Internet

address is evolution.genetics.washington.edu (128.95.12.41)), or

(2b) if your institution is not on Internet but does have Bitnet

electronic mail, you can request that I send you the PHYLIP source code

files and documentation as e-mail messages over BITNET/EARN (not the

executables, however).

(3) Make sure the files are saved on your mainframe account (you will need

about 2.2 Megabytes of space) under appropriate names.

(4) Use the file transfer provisions of your terminal program to transfer

the archives to your microcomputer, or if they came as many e-mail

messages, to transfer these to your machine individually (most file

transfer programs can transfer many files with one command) for later

compilation of the C source.

If you cannot read the diskette formats that I can write, and if you

absolutely INSIST that I distribute the package in this format, please send me

the computer and thirteen diskettes. I will promptly write the diskettes and

return them (but of course I will keep your computer).

Now we turn to particular C compilers and describe particular problems

that may be encountered.

 

 

Microsoft Quick C and Microsoft C

--------- ----- - --- --------- -

These comments apply to Microsoft Quick C but may also work with Microsoft

C. A Makefile for Microsoft Quick C is included with the source code. It is

called "Makefile.qc". If you copy it and call the copy "Makefile" (making sure

to first save the generic Makefile that comes with this package under some name

such as Makefile.old), you should be able to use "make" as described above,

except that it is called "nmake". Note that the command you must use to

compile (for example) DNAPARS is "nmake dnapars.exe", not "nmake dnapars", as

the program that results is to be called "dnapars.exe" and the Quick C Makefile

is set up that way.

To compile individual programs without using the makefile, you need to do

the following. For a non-graphics program use the following command (DOS> is

the PCDOS prompt, so you do not type it):

DOS> qcl /AH /F 4000 /FPi [source files]

If the program you are trying to compile is a 1-part source (for example,

neighbor only has one part, neighbor.c) you should replace "[source files]"

with "neighbor.c". So the command would be:

DOS> qcl /AH /F 4000 /FPi neighbor.c

If the program you are trying to compile is a 2-part source (for example, mix

has two parts, mix.c and mix2.c) you can replace [source files] with both of

the source files. Make sure that the first source file in the list has the

same name as the executable file you want. i.e. use mix.c mix2.c and not the

 

 

other way around. If you reorder them, the executable file will be called

"MIX2.EXE". For mix, the command would be:

DOS> qcl /AH /F 4000 /FPi mix.c mix2.c

to compile a graphics program (i.e. drawgram, drawtree) under quick c without

using the makefile, use one of the following commands:

for DRAWGRAM:

DOS> qcl /AH /F 4000 /FPi drawgram.c drawgraphics.c graphics.lib [for drawgram]

for DRAWTREE:

DOS> qcl /AH /F 4000 /FPi drawtree.c drawgraphics.c graphics.lib [for drawtree]

 

 

Turbo C++ for PCDOS

----- --- --- -----

The following instructions are for Turbo C++ but may also work for Turbo C and

for Borland C, perhaps with slight modifications. Under normal situations you

can use the makefile. The makefile for Turbo C++ is included in the package as

"Makefile.tc". Copy it and call the copy "Makefile" (it would be wise the first

rename the original "Makefile" to "Makefile.old"). Then to compile, say,

DNAPARS, just type:

make dnapars.exe

However, if for some reason you want to do it by hand, follow the following

steps:

For the non-graphical programs (all those other than DRAWGRAM and DRAWTREE):

to compile dnapars.c type the following (DOS> is the PCDOS prompt)

DOS> tcc -mh dnapars.c

If the source file is sufficiently large to require two sources (for example,

dnaml.c and dnaml2.c), you will need to use both dnaml.c and dnaml2.c.

Examples:

DOS> tcc -mh dnaml.c dnaml2.c

DOS> tcc -mh neighbor.c

If you would like to use the program under the TD debugger, you should

add a "-v" flag as a compiler option:

DOS> tcc -mh -v restml.c restml2.c

For the graphical programs (DRAWGRAM and DRAWTREE):

First you need to build the "BGI" drivers. The BGI drivers are included

with your TURBOC compiler, and should be in the "BGI" directory (this is

a subdirectory of the main turboc directory). To do this you need to use

the "bgiobj" program, also in the BGI directory. The current version

of PHYLIP supports the EGA/VGA, CGA, and hercules drivers. If you have

modified the sources to take advantage of other drivers, you will have

to include those as well.

To build the BGI drivers:

DOS> cd \tc\bgi [this should be replaced with whatever your turboc dir is]

 

 

DOS> BGIOBJ EGAVGA

DOS> BGIOBJ CGA

DOS> BGIOBJ HERC

this generates the files "EGAVGA.OBJ", "CGA.OBJ", and "HERC.OBJ" in the

current directory. you want to copy this into your main source directory.

(assume this is \phylip)

DOS> CP EGAVGA.OBJ \phylip [replace this with your source directory]

DOS> CP CGA.OBJ \phylip

DOS> CP HERC.OBJ \phylip

To compile the program, cd back to your source directory. You want

to compile each source file, plus a shared graphics file called

"drawgraphics.c". You also want to link it to the newly created BGI

object files and to the graphics library.

Examples:

DOS> tcc -mh drawgram.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib

DOS> tcc -mh drawtree.c drawgraphics.c herc.obj egavga.obj cga.obj graphics.lib

(to compile drawgram and drawtree, respectively)

If you want to compile for the TD debugger, add the -v flag as above.

 

Waterloo C/386

-------- -----

Waterloo C/386 is the compiler we use to create the 386 PCDOS and 386

Windows versions of the executables. It has a "make" capability called

"wmake". We have had problems using this so the instructions here are for

individually compiling programs without wmake.

Watcom C/386 is a very flexible compiler which can generate executable

programs for many different environments. Following are instructions for using

Watcom C/386 to compile for DOS using the DOS/4GW DOS extender (included with

the Watcom distribution) and for Microsoft windows.

DOS/4GW:

to compile a program under watcom C/386 for the DOS/4GW dos extender use

the following (the "DOS>" is the PCDOS prompt, not something you type):

DOS> wcl386 /l=dos4gw /p /k65520 [source files]

If the program you are trying to compile is a 1-part source (for example,

neighbor only has one part, neighbor.c) you can replace [source files] with

"neighbor.c". So the command would be:

DOS> wcl386 /l=dos4gw /p /k65520 neighbor.c

If the program you are trying to compile is a 2-part source (for example, mix

has two parts, mix.c and mix2.c) you can replace [source files] with both of

the source files. Make sure that the first source file in the list has the

same name as the executable file you want. i.e. use mix.c mix2.c and not the

other way around. If you reorder them, the executable file will be called

"MIX2.EXE". For mix, the command would be:

DOS> wcl386 /l=dos4gw /p /k65520 mix.c mix2.c

 

 

The resultant executable file will take advantage of your system's extended

memory and will not be limited to using only the first 640K. However, it needs

the file "dos4gw.exe" in order to run. If you want to be able to use the

program generated, make sure that this program is somewhere in your path. (To

ensure this you can copy the program into the directory where the compiled

program resides). This "dos extender" is bundled with the Watcom C/386

compiler and is freely redistributable.

For Windows:

to compile a program under watcom C/386 for windows use the following:

DOS> wcl386 /l=win386 /zw /p /k65520 [source files]

again, replace [source files] with either the complete program (ie neighbor.c)

or both parts of the program (ie mix.c mix2.c).

once you have compiled the windows program you are not quite ready to run the

program under windows. The final step is to link it with the "windows

supervisor". to do this do the following:

DOS> wbind [program] -n

i.e.:

DOS> wbind mix -n

this program will generate [programname].exe. this application will be

runnable under windows.

CAVEATS:

1. Make sure that when you use wbind that \watcom\binw is somewhere in

your path. if it is not, you may have to tell wbind explicitly where

the windows supervisor file is, as in the following example:

DOS> wbind mix -n -s c:\watcom\binw\win386.ext which will replace the

c:\watcom\win386.ext with the full path of win386.ext.

2. The draw programs (drawgram, drawtree) currently do not compile

under windows. Compile them for DOS/4GW and use it in a dos shell under

windows

 

 

Think C for Macintosh

----- - --- ---------

For Symantec's Think C compiler (formerly called Lightspeed C) a "make"

utility is not available. Thus you cannot use the Makefile but must compile

the programs individually. Here are the steps you should follow to compile a

typical program.

(1) Start up Think-C.

(2) Click on "New project" in the Think C project menu. You will be asked to

enter the name of the project.

(3) Add the source code for the program to the project. To add sources to the