ClustalW1.7/004075500063530000012000000000000636737001400141005ustar00pingouinstaff00000400000036ClustalW1.7/clustalw.html010064400063530000012000000632000636737123100166240ustar00pingouinstaff00000400000036 ClustalW

Clustal W version 1.6 March 1996



Julie Thompson (Thompson@EMBL-Heidelberg.DE)
Toby Gibson (Gibson@EMBL-Heidelberg.DE)
European Molecular Biology Laboratory
Meyerhofstrasse 1
D 69117 Heidelberg
Germany

Des Higgins (Higgins@EBI.AC.UK)
European Bioinformatics Institute
Hinxton Hall
Hinxton
Cambridge CB10 1RQ
UK

Please e-mail bug reports/complaints/suggestions (polite if possible) to Toby Gibson or Des Higgins.


Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680.



What's New (March 1996) in Version 1.6 (since version 1.5).

  1. Improved handling of sequences of unequal length. Previously, we increased the gap extension penalties for both sequences if the two sequences (or groups of previously aligned sequences) were of different lengths. Now, we increase the gap opening and extension penalties for the shorter sequence only. This helps prevent short sequences being stretched out along longer ones.
  2. Added the "Gonnet" series of weight matrices (from Gaston Gonnet and co-workers at the ETH in Zurich). Fixed a bug in the matrix choice menu; now PAM matrices can be selected ok.
  3. Added secondary structure/gap penalty masks. These allow you to include, in an alignment, a position specific set of gap penalties. You can either set a gap opening penalty at each position or specify the secondary strcuture (if protein; alpha helix, beta strand or loop) and have gap penalties set automatically. This, basically, is used to make gaps harder to open inside helices or strands.
    These masks are only used in the "profile alignment" menu. They may be read in as part of an alignment in a special format (see the on-line help for details) or associated with each sequence, if the sequences are in Swiss Prot format and secondary structure information is given. All of the mask parameters can be set from the profile alignment menu. Basically, the mask is made up of a series of numbers between 1 and 9, one per position. The gap opening penalty at a position is calculated as the starting penalty multipleied by the mask value at that site.
  4. Added command line options /profile and /sequences. These allow uses to choose between normal profile alignment where the two profiles (pre-existing alignments specified in the files /profile1= and /profile2=) are merged/aligned with each other (/profile) and the case where the individual sequences in /profile2 are aligned sequentially with the alignment in /profile1 (/sequences).
  5. Fixed bug in modified Myers and Miller algorithm - gap penalty score was not always calculated properly for type 2 midpoints. This is the core alignment algorithm.
  6. Only allows one output file format to be selected from command line - ie. multiple output alignment files are not allowed.
  7. Fixed 'bad calls to ckfree' error during calculation of phylip distance matrix.
  8. Fixed command line options /gapopen /gapext /type=protein /negative.
  9. Allowed user to change command line separator on UNIX from '/' to '-'. This allows unix users to use the more conventinal '-' symbol for seperating command line options. "/" can then be used in unix file names on the command line. The symbol that is used, is specified in the file clustalw.h which must be edited if you wish to change it (and the program must then be recompiled). Find the block of code in clustalw.h that corrsponds to the operating system you are using. These blocks are started by one of the following:
    #ifdef VMS #elif MAC #elif MSDOS #elif UNIX
    On the next line after each is the line:
    #define COMMANDSEP '/'
    Change this in the appropriate block of code (e.g. the UNIX block) to
    #define COMMANDSEP '-'
    if you wish to use the "-" character as command seperator.


What's New (April 1995) in Version 1.5 (since version 1.3).

  1. alignment of new sequences to an alignment. Fixed a serious bug which assigned weights to the wrong sequences. Now also, weights sequences according to distance from the incoming sequence. The new weights are: tree weights * similarity to incoming sequence. The tree weights are the old weights that we derive from the tree connecting all the sequences in the existing alignment.
  2. for all platforms, output linelength = 60.
  3. Bootstrap files (*.phb): the "final" node (arbitrary trichotomy at the end of the neighbor-joining process) is labelled as TRICHOTOMY in the bootstrap output files. This is to help link bootstrap figures with nodes when you reroot the tree.
  4. Command line /bootstrap option now more robust.

INTRODUCTION

This document gives some BRIEF notes about usage of the Clustal W multiple alignment program for UNIX and VMS machines. Clustal W is a major update and rewrite of the Clustal V program which was described in:

Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) CLUSTAL V: improved software for multiple sequence alignment. Computer Applications in the Biosciences (CABIOS), 8(2):189-191.

The main new features are a greatly improved (more sensitive) multiple alignment procedure for proteins and improved support for different file formats. This software was described in:

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Research, 22(22):4673-4680.


The usage of Clustal W is largely the same as for Clustal V details of which are described in clustalv.doc. Details of the new alignment algorithms are described in the manuscript by Thompson et. al. above, an ascii/text version of which is included (clustalw.ms). This file lists some of the details not covered by either of the above documents.



INSTALLATION (deleted)



FILE INPUT (sequences to be aligned)

The sequences must all be in one file (or two files for a "profile alignment") in ONE of the following formats:

FASTA (Pearson), NBRF/PIR, EMBL/Swiss Prot, GDE, CLUSTAL, GCG/ MSF.

The program tries to "guess" which format is being used and whether the sequences are nucleic acid (DNA/RNA) or amino acid (proteins). The format is recognised by the first characters in the file. This is kind of stupid/crude but works most of the time and it is difficult to do reliably, any other way.



 Format           First non blank word or character in the file.
 ...............................................................
 FASTA            >
 NBRF             >P1;  or >D1;
 EMBL/SWISS       ID
 GDE protein      % 
 GDE nucleotide   # 
 CLUSTAL          CLUSTAL (blocked multiple alignments)
 GCG/MSF          PILEUP      "        "         "

Note, that the only way of spotting that a file is MSF format is if the word PILEUP appears at the very beginning of the file. If you produce this format from software other than the GCG pileup program, then you will have to insert the word PILEUP at the start of the file. Similarly, if you use clustal format, the word CLUSTAL must appear first.

All of these formats can be used to read in AN EXISTING FULL ALIGNMENT. With CLUSTAL format, this is just the same as the output format of this program and Clustal V. If you use PILEUP or CLUSTAL format, all sequences must be the same length, INCLUDING GAPS ("_" in clustal format; "." in MSF). With the other formats, sequences can be gapped with "-" charcters. If you read in any gaps these are kept during any later alignments. You can use this facility to read in an alignment in order to calculate a phylogenetic tree OR to output the same alignment in a different format (from the output format options menu of the multiple alignment menu) e.g. read in a GCG/ MSF format alignment and output a PHYLIP format alignment. This is also useful to read in one reference alignment and to add one or more new sequences to it using the "profile alignment" facilities.

DNA vs. PROTEIN: the program will count the number of A,C,G,T,U and N charcters. If 85% or more of the characters in a sequence are as above, then DNA/RNA is assumed, protein otherwise.

IMPORTANT : Additional note from Toby Gibson :

Clustal W NEVER DELETES gaps in the original alignment. This facility allows you to add new sequences to an alignment that already has gaps in. Then the old alignment stays in good quality (if it was good before....). Parameter "reset gaps between alignments" only deletes NEW gaps just added in an alignment run. This option is for use if you align the same sequences twice without leaving the program eg to try different gap penalties. In fact it is INCORRECT to do a PILEUP alignment first, although Clustal W can read and write these alignments for compatibility. It is better to use the GCG command "etopir @sequences.lis" where sequences.lis is a file of sequence entry names to get your sequences, this uses EGCG etopir command.


FILE OUTPUT

the alignments.

In the multiple alignment and profile alignment menus, there is a menu item to control the output format(s).

The alignment output format can be set to any (or all) of: CLUSTAL (a self explanatory blocked alignment) NBRF/PIR (same as input format but with "-" characters for gaps) MSF (the main GCG package multiple alignment format) PHYLIP (Joe Felsenstein's phylogeny inference package. Gaps are set to "-" characters. For some programs (e.g. PROTPARS/DNAPARS) these should be changed to "?" characters for unknown residues. GDE (Used by Steven Smith's GDE package)

You can also choose between having the sequences in the same order as in the input file or writing them out in an order that more closely matches the order used to carry out the multiple alignment.

The trees.

Believe it or not, we now use the New Hampshire (nested parentheses) format as default for our trees. This format is compatible with e.g. the PHYLIP package. If you want to view a tree, you can use the RETREE or DRAWGRAM/DRAWTREE programs of PHYLIP. This format is used for all our trees, even the initial guide trees for deciding the order of multiple alignment. The output trees from the phylogenetic tree menu can also be requested in our old verbose/cryptic format. This may be more useful if, for example, you wish to see the bootstrap figures. The bootstrap trees in the default New Hampshire format give the bootstrap figures as extra labels which can be viewed very easily using TREETOOL which is available as part of the GDE package. TREETOOL is available from the RDP project by ftp from rdp.life.uiuc.edu.

The New Hampshire format is only useful if you have software to display or manipulate the trees. The PHYLIP package is highly recommended if you intend to do much work with trees and includes programs for doing this. If you do not have such software, request the trees in the older clustal format and see the documentation for Clustal V (clustalv.doc). WE DO NOT PROVIDE ANY DIRECT MEANS FOR VIEWING TREES GRAPHICALLY.


THE ALIGNMENT ALGORITHMS

The basic algorithm is the same as for Clustal V and is described in some detail in clustalv.doc. The new modifications are described in detail in clustalw.ms. Here we just list some notes to help answer some of the most obvious questions.

Terminal Gaps

In the original Clustal V program, terminal gaps were penalised the same as all other gaps. This caused some ugly side effects e.g.


 acgtacgtacgtacgt                              acgtacgtacgtacgt
 a----cgtacgtacgt  gets the same score as      ----acgtacgtacgt
NOW, terminal gaps are free. This is better on average and stops silly effects like single residues jumping to the edge of the alignment. However, it is not perfect. It does mean that if there should be a gap near the end of the alignment, the program may be reluctant to insert it i.e.


 cccccgggccccc                                              cccccgggccccc
 ccccc---ccccc  may be considered worse (lower score) than  cccccccccc---
In the right hand case above, the terminal gap is free and may score higher than the laft hand alignment. This can be prevented by lowering the gap opening and extension penalties. It is difficult to get this right all the time. Please watch the ends of your alignments.

Speed of the initial (pairwise) alignments (fast approximate/slow accurate)

By default, the initial pairwise alignments are now carried out using a full dynamic programming algorithm. This is more accurate than the older hash/ k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a fast workstation you may not notice but on a slow box, the difference is extreme. You can set the alignment method from the menus easily to the older, faster method.

Delaying alignment of distant sequences

The user can set a cut off to delay the alignment of the most divergent sequences in a data set until all other sequences have been aligned. By default, this is set to 40% which means that if a sequence is less than 40% identical to any other sequence, its alignment will be delayed.

Iterative realignment/Reset gaps between alignments

By default, if you align a set of sequences a second time (e.g. with changed gap penalties), the gaps from the first alignment are discarded. You can set this from the menus so that older gaps will be kept between alignments, This can sometimes give better alignments by keeping the gaps (do not reset them) and doing the full multiple alignment a second time. Sometimes, the alignment will converge on a better solution; sometimes the new alignment will be the same as the first. There can be a strange side effect: you can get columns of nothing but gaps introduced.

Any gaps that are read in from the input file are always kept, regardless of the setting of this switch. If you read in a full multiple alignment, the "reset gaps" switch has no effect. The old gaps will remain and if you carry out a multiple alignment, any new gaps will be added in. If you wish to carry out a full new alignment of a set of sequences that are already aligned in a file you must input the sequences without gaps.

Profile alignment

By profile alignment, we simply mean the alignment of old alignments/sequences. In this context, a profile is just an existing alignment (or even a set of unaligned sequences; see below). This allows you to read in an old alignment (in any of the allowed input formats) and align one or more new sequences to it. From the profile alignment menu, you are allowed to read in 2 profiles. Either profile can be a full alignment OR a single sequence. In the simplest mode, you simply align the two profiles to each other. This is useful if you want to gradually build up a full multiple alignment.

A second option is to align the sequences from the second profile, one at a time to the first profile. This is done, taking the underlying tree between the sequences into account. This is useful if you have a set of new sequences (not aligned) and you wish to add them all to an older alignment.


CHANGES TO THE PHYLOGENTIC TREE CALCULATIONS AND SOME HINTS.

IMPROVED DISTANCE CALCULATIONS FOR PROTEIN TREES

The phylogenetic trees in Clustal W (the real trees that you calculate AFTER alignment; not the guide trees used to decide the branching order for multiple alignment) use the Neighbor-Joining method of Saitou and Nei based on a matrix of "distances" between all sequences. These distances can be corrected for "multiple hits". This is normal practice when accurate trees are needed. This correction stretches distances (especially large ones) to try to correct for the fact that OBSERVED distances (mean number of differences per site) greatly underestimate the actual number that happened during evolution.

In Clustal V we used a simple formula to convert an observed distance to one that is corrected for multiple hits. The observed distance is the mean number of differences per site in an alignment (ignoring sites with a gap) and is therefore always between 0.0 (for ientical sequences) an 1.0 (no residues the same at any site). These distances can be multiplied by 100 to give percent difference values. 100 minus percent difference gives percent identity. The formula we use to correct for multiple hits is from Motoo Kimura (Kimura, M. The neutral Theory of Molecular Evolution, Camb.Univ.Press, 1983, page 75) and is:


  K = -Ln(1 - D - (D.D)/5)  where D is the observed distance and K is       
                          corrected distance.
This formula gives mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1 i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been 2.5 substitutions per site over the course of evolution since the 2 sequences diverged. This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues). The PAM scale of evolution and its derivation/calculation comes from the work of Margaret Dayhoff and co workers (the famous Dayhoff PAM series of weight matrices also came from this work). Dayhoff et al constructed an elaborate model of protein evolution based on observed frequencies of substitution between very closely related proteins. Using this model, they derived a table relating observed distances to predicted PAM distances. Kimura's formula, above, is just a "curve fitting" approximation to this table. It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly unaccurate at high D (>0.75) and fails completely at around D = 0.85.

To circumvent this problem, we calculated all the values for K corresponding to D above 0.75 directly using the Dayhoff model and store these in an internal table, used by Clustal W. This table is declared in the file dayhoff.h and gives values of K for all D between 0.75 and 0.93 in intervals of 0.001 i.e. for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but with real sequences, distances of 0.93 (less than 7% identity) are rare. If your data set includes sequences with this degree of divergence, you will have great difficulty getting accurate trees by ANY method; the alignment itself will be very difficult (to construct and to evaluate).

There are some important things to note. Firstly, this formula works well if your sequences are of average amino acid composition and if the amino acids substitute according to the original Dayhoff model. In other cases, it may be misleading. Secondly, it is based only on observed percent distance i.e. it does not DIRECTLY take conservative substitutions into account. Thirdly, the error on the estimated PAM distances may be VERY great for high distances; at very high distance (e.g. over 85%) it may give largely arbitrary corrected distances. In most cases, however, the correction is still worth using; the trees will be more accurate and the branch lengths will be more realistic.

A far more sophisticated distance correction based on a full Dayhoff model which DOES take conservative substitutions and actual amino acid composition into account, may be found in the PROTDIST program of the PHYLIP package. For serious tree makers, this program is highly recommended.

TWO NOTES ON BOOTSTRAPPING...

When you use the BOOTSTRAP in Clustal W to estimate the reliability of parts of a tree, many of the uncorrected distances may randomly exceed the arbitrary cut off of 0.93 (sequences only 7% identical) if the sequences are distantly related. This will happen randomly i.e. even if none of the pairs of sequences are less than 7% identical, the bootstrap samples may contain pairs of sequences that do exceed this cut off. If this happens, you will be warned. In practice, this can happen with many data sets. It is not a serious problem if it happens rarely. If it does happen (you are warned when it happens and told how often the problem occurs), you should consider removing the most distantly related sequences and/or using the PHYLIP package instead.


A further problem arises in almost exactly the opposite situation: when you bootstrap a data set which contains 3 or more sequences that are identical or almost identical. Here, the sets of identical sequences should be shown as a multifurcation (several sequences joing at the same part of the tree). Because the Neighbor-Joining method only gives strictly dichotomous trees (never more than 2 sequences join at one time), this cannot be exactly represented. In practice, this is NOT a problem as there will be some internal branches of zero length seperating the sequences. If you display the tree with all branch lengths, you will still see a multifurcation. However, when you bootstrap the tree, only the branching orders are stored and counted. In the case of multifurcations, the exact branching order is arbitrary but the program will always get the same branching order, depending only on the input order of the sequences. In practice, this is only a problem in situations where you have a set of sequences where all of them are VERY similar. In this case, you can find very high support for some groupings which will disappear if you run the analysis with a different input order. Again, the PHYLIP package deals with this by offering a JUMBLE option to shuffle the input order of your sequences between each bootstrap sample.


ay score higher than the laft hand alignment. This can be prevented by lowering the gap opening and extension penalties. It is difficult to get this right all the time. Please watch the ends of your alignments.

Speed of the initial (pairwise) alignments (fast approximate/slow accurate)ClustalW1.7/clustalw-article.html010064400063530000012000001270400636737122100202470ustar00pingouinstaff00000400000036Clustal W : Nucleic Acids Research, 22(22):4673-4680
This is just an ASCII text version of the manuscript describing
Clustal W, without the figures.  It was published:

Nucleic Acids Research, 22(22):4673-4680.



CLUSTAL W: improving the sensitivity of progressive multiple 
sequence alignment through sequence weighting, position specific 
gap penalties and weight matrix choice.



Julie D. Thompson, Desmond G. Higgins1 and Toby J. Gibson*

European Molecular Biology Laboratory
Postfach 102209
Meyerhofstrasse 1
D-69012 Heidelberg
Germany


Phone:		+49-6221-387398
Fax:		+49-6221-387306
E-mail:		Gibson@EMBL-Heidelberg.DE
		Des.Higgins@EBI.AC.UK
		Thompson@EMBL-Heidelberg.DE


Keywords:	Multiple alignment, phylogenetic tree, weight matrix, gap
		penalty, dynamic programming, sequence weighting.


1 Current address: 
European Bioinformatics Institute
Hinxton Hall
Hinxton
Cambridge CB10 1RQ
UK.

* To whom correspondence should be addressed


ABSTRACT

The sensitivity of the commonly used progressive multiple sequence 
alignment method has been greatly improved for the alignment of divergent 
protein sequences.   Firstly, individual weights are assigned to each sequence 
in a partial alignment in order to downweight near-duplicate sequences and 
upweight the most divergent ones.   Secondly, amino acid substitution 
matrices are varied at different alignment stages according to the divergence 
of the sequences to be aligned.    Thirdly, residue specific gap penalties and 
locally reduced gap penalties in hydrophilic regions encourage new gaps in 
potential loop regions rather than regular secondary structure.   Fourthly, 
positions in early alignments where gaps have been opened receive locally 
reduced gap penalties to encourage the opening up of new gaps at these 
positions.  These modifications are incorporated into a new program, 
CLUSTAL W which is freely available.  


INTRODUCTION

The simultaneous alignment of many nucleotide or amino acid sequences is 
now an essential tool in molecular biology.  Multiple alignments are used to 
find diagnostic patterns to characterise protein families; to detect or 
demonstrate homology between new sequences and existing families of 
sequences; to help predict the secondary and tertiary structures of new 
sequences; to suggest oligonucleotide primers for PCR; as an essential prelude 
to molecular evolutionary analysis.   The rate of appearance of new sequence 
data is steadily increasing and the development of efficient and accurate 
automatic methods for multiple alignment is, therefore, of major 
importance.   The majority of automatic multiple alignments are now carried 
out using the "progressive" approach of Feng and Doolittle (1).   In this paper, 
we describe a number of improvements to the progressive multiple 
alignment method which greatly improve the sensitivity without sacrificing 
any of the speed and efficiency which makes this approach so practical.  The 
new methods are made available in a program called CLUSTAL W which is 
freely available and portable to a wide variety of computers and operating 
systems.

In order to align just two sequences, it is standard practice to use dynamic 
programming (2).  This guarantees a mathematically optimal alignment, 
given a table of scores for matches and mismatches between all amino acids 
or nucleotides (e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)) and 
penalties for insertions or deletions of different lengths.   Attempts at 
generalising dynamic programming to multiple alignments are limited to 
small numbers of short sequences (5).  For much more than eight or so 
proteins of average length, the problem is uncomputable given current 
computer power.  Therefore, all of the methods capable of handling larger 
problems in practical timescales, make use of heuristics.    Currently, the most 
widely used approach is to exploit the fact that homologous sequences are 
evolutionarily related.  One can build up a multiple alignment progressively 
by a series of pairwise alignments, following the branching order in a 
phylogenetic tree (1).  One first aligns the most closely related sequences, 
gradually adding in the more distant ones.   This approach is sufficiently fast 
to allow alignments of virtually any size.   Further, in simple cases, the 
quality of the alignments is excellent, as judged by the ability to correctly align 
corresponding domains from sequences of known secondary or tertiary 
structure (6).  In more difficult cases, the alignments give good starting points 
for further automatic or manual refinement.

This approach works well when the data set consists of sequences of different 
degrees of divergence.   Pairwise alignment of very closely related sequences 
can be carried out very accurately.   The correct answer may often be obtained 
using a wide range of parameter values (gap penalties and weight matrix).  By 
the time the most distantly related sequences are aligned, one already has a 
sample of aligned sequences which gives important information about the 
variability at each position.   The positions of the gaps that were introduced 
during the early alignments of the closely related sequences are not changed 
as new sequences are added.   This is justified because the placement of gaps 
in alignments between closely related sequences is much more accurate than 
between distantly related ones.   When all of the sequences are highly 
divergent (e.g. less than approximately 25-30% identity between any pair of 
sequences), this progressive approach becomes much less reliable.

There are two major problems with the progressive approach:  the local 
minimum problem and the choice of alignment parameters.   The local 
minimum problem stems from the "greedy" nature of the alignment strategy.  
The algorithm greedily adds sequences together, following the initial tree.  
There is no guarantee that the global optimal solution, as defined by some 
overall measure of multiple alignment quality (7,8), or anything close to it, 
will be found.   More specifically, any mistakes (misaligned regions) made 
early in the alignment process cannot be corrected later as new information 
from other sequences is added.   This problem is frequently thought of as 
mainly resulting from an incorrect branching order in the initial tree.  The 
initial trees are derived from a matrix of distances between separately aligned 
pairs of sequences and are much less reliable than trees from complete 
multiple alignments.   In our experience, however, the real problem is caused 
simply by errors in the initial alignments.  Even if the topology of the guide 
tree is correct, each alignment step in the multiple alignment process may 
have some percentage of the residues misaligned.   This percentage will be 
very low on average for very closely related sequences but will increase as 
sequences diverge.   It is these misalignments which carry through from the 
early alignment steps that cause the local minimum problem.   The only way 
to correct this is to use an iterative or stochastic sampling procedure (e.g. 
7,9,10).   We do not directly address this problem in this paper.

The alignment parameter choice problem is, in our view, at least as serious as 
the local minimum problem.   Stochastic or iterative algorithms will be just 
as badly affected as progressive ones if the parameters are inappropriate: they 
will arrive at a false global minimum.  Traditionally, one chooses one weight 
matrix and two gap penalties (one for opening a new gap and one for 
extending an existing gap) and hope that these will work well over all parts of 
all the sequences in the data set.   When the sequences are all closely related, 
this works.  The first reason is that virtually all residue weight matrices give 
most weight to identities.   When identities dominate an alignment, almost 
any weight matrix will find approximately the correct solution.   With very 
divergent sequences, however, the scores given to non-identical residues will 
become critically important; there will be more mismatches than identities.   
Different weight matrices will be optimal at different evolutionary distances 
or for different classes of proteins.  

The second reason is that the range of gap penalty values that will find the 
correct or best possible solution can be very broad for highly similar sequences 
(11).   As more and more divergent sequences are used, however, the exact 
values of the gap penalties become important for success.   In each case, there 
may be a very narrow range of values which will deliver the best alignment.  
Further, in protein alignments, gaps do not occur randomly (i.e. with equal 
probability at all positions).  They occur far more often between the major 
secondary structural elements of alpha helices and beta strands than within 
(12).

The major improvements described in this paper attempt to address the 
alignment parameter choice problem.   We dynamically vary the gap 
penalties in a position and residue specific manner. The observed relative 
frequencies of gaps adjacent to each of the 20 amino acids (12) are used to 
locally adjust the gap opening penalty after each residue.   Short stretches of 
hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil 
regions and the gap opening penalties are locally reduced in these stretches.   
In addition, the locations of the gaps found in the early alignments are also 
given reduced gap opening penalties.  It has been observed in alignments 
between sequences of known structure that gaps tend not to be closer than 
roughly eight residues on average (12).   We increase the gap opening penalty 
within eight residues of exising gaps.   The two main series of amino acid 
weight matrices that are used today are the PAM series (3) and the BLOSUM 
series (4).   In each case, there is a range of matrices to choose from.  Some 
matrices are appropriate for aligning very closely related sequences where 
most weight by far is given to identities, with only the most frequent 
conservative substitutions receiving high scores.  Other matrices work better 
at greater evolutionary distances where less importance is attached to 
identities (13).  We choose different weight matrices, as the alignment 
proceeds, depending on the estimated divergence of the sequences to be 
aligned at each stage.  

Sequences are weighted to correct for unequal sampling across all 
evolutionary distances in the data set (14).   This downweights sequences that 
are very similar to other sequences in the data set and upweights the most 
divergent ones.  The weights are calculated directly from the branch lengths 
in the initial guide tree (15).   Sequence weighting has already been shown to 
be effective in improving the sensitivity of profile searches (15,16).  In the 
original CLUSTAL programs (17-19), the initial guide trees, used to guide the 
multiple alignment, were calculated using the UPGMA method (20).  We 
now use the Neighbour-Joining method (21) which is more robust against the 
effects of unequal evolutionary rates in different lineages and which gives 
better estimates of individual branch lengths.  This is useful because it is these 
branch lengths which are used to derive the sequence weights.  We also allow 
users to choose between fast approximate alignments (22) or full dynamic 
programming for the distance calculations used to make the guide tree. 

The new improvements dramatically improve the sensitivity of the 
progressive alignment method for difficult alignments involving highly 
diverged sequences.  We show one very demanding test case of over 60 SH3 
domains (23) which includes sequence pairs with as little as 12% identity and 
where there is only one exactly conserved residue across all of the sequences.   
Using default parameters, we can achieve an alignment that is almost exactly 
correct, according to available structural information (24).   Using the program 
in a wide variety of situations, we find that it will normally find the correct 
alignment, in all but the most difficult and pathological of cases.  


MATERIAL AND METHODS


The basic alignment method

The basic multiple alignment algorithm consists of three main stages: 1) all 
pairs of sequences are aligned separately in order to calculate a distance matrix 
giving the divergence of each pair of sequences; 2) a guide tree is calculated 
from the distance matrix; 3) the sequences are progressively aligned according 
to the branching order in the guide tree.   An example using 7 globin 
sequences of known tertiary structure (25) is given in figure 1.


1) The distance matrix/pairwise alignments

In the original CLUSTAL programs, the pairwise distances were calculated 
using a fast approximate method (22).   This allows very large numbers of 
sequences to be aligned, even on a microcomputer.   The scores are calculated 
as the number of k-tuple matches (runs of identical residues, typically 1 or 2 
long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment 
between two sequences minus a fixed penalty for every gap.   We now offer a 
choice between this method and the slower but more accurate scores from full 
dynamic programming alignments using two gap penalties (for opening or 
extending gaps) and a full amino acid weight matrix.   These scores are 
calculated as the number of identities in the best alignment divided by the 
number of residues compared (gap positions are excluded).   Both of these 
scores are initially calculated as percent identity scores and are converted to 
distances by dividing by 100 and subtracting from 1.0 to give number of 
differences per site.   We do not correct for multiple substitutions in these 
initial distances.   In figure 1 we give the 7x7 distance matrix between the 7 
globin sequences calculated using the full dynamic programming method.


2) The guide tree

The trees used to guide the final multiple alignment process are calculated 
from the distance matrix of step 1 using the Neighbour-Joining method (21).   
This produces unrooted trees with branch lengths proportional to estimated 
divergence along each branch.   The root is placed by a "mid-point" method 
(15) at a position where the means of the branch lengths on either side of the 
root are equal.   These trees are also used to derive a weight for each sequence 
(15).   The weights are dependent upon the distance from the root of the tree 
but sequences which have a common branch with other sequences share the 
weight derived from the shared branch.   In the example in figure 1, the 
leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the 
length of the branch from the root to it.  The Human beta globin 
(Hbb_Human) gets a weight consisting of the length of the branch leading to 
it that is not shared with any other sequences (0.081) plus half the length of 
the branch shared with the horse beta globin (0.226/2) plus one quarter the 
length of the branch shared by all four haemoglobins (0.061/4) plus one fifth 
the branch shared between the haemoglobins and the myoglobin (0.015/5) 
plus one sixth the branch leading to all the vertebrate globins (0.062).  This 
sums to a total of 0.221.  By contrast, in the normal progressive alignment 
algorithm, all sequences would be equally weighted.  The rooted tree with 
branch lengths and sequence weights for the 7 globins is given in figure 1.  


3) Progressive alignment

The basic procedure at this stage is to use a series of pairwise alignments to 
align larger and larger groups of sequences, following the branching order in 
the guide tree.   You proceed from the tips of the rooted tree towards the root.   
In the globin example in figure 1 you align the sequences in the following 
order: human vs. horse beta globin; human vs. horse alpha globin; the 2 
alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the 
cyanohaemoglobin vs the haemoglobins plus myoglobin; the leghaemoglobin 
vs. all the rest.  At each stage a full dynamic programming (26,27) algorithm is 
used with a residue weight matrix and penalties for opening and extending 
gaps.   Each step consists of aligning two existing alignments or sequences.  
Gaps that are present in older alignments remain fixed.  In the basic 
algorithm, new gaps that are introduced at each stage get full gap opening and 
extension penalties, even if they are introduced inside old gap positions (see 
the section on gap penalties below for modifications to this rule).  In order to 
calculate the score between a position from one sequence or alignment and 
one from another, the average of all the pairwise weight matrix scores from 
the amino acids in the two sets of sequences is used i.e. if you align 2 
alignments with 2 and 4 sequences respectively, the score at each position is 
the average of 8 (2x4) comparisons.   This is illustrated in figure 2.  If either set 
of sequences contains one or more gaps in one of the positions being 
considered, each gap versus a residue is scored as zero.   The default amino 
acid weight matrices we use are rescored to have only positive values. 
Therefore, this treatment of gaps treats the score of a residue versus a gap as 
having the worst possible score.  When sequences are weighted (see 
improvements to progressive alignment, below), each weight matrix value is 
multiplied by the weights from the 2 sequences, as illustrated in figure 2.


Improvements to progressive alignment

All of the remaining modifications apply only to the final progressive 
alignment stage.   Sequence weighting is relatively straightforward and is 
already widely used in profile searches (15,16).   The treatment of gap penalties 
is more complicated.   Initial gap penalties are calculated depending on the 
weight matrix, the similarity of the sequences, and the length of the 
sequences. Then, an attempt is made to derive sensible local gap opening 
penalties at every position in each pre-aligned group of sequences that will 
vary as new sequences are added.   The use of different weight matrices as the 
alignment progresses is novel and largely by-passes the problem of initial 
choice of weight matrix.   The final modification allows us to delay the 
addition of very divergent sequences until the end of the alignment process 
when all of the more closely related sequences have already been aligned.


Sequence weighting

Sequence weights are calculated directly from the guide tree.    The weights 
are normalised such that the biggest one is set to 1.0 and the rest are all less 
than one.  Groups of closely related sequences receive lowered weights 
because they contain much duplicated information.  Highly divergent 
sequences without any close relatives receive high weights.  These weights 
are used as simple multiplication factors for scoring positions from different 
sequences or prealigned groups of sequences.  The method is illustrated in 
figure 2.  In the globin example in figure 1, the two alpha globins get 
downweighted because they are almost duplicate sequences (as do the two 
beta globins); they receive a combined weight of only slightly more than if a 
single alpha globin was used.   


Initial gap penalties

Initially, two gap penalties are used: a gap opening penalty (GOP) which gives 
the cost of opening a new gap of any length and a gap extension penalty (GEP) 
which gives the cost of every item in a gap.  Initial values can be set by the 
user from a menu.   The software then automatically attempts to choose 
appropriate gap penalties for each sequence alignment, depending on the 
following factors.

1) Dependence on the weight matrix

It has been shown (16,28) that varying the gap penalties used with different 
weight matrices can improve the accuracy of sequence alignments. Here, we 
use the average score for two mismatched residues (ie. off-diagonal values in 
the matrix) as a scaling factor for the GOP.

2) Dependence on the similarity of the sequences

The percent identity of the two (groups of) sequences to be aligned is used to 
increase the GOP for closely related sequences and decrease it for more 
divergent sequences on a linear scale.

3) Dependence on the lengths of the sequences   

The scores for both true and false sequence alignments grow with the length 
of the sequences. We use the logarithm of the length of the shorter sequence 
to increase the GOP with sequence length.

Using these three modifications, the initial GOP calculated by the program is:

GOP->(GOP+log(MIN(N,M))) * (average residue mismatch score) *
                                                               (percent identity scaling factor)
where N, M are the lengths of the two sequences.

4) Dependence on the difference in the lengths of the sequences

The GEP is modified depending on the difference between the lengths of the 
two sequences to be aligned. If one sequence is much shorter than the other, 
the GEP is increased to inhibit too many long gaps in the shorter sequence.
The initial GEP calculated by the program is:

GEP ->  GEP*(1.0+|log(N/M)|) 
where N, M are the lengths of the two sequences.


Position-specific gap penalties

 In most dynamic programming applications, the initial gap opening and 
extension penalties are applied equally at every position in the sequence, 
regardless of the location of a gap, except for terminal gaps which are usually 
allowed at no cost.   In CLUSTAL W, before any pair of sequences or 
prealigned groups of sequences are aligned, we generate a table of gap opening 
penalties for every position in the two (sets of) sequences.  An example is 
shown in figure 3.  We manipulate the initial gap opening penalty in a 
position specific manner, in order to make gaps more or less likely at different 
positions.   

The local gap penalty modification rules are applied in a hierarchical manner.   
The exact details of each rule are given below.  Firstly, if there is a gap at a 
position, the gap opening and gap extension penalties are lowered; the other 
rules do not apply.   This makes gaps more likely at positions where there are 
already gaps.  If there is no gap at a position, then the gap opening penalty is 
increased if the position is within 8 residues of an existing gap.   This 
discourages gaps that are too close together.  Finally, at any position within a 
run of hydrophilic residues, the penalty is decreased.  These runs usually 
indicate loop regions in protein structures.  If there is no run of hydrophilic 
residues, the penalty is modified using a table of residue specific gap 
propensities (12).   These propensities were derived by counting the frequency 
of each residue at either end of gaps in alignments of proteins of known 
structure.  An illustration of the application of these rules from one part of 
the globin example, in figure 1, is given in figure 3.  

1) Lowered gap penalties at existing gaps

If there are already gaps at a position, then the GOP is reduced in proportion 
to the number of sequences with a gap at this position and the GEP is lowered 
by a half.  The new gap opening penalty is calculated as:

GOP ->  GOP*0.3*(no. of sequences without a gap/no. of sequences).

2) Increased gap penalties near existing gaps

If a position does not have any gaps but is within 8 residues of an existing gap, 
the GOP is increased by:

GOP ->  GOP*(2+((8-distance from gap)*2)/8)

3) Reduced gap penalties in hydrophilic stretches

Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch.  
The residues that are to be considered hydrophilic may be set by the user but 
are conservatively set to D, E, G, K, N, Q, P, R or S by default.   If, at any 
position, there are no gaps and any of the sequences has such a stretch, the 
GOP is reduced by one third.


4) Residue specific penalties

If there is no hydrophilic stretch and the position does not contain any gaps, 
then the GOP is multiplied by one of the 20 numbers in table 1, depending on 
the residue.  If there is a mixture of residues at a position, the multiplication 
factor is the average of all the contributions from each sequence.  


Weight matrices

Two main series of weight matrices are offered to the user: the Dayhoff PAM 
series (3) and the BLOSUM series (4).   The default is the BLOSUM series.  In 
each case, there is a choice of matrix ranging from strict ones, useful for 
comparing very closely related sequences to very "soft" ones that are useful 
for comparing very distantly related sequences.   Depending on the distance 
between the two sequences or groups of sequences to be compared, we switch 
between 4 different matrices.  The distances are measured directly from the 
guide tree.  The ranges of distances and tables used with the PAM series of 
matrices is: 80-100%:PAM20, 60-80%:PAM60, 40-60%:PAM120, 0-40%:PAM350. 
The range used with the BLOSUM series is:80-100%:BLOSUM80,
60-80%:BLOSUM62, 30-60%:BLOSUM45, 0-30%:BLOSUM30.


Divergent sequences

The most divergent sequences (most different, on average from all of the 
other sequences) are usually the most difficult to align correctly.  It is 
sometimes better to delay the incorporation of these sequences until all of the 
more easily aligned sequences are merged first.  This may give a better chance 
of correctly placing the gaps and matching weakly conserved positions against 
the rest of the sequences.   A choice is offered to set a cut off (default is 40% 
identity or less with any other sequence) that will delay the alignment of the 
divergent sequences until all of the rest have been aligned.  


Software and Algorithms


Dynamic Programming

The most demanding part of the multiple alignment strategy, in terms of 
computer processing and memory usage, is the alignment of two (groups of) 
sequences at each step in the final progressive alignment.   To make it 
possible to align very long sequences (e.g. dynein heavy chains at ~ 5,000 
residues) in a reasonable amount of memory, we use the memory efficient 
dynamic programming algorithm of Myers and Miller (26).   This sacrifices 
some processing time but makes very large alignments practical in very little 
memory.   One disadvantage of this algorithm is that it does not allow 
different gap opening and extension penalties at each position.  We have 
modified the algorithm so as to allow this and the details are described in a 
separate paper (27).   



Menus/file formats

Six different sequence input formats are detected automatically and read by 
the program:  EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF 
(30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL 
format alignments.   The last three formats allow users to read in complete 
alignments (e.g. for calculating phylogenetic trees or for addition of new 
sequences to an existing alignment).   Alignment output may be requested in 
standard CLUSTAL format (self-explanatory blocked alignments) or in 
formats compatible with the GDE, PHYLIP (31) or GCG (30) packages.   The 
program offers the user the ability to calculate Neighbour-Joining 
phylogenetic trees from existing alignments with options to correct for 
multiple hits (32,33) and to estimate confidence levels using a bootstrap 
resampling procedure (34).   The trees may be output in the "New 
Hampshire" format that is compatible with the PHYLIP package (31).

Alignment to an alignment

Profile alignment is used to align two existing alignments (either of which 
may consist of just one sequence) or to add a series of new sequences to an 
existing alignment.   This is useful because one may wish to build up a 
multiple alignment gradually, choosing different parameters manually, or 
correcting intermediate errors as the alignment proceeds.   Often, just a few 
sequences cause misalignments in the progressive algorithm and these can be 
removed from the process and then added at the end by profile alignment.  A 
second use is where one has a high quality reference alignment and wishes to 
keep it fixed while adding new sequences automatically.  


Portability/Availability

The full source code of the package is provided free to academic users.   The 
program will run on any machine with a full ANSI conforming C compiler.  
It has been tested on the following hardware/software combinations:  
Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX.   The source 
code and documentation are available by E-mail from the EMBL file server 
(send the words HELP and HELP SOFTWARE on two lines to the internet 
address: 
Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from 
FTP.EMBL-Heidelberg.DE.  Queries may be addressed by E-mail to 
Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE.


RESULTS AND DISCUSSION


Alignment of SH3 Domains

The ~60 residue SH3 domain was chosen to illustrate the performance of 
CLUSTAL W, as there is a reference manual alignment (23) and the fold is 
known (24).  SH3 domains, with a minimum similarity below 12% identity, 
are poorly aligned by progressive alignment programs such as CLUSTAL V 
and PILEUP: neither program can generate the correct blocks corresponding to 
the secondary structure elements. 

Figure 4 shows an alignment generated by CLUSTAL W of the example set of 
SH3 domains. The alignment was generated in two steps. After progressive 
alignment, five blocks were produced, corresponding to structural elements, 
with gaps inserted exclusively in the known loop regions. The beta strands in 
blocks 1, 4 and 5 were all correctly superposed. However, four sequences in 
block 2 and one sequence in block 3 were misaligned by 1-2 residues 
(underlined in figure 4). A second progressive alignment of the aligned 
sequences, including the gaps, improved this alignment: A single misaligned 
sequence, H_P55, remains in block 2 (boxed in figure 4), while block 3 is now 
completely aligned.  This alignment corrects several errors (eg. P85A, P85B 
and FUS1) in the manual alignment (23).

The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly, 
in a practical application involving divergent sequences, the initial 
progressive alignment is likely to be a good but not perfect approximation to 
the correct alignment. The alignment quality can be improved in a number of 
ways. If the block structure of the alignment appears to be correct, realignment 
of the alignment will usually improve most of the misaligned blocks: the 
existing gaps allow the blocks to "float" cheaply to a locally optimal position 
without disturbing the rest of the alignment. Remaining sequences which are 
doubtfully aligned can then be individually tested by profile alignment to the 
remainder: the misaligned H_P55 SH3 domain can be correctly aligned by 
profile (with GOP <= 8). The indel regions in the final alignment can then be 
manually cleaned up: Usually the exact alignment in the loop regions is not 
determinable, and may have no meaning in structural terms. It is then 
desirable to have a single gap per structural loop. CLUSTAL W achieved this 
for two of the four SH3 loop regions (figure 4).

If the block structure of the alignment appears suspect, greater intervention by 
the user may be required. The most divergent sequences, especially if they 
have large insertions (which can be discerned with the aid of dot matrix 
plots), should be left out of the progressive alignment. If there are sets of 
closely related sequences that are deeply diverged from other sets, these can be 
separately aligned and then merged by profile alignment. Incorrectly 
determined sequences, containing frameshifts, can also confound regions of 
an alignment: these can be hard to detect but sometimes they have been 
grouped within the excluded divergent sequences: then they may be revealed 
when they are individually compared to the alignment as having apparently 
nonsense segments with respect to the other sequences. 



Finding the best alignment

In cases where all of the sequences in a data set are very similar (e.g. no pair 
less than 35% identical), CLUSTAL W will find an alignment which is 
difficult to improve by eye.  In this sense, the alignment is optimal with 
regard to the alternative of manual alignment.  Mathematically, this is vague 
and can only be put on a more systematic footing by finding an objective 
function (a measure of multiple alignment quality) that exactly mirrors the 
information used by an "expert" to evaluate an alignment.  Nonetheless, if an 
alignment is impossible to improve by eye, then the program has achieved a 
very useful result.   

In more difficult cases, as more divergent sequences are included, it becomes 
increasingly difficult to find good alignments and to evaluate them.    What 
we find with CLUSTAL W is that the basic block-like structure of the 
alignment (corresponding to the major secondary structure elements) is 
usually recovered, with some of the most divergent sequences misaligned in 
small regions.  This is a very useful starting point for manual refinement as it 
helps define the major blocks of similarity.   The problem sequences can be 
removed from the analysis and realigned to the rest of the sequences 
automatically or with different parameter settings.   An examination of the 
tree used to guide the alignment will usually show which sequences will be 
most unreliably placed (those that branch off closest to the root and/or those 
that align to other single sequences at a very low level of sequence identity 
rather than align to a group of pre-aligned sequences).  Finally, one can 
simply iterate the multiple alignment process by feeding an output alignment 
back into CLUSTAL W and repeating the multiple alignment process (using 
the same or different parameters).   The SH3 domain alignment in figure 4 
was derived in this way by 2 passes using default parameters.  In the second 
pass, the local gap penalties are dominated by the placement of the initial 
major gap positions.  The alignment will either remain unchanged or will 
converge rapidly (after 1 or 2 extra passes) on a better solution.  If the 
placement of the initial gaps is approximately correct but some of the 
sequences are locally misaligned, this works well.  


Comparison with other methods

Recently, several papers have addressed the problem of position specific 
parameters for multiple alignment.  In one case (35), local gap penalties are 
increased in alpha helical and beta strand regions, when the 3-D structures of 
one or more of the sequences are known.  In a second case (36), a hidden 
Markov model was used to estimate position specific gap penalties and 
residue substitution weight matrices when large numbers of examples of a 
protein domain were known.  With CLUSTAL W, we attempt to derive the 
same information purely from the set of sequences to be aligned.  Therefore, 
we can apply the method to any set of sequences.  The success of this approach 
will depend on the number of available sequences and their evolutionary 
relationships.  It will also depend on the decision making process during 
multiple alignment (e.g. when to change weight matrix) and the accuracy and 
appropriateness of our parameterisation.  In the long term, this can only be 
evaluated by exhaustive testing of sets of sequences where the correct 
alignment (or parts of it) are known from structural information.   What is 
clear, however, is that the modifications described here significantly improve 
the sensitivity of the progressive multiple alignment approach.  This is 
achieved with almost no sacrifice in speed and efficiency.  

There are several areas where further improvements in sensitivity and 
accuracy can be made.  Firstly, the residue weight matrices and gap settings 
can be made more accurate as more and more data accumulate, while 
matrices for specific sequence types can be derived (e.g. for transmembrane 
regions (37)).  Secondly, stochastic or iterative optimisation methods can be 
used to refine initial alignments (7,9,10).   CLUSTAL W could be run with 
several sets of starting parameters and in each case, the alignments refined 
according to an objective function.   The search for a good objective function, 
that takes into account the sequence and position specific information used in 
CLUSTAL W is a key area of research.   Finally, the average number of 
examples of each protein domain or family is growing steadily.  It is not only 
important that programs can cope with the large volumes of data that are 
being generated, they should be able to exploit the new information to make 
the alignments more and more accurate.   Globally optimal alignments 
(according to an objective function) may not always be possible but the 
problem may be avoided if sufficiently large volumes of data become 
available.  CLUSTAL W is a step in this direction.

ACKNOWLEDGEMENTS

Numerous people have offered advice and suggestions for improvements to 
earlier versions of the CLUSTAL programs.  D.H. wishes to apologise to all of 
the irate CLUSTAL V users who had to live with the bugs and lack of facilities 
for getting trees in the New Hampshire format.  We wish to specifically thank 
Jeroen Coppieters who suggested using a series of weight matrices and Steven 
Henikoff for advice on using the BLOSUM matrices.  We are grateful to Rein 
Aasland, Peer Bork, Ariel Blocker and BŽrtrand Seraphin for providing 
challenging alignment problems.   T.G. and J.T. thank Kevin Leonard for 
support and encouragement.  Finally, we thank all of the people who were 
involved with various CLUSTAL programs over the years, namely: Paul 
Sharp, Rainer Fuchs and Alan Bleasby.


REFERENCES

 1.Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360.
 2.Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453.
 3.Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978)  in Atlas of Protein 
Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp 345-352, 
NBRF, Washington.
 4.Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915-
10919.
 5.Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci. 
USA 86, 4412-4415.
 6.Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337.
 7.Gotoh, O. (1993). CABIOS 9, 361-370.
 8.Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309.
 9.Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucl. Acids Res. 20, 
2511-2516.
10.Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and 
Wooton, J.C. (1993). Science, 262, 208-214.
11.Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12.
12.Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471.
13.Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein 
sequence analysis a practical approach, Bishop, M.J. and Rawlings, C.J. ed., 
chapter 13, pp. 323-358.
14.Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA, 90, 8777-
8781.
15.Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS, 10, 19-29.
16.LŸthy, R., Xenarios, I. and Bucher, P. (1994). Protein Science, 3, 139-146.
17.Higgins, D.G. and Sharp, P.M. (1988). Gene, 73, 237-244.
18.Higgins, D.G. and Sharp, P.M. (1989). CABIOS, 5, 151-153.
19.Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS, 8, 189-191.
20.Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H. 
Freeman, San Francisco.
21.Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425.
22.Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA, 80, 726-
730.
23.Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett. 
307, 55-61.
24.Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992). 
Nature, 359, 851-855.
25.Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216.
26.Myers, E.W. and Miller, W. (1988). CABIOS, 4, 11-17.
27.Thompson, J.D. (1994). CABIOS, (Submitted).
28.Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46.
29.Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA. 85, 2444-
2448.
30.Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12, 
387-395.
31.Felsenstein, J. (1989). Cladistics 5, 164-166.
32.Kimura, M. (1980). J. Mol. Evol. 16, 111-120.
33.Kimura, M. (1983). The Neutral Theory of Molecular Evolution.  
Cambridge University Press, Cambridge.
34.Felsenstein, J. (1985). Evolution 39, 783-791.
35.Smith, R.F. and Smith, T.F. (1992) Protein Engineering 5, 35-41.
36.Krogh, A., Brown, M., Mian, S., Sjšlander, K. and Haussler, D. (1994) J. Mol. 
Biol. 235-1501-1531.
37.Jones, D.T., Taylor, W.R. and Thornton, J.M.  (1994). FEBS Lett. 339, 269-275.
38.Bairoch, A. and Bšckmann, B. (1992) Nucleic Acids Res., 20, 2019-2022.
39.Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and 
Wierenga, R.K. (1993) EMBO J. 12, 2617-2624.
40.Kabsch, W. and Sander, C. (1983) Biopolymers, 22, 2577-2637.

FIGURE LEGENDS

Figure 1.  The basic progressive alignment procedure, illustrated using a set of 
7 globins of known tertiary structure.  The sequence names are from Swiss 
Prot (38):  Hba_Horse: horse alpha globin; Hba_Human: human alpha globin; 
Hbb_Horse: horse beta globin; Hbb_Human: human beta globin; Myg_Phyca: 
sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin; 
Lgb2_Luplu: lupin leghaemoglobin.   In the distance matrix, the mean 
number of differences per residue is given.  The unrooted tree shows all 
branch lengths drawn to scale.  In the rooted tree, all branch lengths (mean 
number of differences per residue along each branch) are given as well as 
weights for each sequence.  In the multiple alignment, the approximate 
positions of the 7 alpha helices, common to all 7 proteins are shown.  This 
alignment was derived using CLUSTAL W with default parameters and the 
PAM (3) series of weight matrices.  

Figure 2.  The scoring scheme for comparing two positions from two 
alignments.   Two sections of alignment with 4 and 2 sequences respectively 
are shown.   The score of the position with amino acids T,L,K,K versus the 
position with amino acids V and I is given with and without sequence 
weights.  M(X,Y) is the weight matrix entry for amino acid X versus amino 
acid Y.  Wn is the weight for sequence n.

Figure 3.  The variation in local gap opening penalty is plotted for a section of 
alignment.  The inital gap opening penalty is indicated by a dotted line. Two 
hydrophilic stretches are underlined.  The lowest penalties correspond to the 
ends of the alignment, the hydrophilic stretches and the two positions with 
gaps.   The highest values are within 8 residues of the two gap positions.  The 
rest of the variation is caused by the residue specific gap penalties (12).

Figure 4.  CLUSTAL W Alignment of a set of SH3 domains taken from (23). 
Secondary structure assignments for the solved Spectrin (24) and Fyn (39) 
domains are according to DSSP (40). The alignment was generated in two 
steps using default parameters. After full multiple alignment, the aligned 
sequences were realigned. Segments which were correctly aligned in the 
second pass are underlined. The single misaligned segment in H_P55 and the 
misaligned residue in H_NCK/2 are boxed.

The sequences are coloured to illustrate significant features. All G (orange) 
and P (yellow) are coloured. Other residues matching a frequent occurrence of 
a property in a column are coloured: hydrophobic = blue; hydrophobic 
tendency = light blue; basic = red; acidic = purple; hydrophilic = green; White 
= unconserved. The alignment figure was prepared with the GDE sequence 
editor (S. Smith, Harvard University) and COLORMASK (J. Thompson, 
EMBL).




Table 1.  Pascarella and Argos residue specific gap modification factors.   
-----------------------------------------------------------------------------------
A	1.13		M	1.29
C	1.13		N	0.63
D	0.96		P	0.74
E	1.31		Q	1.07
F	1.20		R	0.72
G	0.61		S	0.76
H	1.00		T	0.89
I	1.32		V	1.25
K	0.96		Y	1.00
L	1.21		W	1.23
-----------------------------------------------------------------------------------
The values are normalised around a mean value of 1.0 for H.  The lower the 
value, the greater the chance of having an adjacent gap.  These are derived 
from the original table of relative frequencies of gaps adjacent to each residue 
(12) by subtraction from 2.0.


ly correct but some of the 
sequences are locally misaligned, this works well.  


Comparison with other methods

Recently, several papers have addressed the problem of position specific 
parameters for multiple alignment.  In one case (35), local gap penalties are 
increased in alpha helical and beta strand regions, when the 3-D structures of 
one or more of the sequences are known.  In a second case (36), a hidden 
Markov model was used to estimate position specific gap penClustalW1.7/clustalv.html010064400063530000012000001656440636737120400166420ustar00pingouinstaff00000400000036

ClustalV 


Clustal V Multiple Sequence Alignments.


OVERVIEW

This document describes how to install and use ClustalV on various machines. ClustalV is a complete upgrade and rewrite of the Clustal package of multiple alignment programs (Higgins and Sharp, 1988 and 1989). The original programs were written in Fortran for microcomputers running MSDOS. You carried out a complete alignment by running 3 programs in succession. Later, these were merged into a single menu driven program with on-line help, for VAX/VMS. ClustalV was written in C and has all of the features of the old programs plus many new ones. It has been compiled and tested using VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C for IBM PC's and Think C for Apple Mac's. The original Clustal was written by Des Higgins while he was a Post-Doc in the lab of Paul Sharp in the Genetics Department, Trinity College, Dublin 2, Ireland.

The main feature of the old package was the ability to carry out reliable multiple alignments of many sequences. The sensitivity of the program is as good as from any other program we have tried, with the exception of the programs of Vingron and Argos (1991), while it works in reasonable time on a microcomputer. The programs of Vingron and Argos are specialised for finding distant similarities between proteins but require mainframes or workstations and are more difficult to use.

The main new features are: profile alignments (alignments of old alignments); phylogenetic trees (Neighbor Joining trees calculated after multiple alignment with a bootstrapping option); better sequence input (automatically recognise and read NBRF/PIR, Pearson (Fasta) or EMBL/SwissProt formats); flexible alignment output (choose one of: old Clustal format, NBRF/PIR, GCG msf format or Phylip format); full command line interface (everything that you can do interactively can be specified on the command line).

In version 7 of the GCG package, there is a program called PILEUP which uses a very similar algorithm to the one in ClustalV. There are 2 main differences between the programs: 1) the metric used to compare the sequences for the initial "guide tree" uses a full global, optimal alignment in PILEUP instead of the fast, approximate ones in ClustalV. This makes PILEUP much slower for the comparison of long sequences. In principle, the distances calculated from PILEUP will be more sensitive than ours, but in practice it will not make much difference, except in difficult cases. 2) During the multiple alignment, terminal gaps are penalised in ClustalV but not in PILEUP. This will make the PILEUP alignments better when the sequences are of very different lengths (has no effect if there are no large terminal gaps).


This software may be distributed and used freely, provided that you do not modify it or this documentation in any way without the permission of the authors.

If you wish to refer to ClustalV, please cite: Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software for multiple sequence alignment. CABIOS, vol .8, 189-191.

The overall multiple alignment algorithm was described in: Higgins,D.G. and Sharp,P.M. (1989). Fast and sensitive multiple sequence alignments on a microcomputer. CABIOS, vol. 5, 151-153.


ACKNOWLEDGEMENTS
D.H. would particularly like to thank Paul Sharp, in whose lab. this work originated. We also thank Manolo Gouy, Gene Myers, Peter Rice and Martin Vingron for suggestions, bug-fixes and help.
Des Higgins and Rainer Fuchs, EMBL Data Library, Heidelberg, Germany.
Alan Bleasby, Daresbury, UK.

JUNE 1991


INSTALLATION. (DELETED)


INTERACTIVE USAGE.

Interactive usage of Clustal V is completely menu driven. On-line help is provided, defaults are offered for all parameters and file names. With a little effort it should be completely self explanatory. The main menu, which appears when you run the programs is shown below. Each item brings you to a sub menu.

 

 Main menu for Clustal V:


 1. Sequence Input From Disc
 2. Multiple Alignments
 3. Profile Alignments
 4. Phylogenetic trees

 S. Execute a system command
 H. HELP
 X. EXIT (leave program)


 Your choice: 


The options S and H appear on all the main menus. H will provide help and if you type S you will be asked to enter a command, such as DIR or LS, which will be sent to the system (does not work on Mac's). Before carrying out an alignment, you must use option 1 (sequence input); the format for sequences is explained below. Under menu item 2 you will be able to automatically align your sequences to each other. Menu item 3 allows you to do profile alignments. These are alignments of old alignments. This allows you to build up a multiple alignment in stages or add a new sequence to an old alignment. You can calculate phylogenetic trees from alignments using menu item 4.



* MULTIPLE ALIGNMENT MENU. *

The multiple alignment menu is shown below. Before explaining how to use it, you must be introduced briefly to the alignment strategy. If you do not follow this, try using option 1 anyway; the entire process will be carried out automatically.

To do a complete multiple alignment, we need to know the approximate relationships of the sequences to each other (which ones are most similar to each other). We do this by calculating a crude phylogenetic tree which we call a dendrogram (to distinguish it from the more sensitive trees available under the phylogenetic tree menu). This dendrogram is used as a guide to align bigger and bigger groups of sequences during the multiple alignment. The dendrogram is calculated in 2 stages: 1) all pairs of sequence are compared using the fast/approximate method of Wilbur and Lipman (1983); the result of each comparison is a similarity score. 2) the similarity scores are used to construct the dendrogram using the UPGMA cluster analysis method of Sneath and Sokal (1973).

The construction of the dendrogram can be very time consuming if you wish to align many sequences (e.g. for 100 sequences you need to carry out 100x99/2 sequence comparisons = 4950). During every multiple alignment, a dendrogram is constructed and saved to a file (something.dnd). These can be reused later.


 
 ******Multiple*Alignment*Menu******


1.  Do complete multiple alignment now
2.  Produce dendrogram file only
3.  Use old dendrogram file
4.  Pairwise alignment parameters
5.  Multiple alignment parameters
6.  Output format options

S.  Execute a system command
H.  HELP
or press [RETURN] to go back to main menu


 Your choice: 

So, if in doubt, and you have already loaded some sequences from the main menu, just try option 1 and press the <Return> key in response to any questions. You will be prompted for 2 file names e.g. if the sequence input file was called DRINK.PEP, you will be offered DRINK.ALN as the file to contain the alignment and DRINK.DND for the dendrogram.

If you wish to repeat a multiple alignment (e.g. to experiment with different gap penalties) but do not wish to make a dendrogram all over again use menu item 3 (providing you are using the same sequences). Similarly, menu item 2 allows you to produce the dendrogram file only.


PAIRWISE ALIGNMENT PARAMETERS:

The parameters that control the initial fast/approximate comparisons can be set from menu item 4 which looks like:

 
  ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********


 1. Toggle Scoring Method  :Percentage
 2. Gap Penalty            :3
 3. K-tuple                :1
 4. No. of top diagonals   :5
 5. Window size            :5

 H. HELP


 Enter number (or [RETURN] to exit): 



The similarity scores are calculated from fast alignments generated by the method of Wilbur and Lipman (1983). These are 'hash' or 'word' or 'k-tuple' alignments carried out in 3 stages.

First you mark the positions of every fragment of sequence, K-tuple long (for proteins, the default length is 1 residue, for DNA it is 2 bases) in both sequences. Then you locate all k-tuple matches between the 2 sequences. At this stage you have to imagine a dot- matrix plot between the 2 sequences with each k-tuple match as a dot. You find those diagonals in the plot with most matches (you take the "No. of top diagonals" best ones) and mark all diagonals within Window size" of each top diagonal. This process will define diagonal bands in the plot where you hope the most likely regions of similarity will lie.

The final alignment stage is to find that head to tail arrangement of k-tuple matches from these diagonal regions that will give the highest score. The score is calculated as the number of exactly matching residues in this alignment minus a "gap penalty" for every gap that was introduced. When you toggle "Scoring method" you choose between expressing these similarity scores as raw scores or expressed as a percentage of the shorter sequence length.

K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA. Increase this to increase speed; decrease to improve sensitivity.

GAP PENALTY: The number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple Size. This has little effect on speed or sensitivity.

NO. OF TOP DIAGONALS: The number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease (must be greater than zero) to increase speed; increase to improve sensitivity.

WINDOW SIZE: The number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity.

SCORING METHOD: The similarity scores may be expressed as raw scores (number of identical residues minus a "gap penalty" for each gap) or as percentage scores. If the sequences are of very different lengths, percentage scores make more sense.


CHANGING THE PAIRWISE ALIGNMENT PARAMETERS
The main reason for wanting to change the above parameters is SPEED (especially on microcomputers), NOT SENSITIVITY. The dendrograms that are produced can only show the relationships between the sequences APPROXIMATELY because the similarity scores are calculated from seperate pairwise alignments; not from a multiple alignment (that is what we eventually hope to produce). If the groupings of the sequences are "obvious", the above method should work well; if the relationships are obscure or weakly represented by the data, it will not make much difference playing with the parameters. The main factor influencing speed is the K-TUPLE SIZE followed by the WINDOW SIZE.

The alignments are carried out in a small amount of memory. Occasionally (it is hard to predict), you will run out of memory while doing these alignments; when this happens, it will say on the screen: "Sequences (a,b) partially aligned" (instead of "Sequences (a,b) aligned"). This means that the alignment score for these sequences will be approximate; it is not a problem unless many of the alignments do this. It can be fixed by using less sensitive parameters or increasing parameter FSIZE in clustalv.h .



THE DENDROGRAM ITSELF

The similarity scores generated by the fast comparison of all the sequences are used to construct a dendrogram by the UPGMA method of Sneath and Sokal (1973). This is a form of cluster analysis and the end result produces something that looks like a tree. It represents the similarity of the sequences as a hierarchy. The dendrogram is written to a file in a machine readable format and is ahown below for an example with 6 sequences.



91.0   0   0   2   012000         ! seq 2 joins seq 3 at 91% ID.
72.0   1   0   3   011200         ! seq 4 joins seqs 2,3 at 72%
71.1   0   0   2   000012         ! seq 5 joins seq 6 at 71%
35.5   0   2   4   122200         ! seq 1 joins seqs 2,3,4
21.7   4   3   6   111122         ! seqs 1,2,3,4 join seqs 5,6

This LOOKS complicated but you do not normally need to care what is in here. Anyway, each row represents the joining together of 2 or more sequences. You progress from the top down, joining more and more sequences until all are joined together; for N sequences you have N-1 groupings hence there are 5 rows in the above file (there were 6 sequences). In each row, the first number is the similarity score of this grouping; ignore the next three columns for the moment; the last 6 digits in the line show which sequences are grouped; there is one digit for each sequence (the first digit is for the first sequence). The rule is: in each row, all of the "1"s join all of the "2"s; the zero's do nothing.

Hence, in the first row, sequence 2 joins sequence 3 at a similarity level of 91% identity; next, sequence 4 joins the previous grouping of 2 plus 3 at a level of 72% etc. This is shown diagrammatically below. Before leaving the dendrogram format, the other 3 columns of numbers are: a pointer to the row from which the "1" sequences were last joined (or zero if only one of them); a pointer to the row in which the "2"s were last joined; the total number of sequences joined in this line.


 


                  I------ 2
           I------I
           I      I------ 3  Diagram of the sequence similarity 
      I----I
      I    I------------- 4  relationships shown in the above 
   I--I
   I  I------------------ 1  dendrogram file (branch lengths are
    ----I
   I       I------------- 5  not to scale).
   I-------I
           I------------- 6








MULTIPLE ALIGNMENT PARAMETERS:

Having calculated a dendrogram between a set of sequences, the final multiple alignment is carried out by a series of alignments of larger and larger groups of sequences. The order is determined by the dendrogram so that the most similar sequences get aligned first. Any gaps that are introduced in the early alignments are fixed. When two groups of sequences are aligned against each other, a full protein weight matrix (such as a Dayhoff PAM 250) is used. Two gap penalties are offered: a "FIXED" penalty for opening up a gap and a "FLOATING" penalty for extending a gap.

 

  ********* MULTIPLE ALIGNMENT PARAMETERS *********


 1. Fixed Gap Penalty       :10
 2. Floating Gap Penalty    :10
 3. Toggle Transitions (DNA):Weighted
 4. Protein weight matrix   :PAM 250

 H. HELP


 Enter number (or [RETURN] to exit): 


FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes; increase it to discourage them. Terminal gaps are penalised same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.

FLOATING GAP PENALTY: Reduce this to encourage longer gaps; increase it to shorten them. Terminal gaps are penalised same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.


DNA TRANSITIONS = WEIGHTED or UNWEIGHTED: By default, transitions (A versus G; C versus T) are weighted more strongly than transversions (an A aligned with a G will be preferred to an A aligned with a C or a T). You can make all pairs of nucleotide equally weighted with this option.

PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is used to differentially weight different pairs of aligned amino acids. The default is the well known Dayhoff PAM 250 matrix. We also offer a PAM 100 matrix, an identity matrix (all weights are the same for exact matches) or allow you to give the name of a file with your own matrix. The weight matrices used by Clustal V are shown in full in the Algorithms and References section of this documentation.

If you input a matrix from a file, it must be in the following format. Use a 20x20 matrix only (entries for the 20 "normal" amino acids only; no ambiguity codes etc.). Input the lower left triangle of the matrix, INCLUDING the diagonal. The order of the amino acids (rows and columns) must be: CSTPAGNDEQHRKMILVFYW. The values can be in free format seperated by spaces (not commas). The PAM 250 matrix is shown below in this format.


   12 
    0  2 
   -2  1  3 
   -3  1  0  6 
   -2  1  1  1  2 
   -3  1  0 -1  1  5 
   -4  1  0 -1  0  0  2 
   -5  0  0 -1  0  1  2  4 
   -5  0  0 -1  0  0  1  3  4 
   -5 -1 -1  0  0 -1  1  2  2  4 
   -3 -1 -1  0 -1 -2  2  1  1  3  6 
   -4  0 -1  0 -2 -3  0 -1 -1  1  2  6 
   -5  0  0 -1 -1 -2  1  0  0  1  0  3  5 
   -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6 
   -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5 
   -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6 
   -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4 
   -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9 
    0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10 
   -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17 

Values must be integers and can be all positive or positive and negative as above. These are SIMILARITY values.





ALIGNMENT OUTPUT OPTIONS:

By default, the alignment goes to a file in a self explanatory "blocked" alignment format. This format is fine for displaying the results but requires heavy editing if you wish to use the alignment with other software. To help, we provide 3 other formats which can be turned on or off. If you have a sequence data set or alignment in memory, you can also ask for output files in whatever formats are turned on, NOW. The menu you use to choose format is shown below.

*** We draw your attention to NBRF/PIR format in particular. This format is EXACTLY the same as one of the input formats. Therefore, alignments written in this format can be used again as input (to the profile alignments or phylogenetic trees). ***

 

  ********* Format of Alignment Output *********


 1. Toggle CLUSTAL format output   =  ON
 2. Toggle NBRF/PIR format output  =  OFF
 3. Toggle GCG format output       =  OFF
 4. Toggle PHYLIP format output    =  OFF

 5. Create alignment output file(s) now?
 H. HELP


 Enter number (or [RETURN] to exit): 



CLUSTAL FORMAT: This is a self explanatory alignment. The alignment is written out in blocks. Identities are highlighted and (if you use a PAM 250 matrix) positions in the alignment where all of the residues are "similar" to each other (PAM 250 score of 8 or more) are indicated.

NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps indicated by hyphens ("-"). AS we have stressed before, this format is EXACTLY compatible with the sequence input format. Therefore you can read in these alignments again for profile alignments or for calculating phylogenetic trees.

GCG FORMAT: In version 7 of the Wisconsin GCG package, a new multiple sequence format was introduced. This is the MSF (Multiple Sequence Format) format. It can be used as input to the GCG sequence editor or any of the GCG programs that make use of multiple alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG PACKAGE OR LATER.

PHYLIP FORMAT: This format can be used by the Phylip package of Joe Felsenstein (see the references/algorithms section for details of how to get it). Phylip allows you to do a huge range of phylogenetic analyses (we just offer one method in this program) and is probably the most widely used set of programs for drawing trees. It also works on just about every computer you can think of, providing you have a decent Pascal compiler.





* PROFILE ALIGNMENT MENU. *


This menu is for taking two old alignments (or single sequences) and aligning them with each other. The result is one bigger alignment. The menu is very similar to the multiple alignment menu except that there is no mention of dendrograms here (they are not needed) and you need to input two sets of sequences. The menu looks like this:

 

 ******Profile*Alignment*Menu******


1.  Input 1st. profile/sequence
2.  Input 2nd. profile/sequence
3.  Do alignment now
4.  Alignment parameters
5.  Output format options

S.  Execute a system command
H.  HELP
or press [RETURN] to go back to main menu


 Your choice: 


You must input profile number 1 first. When both profiles are loaded, use item 3 (Do alignment now) and the 2 profiles will be aligned. Items 4 and 5 (parameters and output options) are identical to the equivalent options on the multiple alignment menu.

The same input routines that are used for general input are used here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format, with gaps indicated by hyphens ("-"). This is why we have continualy drawn your attention to the NBRF/PIR format as a useful output format.

Either profile can consist of just one sequence. Therefore, if you have a favourite alignment of sequences that you are working on and wish to add a new sequence, you can use this menu, provided the alignment is in the correct format.

The total number of sequences in the two profiles must be less less than or equal to the MAXN parameter set in the clustalv.h header file.



* PHYLOGENETIC TREE MENU. *


This menu allows you to input an alignment and calculate a phylogenetic tree. You can also calculate a tree if you have just carried out a multiple alignment and the alignment is still in memory. THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!! The tree will look strange if the sequences are not already aligned. You can also "BOOTSTRAP" the tree to show confidence levels for groupings. This is SLOW on microcomputers but works fine on workstations or mainframes.

 

 ******Phylogenetic*tree*Menu******


1.  Input an alignment
2.  Exclude positions with gaps?        = OFF
3.  Correct for multiple substitutions? = OFF
4.  Draw tree now
5.  Bootstrap tree

S.  Execute a system command
H.  HELP
or press [RETURN] to go back to main menu


 Your choice: 



The same input routine that is used for general input is used here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format, with gaps indicated by hyphens ("-"). This is why we have continualy drawn your attention to the NBRF/PIR format as a useful output format.

If you have input an alignment, then just use item 4 to draw a tree. The method used is the Neighbor Joining method of Saitou and Nei (1987). This is a "distance method". First, percent divergence figures are calculated between all pairs of sequence. These divergence figures are then used by the NJ method to give the tree. Example trees will be shown below.

There are two options which can be used to control the way the distances are calculated. These are set by options 2 and 3 in the menu.

EXCLUDE POSITIONS WITH GAPS? This option allows you to ignore all alignment positions (columns) where there is a gap in ANY sequence. This guarantees that "like" is compared with "like" in all distances i.e. the same positions are used to calculate all distances. It also means that the distances will be "metric". The disadvantage of using this option is that you throw away much of the data if there are many gaps. If the total number of gaps is small, it has little effect.

CORRECT FOR MULTIPLE SUBSTITUTIONS? As sequences diverge, substitutions accumulate. It becomes increasingly likely that more than one substitution (as a result of a mutation) will have happened at a site where you observe just one difference now. This option allows you to use formulae developed by Motoo Kimura to correct for this effect. It has the effect of stretching long branches in tres while leaving short ones relatively untouched. The desired effect is to try and make distances proportional to time since divergence.

The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the name of the input, alignment file. An example is shown below for 6 globin sequences.

 

  DIST   = percentage divergence (/100)
  Length = number of sites used in comparison

    1 vs.   2  DIST = 0.5683;  length =    139
    1 vs.   3  DIST = 0.5540;  length =    139
    1 vs.   4  DIST = 0.5315;  length =    111
    1 vs.   5  DIST = 0.7447;  length =    141
    1 vs.   6  DIST = 0.7571;  length =    140
    2 vs.   3  DIST = 0.0897;  length =    145
    2 vs.   4  DIST = 0.1391;  length =    115
    2 vs.   5  DIST = 0.7517;  length =    145
    2 vs.   6  DIST = 0.7431;  length =    144
    3 vs.   4  DIST = 0.0957;  length =    115
    3 vs.   5  DIST = 0.7379;  length =    145
    3 vs.   6  DIST = 0.7361;  length =    144
    4 vs.   5  DIST = 0.7304;  length =    115
    4 vs.   6  DIST = 0.7368;  length =    114
    5 vs.   6  DIST = 0.2697;  length =    152


 			Neighbor-joining Method

  Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
  A New Method for Reconstructing Phylogenetic Trees.
  Mol. Biol. Evol., 4(4), 406-425


  This is an UNROOTED tree

  Numbers in parentheses are branch lengths


  Cycle   1     =  SEQ:   5 (  0.13382) joins  SEQ:   6 (  0.13592)

  Cycle   2     =  SEQ:   1 (  0.28142) joins Node:   5 (  0.33462)

  Cycle   3     =  SEQ:   2 (  0.05879) joins  SEQ:   3 (  0.03086)

  Cycle   4 (Last cycle, trichotomy):

 		 Node:   1 (  0.20798) joins
 		 Node:   2 (  0.02341) joins
SEQ:   4 (  0.04915) 



The output file first shows the percent divergence (distance) figures between each pair of sequence. Then a description of a NJ tree is given. This description shows which sequences (SEQ:) or which groups of sequences (NODE: , a node is numbered using the lowest sequence that belongs to it) join at each level of the tree.

This is an unrooted tree!! This means that the direction of evolution through the tree is not shown. This can only be inferred in one of two ways: 1) assume a degree of constancy in the molecular clock and place the root (bottom of the tree; the point where all the sequences radiate from) half way along the longest branch. **OR** 2) use an "outgroup", a sequence from an organism that you "know" must be outside of the rest of the sequences i.e. root the tree manually, on biological grounds.

The above tree can be represented diagramatically as follows:


 
                               SEQ 1       SEQ 4
                                I           I
               13.6             I 28.1      I 4.9          5.9
       SEQ 6 ----------I        I           I          I--------- SEQ 2
                       I        I           I          I
                       I--------I-----------I----------I
               13.4    I  33.5      20.8         2.3   I   3.1
       SEQ 5 ----------I                               I--------- SEQ 3


The figures along each branch are percent divergences along that branch. If you root the tree by placing the root along the longest branch (33.5%) then you can draw it again as follows, this time rooted:


 

                         13.6
                 I-------------------- SEQ 6
       I---------I       13.4
       I         I-------------------- SEQ 5
       I 33.5 
  -----I                 28.1
       I         I-------------------- SEQ 1
       I         I
       I---------I                4.9
                 I  20.8  I----------- SEQ 4
                 I--------I  
                          I       5.9
                          I 2.3 I----- SEQ 2
                          I-----I 3.1
                                I----- SEQ 3



The longest branch (33.5% between 5,6 and 1,2,3,4) is split between the 2 bottom branches of the tree. As it happens in this particular case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 are alpha and beta globins, so you could also justify the above rooting on biological grounds. If you do not have any particular need or evidence for the position of the root, then LEAVE THE TREE UNROOTED. Unrooted trees do not look as pretty as rooted ones but it is uaual to leave them unrooted if you do not have any evidence for the position of the root.


BOTSTRAPPING: Different sets of sequences and different tree drawing methods may give different topologies (branching orders) for parts of a tree that are weakly supported by the data. It is useful to have an indication of the degree of error in the tree. There are several ways of doing this, some of them rather technical. We provide one general purpose method in this program, which makes use of a technique called bootstrapping (see Felsenstein, 1985).

In the case of sequence alignments, bootstrapping involves taking random samples of positions from the alignment. If the alignment has N positions, each bootstrap sample consists of a random sample of N positions, taken WITH REPLACEMENT i.e. in any given sample, some sites may be sampled several times, others not at all. Then, with each sample of sites, you calculate a distance matrix as usual and draw a tree. If the data very strongly support just one tree then the sample trees will be very similar to each other and to the original tree, drawn without bootstrapping. However, if parts of the tree are not well supported, then the sample trees will vary considerably in how they represent these parts.

In practice, you should use a very large number of bootstrap replicates (1000 is recommended, even if it means running the program for an hour on a slow microcomputer; on a workstation it will be MUCH faster). For each grouping on the tree, you record the number of times this grouping occurs in the sample trees. For a group to be considered "significant" at the 95% level (or P <= 0.05 in statistical terms) you expect the grouping to show up in >= 95% of the sample trees. If this happens, then you can say that the grouping is significant, given the data set and the method used to draw the tree.

So, when you use the bootstrap option, a NJ tree is drawn as before and then you are asked to say how many bootstrap samples you want (1000 is the default) and you are asked to give a seed number for the random number generator. If you give the same seed number in future, you will get the same results (we hope). Remember to give different seed numbers if you wish to carry out genuinely different bootstrap sampling experiments. Below is the output file from using the same data for the 6 globin sequences as used before. The output file has the same name as the input fike with the extension ".njb".

 
 //
 STUFF DELETED  .... same as for the ordinary NJ output
 //
 			Bootstrap Confidence Limits


  Random number generator seed =      99

  Number of bootstrap trials   =    1000


  Diagrammatic representation of the above tree: 

  Each row represents 1 tree cycle; defining 2 groups.

  Each column is 1 sequence; the stars in each line show 1 group; 
  the dots show the other

  Numbers show occurences in bootstrap samples.

 ****..   1000              
 .***..   1000                <- This is the answer!!
 *..***    812 
 122311


For an unrooted tree with N sequences, there are actually only N-3 genuinely different groupings that we can test (this is the number of "internal branches"; each internal branch splits the sequences into 2 groups). In this example, we have 6 sequences with 3 internal branches in the reference tree. In the bootstrap resampling, we count how often each of these internal branches occur. Here, we find that the branch which splits 1,2,3 and 4 versus 5 and 6 occurs in all 1000 samples; the branch which splits 2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2 and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put these figures on to the diagrammatic representation we made earlier of our unrooted NJ tree as follows:


 

                               SEQ 1       SEQ 4
                                I           I
                                I           I            
       SEQ 6 ----------I        I           I          I--------- SEQ 2
                       I  1000  I   1000    I   812    I
                       I--------I-----------I----------I
                       I                               I    
       SEQ 5 ----------I                               I--------- SEQ 3



You can equally put these confidence figures on the rooted tree (in fact the interpretation is simpler with rooted trees). With the unrooted tree, the grouping of sequence 5 with 6 is significant (as is the grouping of sequences 1,2,3 and 4). Equally the grouping of sequences 1,5 and 6 is significant (the same as saying that 2,3 and 4 group significantly). However, the grouping of 2 and 3 is not significant, although it is relatively strongly supported.

Unfortunately, there is a small complication in the interpretation of these results. In statistical hypothesis testing, it is not valid to make multiple simultaneous tests and to treat the result of each test completely independantly. In the above case, if you have one particular test (grouping) that you wish to make in advance, it is valid to test IT ALONE and to simply show the other bootstrap figures for reference. If you do not have any particular test in mind before you do the bootstrapping, you can just show all of the figures and use the 95% level as an ARBITRARY cut off to show those groups that are very strongly supported; but not mention anything about SIGNIFICANCE testing. In the literature, it is common practice to simply show the figures with a tree; they frequently speak for themselves.



ALGORITHMS AND REFERENCES.

In this section, we will try to BRIEFLY describe the algorithms used in ClustalV and give references.

MULTIPLE ALIGNMENTS.

The approach used in ClustalV is a modified version of the method of Feng and Doolittle (1987) who aligned the sequences in larger and larger groups according to the branching order in an initial phylogenetic tree. This approach allows a very useful combination of computational tractability and sensitivity.

The positions of gaps that are generated in early alignments remain through later stages. This can be justified because gaps that arise from the comparison of closely related sequences should not be moved because of later alignment with more distantly related sequences. At each alignment stage, you align two groups of already aligned sequences. This is done using a dynamic programming algorithm where one allows the residues that occur in every sequence at each alignment position to contribute to the alignment score. A Dayhoff (1978) PAM matrix is used in protein comparisons.

The details of the algorithm used in ClustalV have been published in Higgins and Sharp (1989). This was an improved version of an earlier algorithm published in Higgins and Sharp (1988). First, you calculate a crude similarity measure between every pair of sequence. This is done using the fast, approximate alignment algorithm of Wilbur and Lipman (1983). Then, these scores are used to calculate a "guide tree" or dendrogram, which will tell the multiple alignment stage in which order to align the sequences for the final multiple alignment. This "guide tree" is calculated using the UPGMA method of Sneath and Sokal (1973). UPGMA is a fancy name for one type of average linkage cluster analysis, invented by Sokal and Michener (1958).

Having calculated the dendrogram, the sequences are aligned in larger and larger groups. At each alignment stage, we use the algorithm of Myers and Miller (1988) for the optimal alignments. This algorithm is a very memory efficient variation of Gotoh's algorithm (Gotoh, 1982). It is because of this algorithm that ClustalV can work on microcomputers. Each of these alignments consists of aligning 2 alignments, using what we call "profile alignments".



PROFILE ALIGNMENTS.

We use the term "profile alignment" to describe the alignment of 2 alignments. We use this term because the method is a simple extension of the profile method of Gribskov, et al. (1987) for aligning 1 sequence with an alignment. Normally, with a 2 sequence alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a score between the pairs of aligned residues. The alignment is considered "optimal" if it gives the best total score for aligned residues minus penalties for any gaps (insertions or deletions) that must be introduced.

Profile alignments are a simple extension of 2 sequence alignments in that you can treat each of the two input alignments as single sequences but you calculate the score at aligned positions as the average weight matrix score of all the residues in one alignment versus all those in the other e.g. if you have 2 alignments with I and J sequences respectively; the score at any position is the average of all the I times J scores of the residues compared seperately. Any gaps that are introduced are placed in all of the sequences of an alignment at the same position. The profile alignments offered in the "profile alignment menu" are also calculated in this way.



PROTEIN WEIGHT MATRICES.

There are 3 built-in weight matrices used by clustalV. These are the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity matrix. Each matrix is given as the bottom left half, including the diagonal of a 20 by 20 matrix. The order of the rows and columns is CSTPAGNDEQHRKMILVFYW.


 
 PAM 250

 C  12 
 S   0  2 
 T  -2  1  3 
 P  -3  1  0  6 
 A  -2  1  1  1  2 
 G  -3  1  0 -1  1  5 
 N  -4  1  0 -1  0  0  2 
 D  -5  0  0 -1  0  1  2  4 
 E  -5  0  0 -1  0  0  1  3  4 
 Q  -5 -1 -1  0  0 -1  1  2  2  4 
 H  -3 -1 -1  0 -1 -2  2  1  1  3  6 
 R  -4  0 -1  0 -2 -3  0 -1 -1  1  2  6 
 K  -5  0  0 -1 -1 -2  1  0  0  1  0  3  5 
 M  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6 
 I  -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5 
 L  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6 
 V  -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4 
 F  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9 
 Y   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10 
 W  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17 
 ---------------------------------------------------------------- 
C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W


 IDENTITY MATRIX

 10 
  0  10 
  0  0  10 
  0  0  0  10 
  0  0  0  0  10 
  0  0  0  0  1  10 
  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10 
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10





 PAM 100

  14 
  -1  6 
  -5  2   7 
  -6  1  -1  10 
  -5  2   2   1   6 
  -8  1  -3  -3   1   8 
  -8  2   0  -3  -1  -1  7 
 -11 -1  -2  -4  -1  -1  4   8 
 -11 -2  -3  -3   0  -2  1   5   8 
 -11 -3  -3  -1  -2  -5 -1   1   4   9 
  -6 -4  -5  -2  -5  -7  2  -1  -2   4 11 
  -6 -1  -4, -2  -5  -8 -3  -6  -5   1  1 10 
 -11 -2  -1  -4  -4  -5  1  -2  -2  -1 -3  3  8 
 -11 -4  -2  -6  -3  -8 -5  -8  -6  -2 -7 -2  1 13 
  -5 -4  -1  -6  -3  -7 -4  -6  -5  -5 -7 -4  4  2  9 
 -12 -7  -5  -5  -5  -8 -6  -9  -7  -3 -5 -7 -6  4  2  9 
  -4 -4  -1  -4   0  -4 -5  -6  -5  -5 -6 -6 -6  1  5  1  8 
 -10 -5  -6  -9  -7  -8 -6 -11 -11 -10 -4 -7-11 -2  0  0 -5 12 
  -2 -6  -6 -11  -6 -11 -3  -9  -7  -9 -1-10-10 -8 -4 -5 -6  6 13 
 -13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7  1 -9-11-12 -7-14 -2 -2 19 




PHYLOGENETIC TREES.

There are two COMMONLY used approaches for inferring phylogentic trees from sequence data: parsimony and distance methods. There are other approaches which are probably superior in theory but which are yet to be used widely. This does not mean that they are no use; we (the authors of this program at any rate) simply do not know enough about them yet. You should see the documentation accompanying the Phylip package and some of the references there for an explanation of the different methods and what assumptions are implied when you use them.

There is a constant debate in the literature as to the merits of different methods but unfortunately, a lot of what is said is incomprehensible or inaccurate. It is also a field that is prone to having highly opinionated schools of thought. This is a pity because it prevents rational discussion of the pro's and con's of the different methods. The approach adopted in ClustalV is to supply just one method and to produce alignments in a format that can be used by Phylip. In simple cases, the trees produced will be as "good" (reliable, robust) as those from ANY other method. In more complicated cases, there is no single magic recipe that we can supply that will work well in even most situations.

The method we provide is the Neighbor Joining method (NJ) of Saitou and Nei (1987) which is a distance method. We use this for three reasons: it is conceptually and computationally simple; it is fast; it gives "good" trees in simple cases. It is difficult to prove that one tree is "better" than another if you do not know the true phylogeny; the few systematic surveys of methods show it to work more or less as well as any other method ON AVERAGE. Another reason for using the NJ method is that it is very commonly used; THIS IS A BAD REASON SCIENTIFICALLY but at least you will not feel lonely if you use it.

The NJ method works on a matrix of distances (the distance matrix) between all pairs of sequence to be analysed. These distances are related to the degree of divergence between the sequences. It is normal to calculate the distances from the sequences after they are multiply aligned. If you calculate them from seperate alignments (as done for the dendrograms in another part of this program), you may increase the error considerably.



DISTANCES

The simplest measure of distance between sequences is percent divergence (100% minus percent identity). For two sequences, you count how many positions differ between them (ignoring all positions with a gap or an unknown residue) and divide by the number of positions considered. It is common practice to also ignore all positions in the alignment where there is a GAP in ANY of the sequences (Tossgaps ? option in the menu). Usually, you express the percent distance divided by 100 (gives distances between 0.0 and 1.0).

This measure of distance is perfectly adequate (with some further modification described below) for rRNA sequences. However it treats all residues identically e.g. all amino acid substitutions are equally weighted. It also treats all positions identically e.g. it does not take account of different rates of substitution in different positions of different codons in protein coding DNA sequences; see Li et al (1985) for a distance measure that does. Despite these shortcomings, these percent identity distances do work well in practice in a wide variety of situations.

In a simple world, you would like a distance to be proportional to the time since the sequences diverged. If this were EXACTLY true, then the calculation of the tree would be a simple matter of algebra (UPGMA does this for you) and the branch lengths will be nice and meaningful (times). In practice this OBVIOUSLY depends on the existence and quality of the "molecular clock", a subject of on- going debate. However, even if there is a good clock, there is a further problem with estimating divergences. As sequences diverge, they become "saturated" with mutations. Sites can have substitutions more than once. Calculated distances will underestimate actual divergence times; the greater the divergence, the greater the discrepancy. There are various methods for dealing with this and we provide two commonly used ones, both due to Motoo Kimura; one for proteins and one for DNA.


For distance K (percent divergence /100 ) ...

Correction for Protein distances: (Kimura, 1983).


   Corrected K = -ln(1.0 - K - (K * k/5.0))


Correction for nucleotide distances: Kimura's 2-parameter method (Kimura, 1980).


   Corrected K = 0.5*ln(a) + 0.25*ln(b)

   where     a = 1/(1 - 2*P - Q)
   and       b = 1/(1 - 2*Q)

P and Q are the proportions of transitions (A<-->G, C<-->T) and transversions occuring between the sequences.


One paradoxical effect of these corrections, is that distances can be corrected to have more than 100% divergence. That is because, for very highly diverged sequences of length N, you can estimate that more than N substitutions have occured by correcting the observed distance in the above ways. Don't panic!




NEIGHBOR JOINING TREES.

VERY briefly, the NJ method works as follows. You start by placing the sequences in a star topology (no internal branches). You then find that internal branch (take 2 sequences; join them; connect them to the rest by the internal branch) which when added to the tree will minimise the total branch length. The two joined sequences (neighbours) are merged into a single sequence and the process is repeated. For an unrooted tree with N sequences, there are N-3 internal branches. The above process is repeated N-3 times to give the final tree. The full details are given in Saitou and Nei (1987).

As explained elsewhere in the documentation, you can only root the tree by one of two methods:

1) assume a degree of constancy in the molecular clock and place the root along the longest branch (internal or external). Methods that appear to produce rooted trees automatically are often just doing this without letting you know; this is true of UPGMA.

2) root the tree on biological grounds. The usual method is to include an "outgroup", a sequence that you are certain will branch to the outside of the tree.




BOOTSTRAPPING.

Bootstrapping is a general purpose technique that can be used for placing confidence limits on statistics that you estimate without any knowledge of the underlying distribution (e.g. a normal or poisson distribution). In the case of phylogenetic trees, there are several analytical methods for placing confidence limits on groupings (actually on the internal branches) but these are either restricted to particular tree drawing methods or only work on small trees of 4 or 5 sequences. Felsenstein (1985) showed how to use bootstrapping to calculate confidence limits on trees. His approach is completely general and can be applied to any tree drawing method. The main assumption of the method in this context is that the sites in the alignment are independant; this will be true of some sequence alignments (e.g. pseudogenes) but not others (e.g. rRNA's). What effect, lack of independance will have on the results is not known.

The method works by taking random samples of data from the complete data set. You compute the test statistic (tree in this case) on each sample. Variation in the statistic computed from the samples gives a measure of variation in the statistic which can be used to calculate confidence intervals. Each random sample is the same size as the complete data set and is taken WITH REPLACEMENT i.e. a data point can be selected more than once (or not at all) in any given sample.

In the case of an alignment N residues long, each random sample is a random selection of N sites form the alignment. For each sample, we calculate a distance matrix and tree in the usual way. Variation in the sample trees compared to a tree calculated from the full data set gives an indication of how well supported the tree is by the data. If the sample trees are very similar to each other and to the full tree, then the tree is "strongly" supported; if the sample trees show great variation, then the tree will be weakly supported. In practice, you usually find some parts of a tree well supported, others weakly. This can be seen by counting how often each monophyletic group in the full tree occurs in the sample trees.

For a particular grouping, one considers it to be significant at the 95% level (P <= 0.05) if it occurs in 95% of the bootstrap samples. If a grouping is significant, it is significant with respect to the particular data set and method used for drawing the tree. Biological "significance" is another matter.



PHYLIP.

The Phylip package was written by Joe Felsenstein, University of Washington, USA. It provides Pascal source code for a large number of programs for doing most types of phylogenetic analyses. The Phylip format alignments produced by this program can be used by all of the Phylip programs, version 3.4 or later (March 1991). It is freely available from him as follows.

 ================= PHYLIP information sheet =====================

    PHYLIP - Phylogeny Inference Package (version 3.3)

 This is a FREE package of programs for inferring phylogenies and 
 carrying out certain related tasks.  At present it contains 28 
 programs, which carry out different algorithms on different kinds of 
 data.  The programs in the package are:

  ---------- Programs for molecular sequence data ----------
 PROTPARS  Protein parsimony          
 DNAPARS   Parsimony method for DNA
 DNAMOVE   Interactive DNA parsimony  
 DNAPENNY  Branch and bound for DNA
 DNABOOT   Bootstraps DNA parsimony   
 DNACOMP   Compatibility for DNA
 DNAINVAR  Phylogenetic invariants    
 DNAML     Maximum likelihood method
 DNAMLK    DNAML with molecular clock 
 DNADIST   Distances from sequences
 RESTML    ML for restriction sites

----------- Programs for distance matrix data ------------
 FITCH     Fitch-Margoliash and least-squares methods
 KITSCH    Fitch-Margoliash and least squares methods with
           evolutionary clock

--- Programs for gene frequencies and continuous characters --
 CONTML    Maximum likelihood method  
 GENDIST   Computes genetic distances

------------- Programs for discrete state data -----------
 MIX       Wagner, Camin-Sokal, and mixed parsimony criteria
 MOVE      Interactive Wagner, C-S, mixed parsimony program
 PENNY     Finds all most parsimonious trees by branch-and-bound
 BOOT      Bootstrap confidence interval on mixed parsimony methods
 DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT   same as preceding four
           programs, but for the Dollo and polymorphism parsimony 
           criteria
 CLIQUE    Compatibility method       
 FACTOR    recode multistate characters

---- Programs for plotting trees and consensus trees ----
 DRAWGRAM  Draws cladograms and phenograms on screens, plotters and 
           printers
 DRAWTREE  Draws unrooted phylogenies on screens, plotters and 
           printers
 CONSENSE  Majority-rule and strict consensus trees

 The package includes extensive documentation files that provide the 
 information necessary to use and modify the programs.

 COMPATIBILITY: The programs are written in a very standard subset of 
 Pascal, a language that is available on most computers (including 
 microcomputers). The programs require only trivial modifications to 
 run on most machines: for example they work with only minor 
 modifications with Turbo Pascal, and without modifications on VAX 
 VMS Pascal. Pascal source code is distributed in the regular version 
 of PHYLIP: compiled object code is not.  To use that version, you 
 must have a Pascal compiler.

 DISKETTE DISTRIBUTION: The package is distributed in a variety of 
 microcomputer diskette formats.   You should send FORMATTED 
 diskettes, which I will return with the package written on them. 
 Unfortunately, I cannot write any Apple formats.   See below for how 
 many diskettes to send.  The programs on the magnetic tape or 
 electronic network versions may of course also be moved to 
 microcomputers using a terminal program.

 PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS 
 systems are available from me.  Specify the "PCDOS executable 
 version" and send the number of extra diskettes indicated below.   
 An Apple Macintosh version with precompiled code is available from 
 Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch 
 Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH 
 Amsterdam, Netherlands, who asks that you send 5 800K diskettes.

 HOW MANY DISKETTES TO SEND: The following table shows for different 
 PCDOS formats how many diskettes to send, and how many extra 
 diskettes to send for the PCDOS executable version: 

 Diskette size   Density   For source code    For executables, send
                                            in addition
   3.5 inch      1.44 Mb          2                     1
   5.25 inch      1.2 Mb          2                     2
   3.5 inch       720 Kb          4                     2
   5.25 inch      360 Kb          7                     4

 Some other formats are also available. You MUST tell me EXACTLY 
 which of these formats you need.  The diskettes MUST be formatted by 
 you before being sent to me. Sending an extra diskette may be 
 helpful.

 NETWORK DISTRIBUTION: The package is also available by distribution 
 of the files directly over electronic networks, and by anonymous ftp 
 from evolution.genetics.washington.edu.  Contact me by electronic 
 mail for details.

 TAPE DISTRIBUTION: The programs are also distributed on a magnetic 
 tape provided by you (which should be a small tape and need only be 
 able to hold two megabytes) in the following format: 9-track, ASCII, 
 odd parity, unlabelled,