COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.

Pearson and the University of Virginia. All rights reserved. The

FASTA program and documentation may not be sold or incorporated

into a commercial product, in whole or in part, without written

consent of William R. Pearson and the University of Virginia.

For further information regarding permission for use or

reproduction, please contact: David Hudson, Assistant Provost for

Research, University of Virginia, P.O. Box 9025, Charlottesville,

VA 22906-9025, (804) 924-6853

 

The FASTA program package

Introduction

This documentation describes the version 2.0x of the FASTA

program package (see W. R. Pearson and D. J. Lipman (1988),

"Improved Tools for Biological Sequence Analysis", PNAS 85:2444-

2448, and W. R. Pearson (1990) "Rapid and Sensitive Sequence

Comparison with FASTP and FASTA" Methods in Enzymology 183:63-

98). Version 2.0 modifies version 1.8 to include explicit

statistical estimates for similarity scores based on the extreme

value distribution. In addition, FASTA protein alignments now

use the Smith-Waterman algorithm with no limitation on gap size.

FASTA and SSEARCH now use the BLOSUM50 matrix by default, with

options to change gap penalties on the command line. Version 1.7

replaces rdf2 and rss with prdf and prss, which use the extreme-

value distribution to calculate accurate probability estimates.

 

Although there are a large number of programs in this package,

they belong to four groups:

 

Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH

Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN

Statistical significance: PRDF, RELATE, PRSS, RANDSEQ

Global alignment: ALIGN

 

 

In addition, I have included several programs for protein

sequence analysis, including a Kyte-Doolittle hydropathicity

plotting program (GREASE, TGREASE), and a secondary structure

prediction package (GARNIER).

The FASTA sequence comparison programs on this disk are

improved versions of the FASTP program, originally described in

Science (Lipman and Pearson, (1985) Science 227:1435-1441). We

have made several improvements. First, the library search

programs use a more sensitive method for the initial comparison

of two sequences which allows the scores of several similar

regions to be combined. As a result, the results of a library

search are now given with three scores, initn (the new initial

score which may include several similar regions), init1 (the old

fastp initial score from the best initial region), and opt (the

old fastp optimized score allowing gaps in a 32 residue wide

band).

These programs have also been modified to become "universal"

(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or

FAST-N (nucleotides)); by changing the environment variable

SMATRIX, the programs can be used to search protein sequences,

DNA sequences, or whatever you like. By default, FASTA, LFASTA,

and the PRDF programs automatically recognize protein and DNA

sequences. Sequences are first read as amino acids, and then

converted to nucleotides if the sequence is greater than 85%

A,C,G,T (the '-n' option can be used to indicate DNA sequences).

TFASTA compares protein sequences to a translated DNA sequence.

Alternative scoring matrices can also be used. In addition to

the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices

based on simple identities or the genetic code can also be used

for sequence comparisons or evaluation of significance. Several

different protein sequence matrices have been included;

instructions for constructing your own scoring matrix are

included in the file FORMAT.DOC.

 

The remainder of this document is divided into three sections:

(1) a brief history of the changes to the FASTA package; (2) A

guide to installing the programs and databases; (3) A guide to

using the FASTA programs. The programs are very easy to use, so

if you are using them on a machine that is administered by

someone else, you may want to skip to section (3) to learn how to

use the programs, and then read section (1) to look at some of

the more recent changes. If you are installing the programs on

your own machine, you will need to read section (2) carefully.

 

1. Revision History

1.1. Changes with version 2.0u

Version 2.0u provides several major improvements over

previous versions of FASTA (and SSEARCH). The most important is

the incorporation of explicit statistical estimates and

appropriate normalization of similarity scores. This improvement

is discussed in more detail below in the section entitled

Statistical Significance. In addition, all of the protein

comparison programs now use the BLOSUM50 matrix, with gap

penalties of -12, -2, by default. BLOSUM50 performs

significantly better than the older PAM250 matrix. PAM250 can

still be used with the command line option: -s 250. (DNA

sequence comparisons use a more stringent gap penalty of -16, -4,

which produces excellent statistical estimates when optimized

scores are used. TFASTA uses -16, -4 as well.)

The quality of the fit of the extreme value distribution to

the actual distribution of similarity scores is summarized with

the Kolmogorov-Smirnov statistic. The acceptance limits for this

statistic can be found in many statistics books. In general,

values <0.10 (N=30) indicate excellent agreement between the

actual and theoretical distributions. If this statistic is >

0.2, consider using a higher (more stringent) gap penalty, e.g.

-16, -4 rather than -12, -2. The default scoring matrix for DNA

has been changed to score +5 for an identity and -4 for a

mismatch. These are the same scores used by BLASTN.

With explicit expectation calculations, the program now

shows all scores and alignments with expectations less than 10.0

(with optimized scores, 2.0 without optimization) when the "-Q"

(quiet) mode is used. The expectation threshold can be changed

with the "-E" option.

Finally, the algorithm used to produce the final alignments

of protein sequences is now a full Smith-Waterman, with unlimited

gaps. (The older band-limited alignments are used for DNA

sequences and TFASTA by default, because Smith-Waterman

alignments are very slow for long sequences.) Both the optimized

and Smith-Waterman scores are reported; if the Smith-Waterman

score is higher, then additional gaps allowed a better alignment

and similarity score to be calculated.

FASTA searches now optimize similarity scores by default

(this slows searches about 2-fold (worst case) for ktup=2). Thus,

the meaning of the "-o" option has been reversed; "-o" now turns

off optimization and reports results sorted by "initn" scores.

Optimization significantly improves the sensitivity of FASTA, so

that it almost matches Smith-Waterman. With version 2.0, the

default band width used for optimized calculations can be varied

with the "-y" option. For proteins with ktup=2, a width of 16

(-y 16) is used; 16 is also used for DNA sequences. For proteins

and ktup=1, a width of 32 is used. Searches that disable

optimization with the "-o" option will work fine for sequences

that share 25% or more identity in general, but to detect

evolutionary relationships with 20% - 25% identity, the more

sensitive default optimization is often required. Optimization

is required for accurate statistical estimates with either

protein or DNA sequences.

The FASTA package now includes FASTX, a program that

compares a DNA sequence to a protein sequence database by

translating the DNA sequence in three frames (the reverse frames

are selected with the -i option) and aligning the three-frame

translation with the sequences in the protein database.

Alignment scores allow frameshifts so that a cDNA or EST sequence

with insertion/deletion errors can be aligned with its homologues

from beginning to end.

With release 20u6, there is also a TFASTX program, which is

a replacement for TFASTA. TFASTA treats each of the six reading

frames of a DNA library sequence as a different sequence; TFASTX

compares a protein sequence against only two sequences from each

DNA sequence - the forward and reverse orientation. For a given

orientation, TFASTX calculates a similarity score for alignments

that allow frameshifts, thus considering all possible reading

frames.

Another new program is included - randseq - which will

produce a randomly shuffled (uniform or local shuffle) from an

input sequence. This randomly shuffled sequence can be used to

evaluate the statistical estimates produced by FASTA, SSEARCH, or

BLAST.

1.2. Changes with version 1.7

Version 1.7 has been released to provide the PRDF and PRSS

programs for shuffling sequences and estimating accurately the

probabilities of the unshuffled-sequence scores.

PRDF a version of RDF2 that uses calculates the probability

of a similarity score more accurately by using a fit to

an extreme value distribution. Code to fit the extreme

value distribution parameters and the impetus to update

RDF2 was provided by Phil Green, U. of Washington.

PRSS a version of PRDF that uses a rigorous Smith-Waterman

calculation to score similarities

1.3. Changes with version 1.6

FASTA version 1.6 uses a new method for calculating optimal

scores in a band (the optimization or last step in the FASTA

algorithm). In addition, it uses a linear-space method for

calculating the actual alignments. FASTA v1.6 package includes

several new programs:

SSEARCH a program to search a sequence database using the

rigorous Smith-Waterman algorithm (this program is

about 100-fold slower than FASTA with ktup=2 (for

proteins).

LALIGN A rigorous local sequence alignment program that will

display the N-best local alignments (N=10 by default).

PLALIGN a version of lalign that plots the local alignments to

a tektronix display.

FLALIGN a version of lalign that plots the local alignments to

a GCG Figure file.

The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"

algorithm described by Huang and Miller (1991) Adv. Appl. Math.

12:337-357. The SSEARCH and PRSS programs incorporate algorithms

described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.

LFASTA and PLFASTA now calculate a different number of local

similarities; they now behave more like LALIGN/PLALIGN. Since

local alignments of identical sequences produce "mirror-image"

alignments, lalign and lfasta consider only one-half of the

potential alignments between sequences from identical file names.

Thus

lfasta mchu.aa mchu.aa

Displays only two alignments, with earlier versions of the

program, it would have displayed five, including the identity

alignment. PLFASTA does display five alignments; when two

identical filenames are given, it draws the identity alignment,

calculates the two unique local alignments, draws them, and draws

their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the

filenames, rather than the actual sequences, to determine whether

sequences are identical; you can "trick" the programs into

behaving the old way by putting the same sequence in two

different files.

1.4. Changes with version 1.5

FASTA version 1.5 includes a number of substantial revisions

to improve the performance and sensitivity of the program. It is

now possible to tell the program to optimize all of the initn

scores greater than a threshold. The threshold is set at the

same value as the old FASTA cutoff score. Alternatively, you can

tell FASTA to sort the results by the init1, rather than the

initn, score by using the -1 option. FASTA -1 ... will report

the results the way the older FASTP program did.

A new method has been provided for selecting libraries. In

the past, one could enter the name of a sequence file to be

searched or a single letter that would specify a library from the

list included in the $FASTLIBS file. Now, you can specify a set

of library files with a string of letters preceded by a '%'.

Thus, if the FASTLIBS file has the lines:

Genbank 70 primates$1P/seqlib/gbpri.seq 1

Genbank 70 rodents$1R/seqlib/gbrod.seq 1

Genbank 70 other mammals$1M/seqlib/gbmam.seq 1

Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1

Then the string: "%PRMB" would tell FASTA to search the four

libraries listed above. The %PRMB string can be entered either

on the command line or when the program asks for a filename or

library letter.

FASTA1.5 also provides additional flexibility for specifying

the number of results and alignments to be displayed with the -Q

(quiet) option. The -b number option allows you to specify the

number of sequence scores to show when the search is finished.

Thus

 

FASTA -b 100 ...

 

tells the program to display the top 100 sequence scores. In the

past, if you displayed 100 scores (in -Q mode), you would also

have store 100 alignments. The -d option allows you to limit the

number of alignments shown. FASTA -b 100 -d 20 would show 100

scores and 20 alignments.

Finally, FASTA can provide a complete list of all of the

sequences and scores calculated to a file with the -r (results)

option. FASTA -r results.out ... creates a file with a list of

scores for every sequence in the library. The list is not

sorted, and only includes those scores calculated during the

initial scan of the library.

2. Installing the FASTA package

2.1. Installing the programs

2.1.1. Unix version

The FASTA distribution comes with several makefile's that

can be used to compile the FASTA programs. Over the years, as

ATT Unix System 5 and BSD unix have converged, these files have

become very similar. To begin with, I recommend using the

standard Makefile. There are two values in the makefile that

should be checked against the values used on your system: the HZ

value, which is the frequency in ticks per second used by the

times() system call, this value can usually be found by running:

grep HZ /usr/include/sys/*

and the functions available to return random numbers. If you

have a rand48() function that returns a 32-bit random number, use

it and use the lines:

NRAND=nrand48

RANFLG= -DRAND32

If not, you will need to use the rand() function call and

determine whether it returns a 16-bit or a 32-bit value. These

functions are used by PRDF and PRSS. If you have problems

compiling the programs, you may want to examine the makefile.unx

and makefile.sun files, to look for differences. I have tried to

use very standard unix functions in these programs, and they have

been successfully compiled, with very small changes to the

Makefile, on Sun's (Sun OS 4.1), IBM RS/6000's (AIX), and MIPS

machines (under the BSD environment).

2.1.2. IBM-PC/DOS version

For the IBM-PC/DOS version, the FASTA source code disk

contains the complete source code to all of the programs on the

other disks. The programs were compiled with Borland's Turbo

'C++', using Borland's MAKE utility. The graphics programs

(PLFASTA, TGREASE) use the graphics device drivers supplied with

the Turbo 'C' V2.0 package. Also included are the documentation

files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the

files the source code disk to run the programs. The files on

this disk are identical to the UNIX and VMS versions that run on

larger machines. Also included is the code to compile

ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize

for end-gaps.

If you have the DOS or Macintosh version of the FASTA

package, to install the programs you should:

(1) Make a new directory (folder) for the FASTA programs.

This need not be the same as the directory for your

sequence databases.

(2) Copy the files from the FASTA source disk to the new

directory.

(3) (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your

PATH command to include the FASTA directory and (b) add

the line:

set FASTLIBS=c:\yourfastadirectory\fastgbs

On the Macintosh, you may need to edit the "environment"

file and change the line that reads:

FASTLIBS=fastgbs

to indicate the full directory path for the fastgbs file,

for example:

FASTLIBS=Q105:FASTA:fastgbs

 

(4) Finally, you will need to edit the fastgbs file. This is

usually the most confusing part of the installation. An

example of this file is shown below; to customize this

file for your machine, you will need to change the file

names from those provided in the fastgbs file to ones that

reflect the directory names and file names you use on your

machine. This is explained in more detail below. In

addition, some entries in the fastgbs file refer to other

files of file names. These files of file names (as

opposed to actual database files) may also need to be

edited.

2.2. Installing the libraries

2.2.1. The NBRF protein sequence library

The FASTA program package does not include any protein or

DNA sequence libraries. You can obtain the PIR protein sequence

database from:

National Biomedical Research Foundation

Georgetown University Medical Center

3900 Reservoir Rd, N.W.

Washington, D.C. 20007

In addition, this database is available via anonymous ftp from

the host "ftp.bchs.uh.edu". It is available in two formats, VMS

and CODATA format. The "VMS" format (library type 5 below) can

be searched much faster, can be easily reformatted for use by the

"BLAST" rapid searching program, and is compatible with the

Genetics Computer Group package of programs. The CODATA format

is used by the EUGENE/MBIR computing package from Baylor (library

type 2).

2.2.2. The GENBANK DNA sequence library

FASTA, and TFASTA search sequences from the GENBANK

"flatfile" (not ASN.1) DNA sequence library in the flat-file

format distributed by the National Center for Biotechnology

Information and the PIR format used by EBI/EMBL. CD-ROMs can be

obtained from:

Genbank

National Center for Biotechnology Information

National Library of Medicine

National Institutes of Health

8600 Rockville Pike

Bethesda, MD 20894

 

The GenBank DNA sequence library is also available via

anonymous FTP from ncbi.nlm.nih.gov.

2.2.3. The EBI/EMBL CD-ROM libraries

The European Bioinformatics Institute (EBI) is now

distributing the EMBL CD-ROM that contains both the complete EMBL

DNA sequence database (which should be essentially identical to

the GenBank DNA sequence database) and the SWISS-PROT protein

sequence database. SWISS-PROT is derived from the NBRF Protein

sequence database with additions from the EBI/EMBL DNA sequence

database. This CD-ROM is a "best-buy," since it provides both

DNA and protein sequence libraries. It is available from:

 

European Bioinformatics Institute

Hinxton Genome Campus, Hinxton Hall

Hinxton, Cambridge CB10 1RQ,

United Kingdom

Tel: +44 1223 4944

Fax: +44 1223 494468

Email: DATALIB@ebi.ac.uk

 

 

In addition, the SWISS-PROT protein sequence database is

available via anonymous FTP from ncbi.nlm.nih.gov.

2.3. Finding the libraries: FASTLIBS

FASTA and TFASTA use the environment variable FASTLIBS to

find the protein and DNA sequence libraries. The FASTLIBS

variable contains the name of a file that has the actual

filenames of the libraries. The FASTGBS file on is an example of

a file that can be referred to by FASTLIBS. To use the FASTGBS

file, type:

setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)

or

export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the FASTGBS file to indicate where the protein and DNA

sequence libraries can be found. If you have a hard disk and

your protein sequence library is kept in the file

/usr/lib/aabank.lib and your Genbank DNA sequence library is kept

in the directory: /usr/lib/genbank, then fastgbs might contain:

NBRF Protein$0P/usr/lib/seq/aabank.lib 0

SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5

GB Primate$1P@/usr/lib/genbank/gpri.nam

GB Rodent$1R@/usr/lib/genbank/grod.nam

GB Mammal$1M@/usr/lib/genbank/gmammal.nam

^ 1 ^^^^ 4 ^ ^

23 (5)

The first line of this file says that there is a copy of the NBRF

protein sequence database (which is a protein database) that can

be selected by typing "P" on the command line or when the

database menu is presented in the file /usr/lib/seq/aabank.lib.

Note that there are 4 or 5 fields in the lines in fastgbs.

The first field is the description of the library which will be

displayed by FASTA; it ends with a '$'. The second field (1

character), is a 0 if the library is a protein library and 1 if

it is a DNA library. The third field (1 character) is the

character to be typed to select the library.

The fourth field is the name of the library file. In the

example above, the /usr/lib/seq/aabank.lib file contains the

entire protein sequence library. However the DNA library file

names are preceded by a '@', because these files (gpri.nam,

grod.nam, gmammal.nam) do not contain the sequences; instead they

contain the names of the files which contain the sequences. This

is done because the GENBANK DNA database is broken down in to a

large number of smaller files. In order to search the entire

primate database, you must search more than a dozen files.

In addition, an optional fifth field can be used to specify

the format of the library file. Alternatively, you can specify

the library format in a file of file names (a file preceded by an

'@'). This field must be separated from the file name by a space

character (' ') from the filename. In the example above, the

aabank.lib file is in Pearson/FASTA format, while the swiss.seq

file is in PIR/VMS format (from the EMBL CD-ROM). Currently,

FASTA can read the following formats:

0 Pearson/FASTA (>SEQID - comment/sequence)

1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)

2 NBRF CODATA (ENTRY/SEQUENCE)

3 EMBL/SWISS-PROT (ID/DE/SQ)

4 Intelligenetics (;comment/SEQID/sequence)

5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)

6 GCG (version 8.0) Unix Protein and DNA (compressed)

11 NCBI Blast1.3.2 format (unix only)

In particular, this version will work with the EMBL and PIR VMS

formats that are distributed on the EMBL CD-ROM. The latter

format (PIR VMS) is much faster to search than EMBL format. This

release also works with the protein and DNA database formats

created for the BLASTP and BLASTN programs by SETDB and PRESSDB

and with the new NCBI search format. If a library format is not

specified, for example, because you are just comparing two

sequences, Pearson/FASTA (format 0) is used by default. To

change this default, you may set the LIBTYPE environment variable

to a number. For example,

setenv LIBTYPE 1

would cause the program to use the GenBank LOCUS format by

default for libraries (or the second sequence file), but the

Pearson/FASTA format would still be used for the query sequence.

You can specify a group of library files by putting a '@'

symbol before a file that contains a list of file names to be

searched. For example, if @gmam.nam is in the fastgbs file, the

file "gmam.nam" might contain the lines:

</usr/lib/genbank

gbpri.seq 1

gbrod.seq 1

gbmam.seq 1

In this case, the line beginning with a '<' indicates the

directory the files will be found in. The remaining lines name

the actual sequence files. So the first sequence file to be

searched would be:

/usr/lib/genbank/gbpri.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating

system. Under UNIX, the trailing '/' is left off, so the library

directory might be written as "</usr/seqlib".

With version 1.4 of the FASTA package, the FASTA and TFASTA

programs can search a library composed of different files in

different sequence formats. For example, you may wish to search

the Genbank files (in GenBank flat file format) and the EMBL DNA

sequence database on CD-ROM. To do this, you simply list the

names and filetypes of the files to be searched in a file of

filenames. For example, to search the mammalian portion of

Genbank, the unannotated portion of Genbank, and the unannotated

portion of the EMBL library, you could use the file:

</usr/lib/DNA

gbpri.seq 1

# (this '#' causes the program to display the size of the library)

gbrod.seq 1

gbmam.seq 1

gbuna.seq 1

unanno.seq 5

#

You do not need to include library format numbers if you

only use the Pearson/FASTA version of the PIR protein se-

quence library. If no library type is specified, the

program assumes that type 0 is being used (unless you

have set LIBTYPE).

Support for the old compressed GenBank files, which have not been

distributed for more than four years, has been removed from

programs in the FASTA package.

 

Test the setup by running FASTA. Enter the sequence file

'MUSPLFM.AA' when the program requests it (this file is included

with the programs). The program should then ask you to select a

protein sequence library. Alternatively, if you run the TFASTA

program and use the MUSPLFM.AA query sequence, the program should

show you a selection of DNA sequence libraries. Once the fastgbs

file has been set up correctly, you can set FASTLIBS=fastgbs in

your AUTOEXEC.BAT file, and you will not need to remember where

the libraries are kept or how they are named.

FASTA and TFASTA must open a large number of files when

searching and reporting the results of a GENBANK floppy disk

format library search. You may have problems with the large

number of files under DOS on IBM-PC's (Unix and VMS users will

not have these problems). If you are going to search the GENBANK

floppy disk format DNA sequence library under DOS, you should add

the line:

FILES=16

to your CONFIG.SYS file. (Typically this is already done for

programs like Windows or WordPerfect.)

3. Using the FASTA Package

3.1. Overview

The FASTA sequence comparison programs all require similar

information, the name of a query sequence file, a library file,

and the ktup parameter. All of the programs can accept arguments

on the command line, or they will prompt for the file names and

ktup value.

To use FASTA, simply type:

FASTA

and you will be prompted for :

the name of the test sequence file

the name of the library file

and whether you want ktup = 1 or 2. (or 1 to 6 for DNA sequences)

ktup of 2 is about 5 times faster than ktup = 1.

For a 200 aa sequence against a 10,000,000 aa

library, the program takes about 30 min with

ktup = 2, 150 min with ktup = 1, on a 12 Mhz 286

IBM-PC.

 

The program can also be run by typing

FASTA test.aa /lib/bigfile.lib ktup (1 or 2)

 

Included with the package are the test files, MUSPLFM.AA,

LCBO.AA, MCHU.AA and BOVPRL.SEQ. To check to make certain that

everything is working, you can try:

fasta musplfm.aa lcbo.aa

and

tfasta musplfm.aa bovprl.seq

To test the local similarity programs LFASTA and PLFASTA, try:

lfasta mchu.aa mchu.aa

and

plfasta mchu.aa mchu.aa (use this only on an IBM-PC with graphics

or on a Tektronix terminal under UNIX or VMS)

MCHU (calmodulin) has four duplicated calcium binding sites that

are clearly detected by LFASTA. For a more complicated example,

try MWRTC1.aa, myosin heavy chain.

3.2. Sequence files

The FASTA programs know about three kinds of sequence files

(four under VMS): (1) plain sequence files that can only be used

as query sequences or for LFASTA, PRDF, and ALIGN. (2) Standard

library files. These are the same as plain sequence files, each

sequence is preceded by a comment line with a '>' in the first

column. (3) distributed sequence libraries (this is a broad class

that includes the NBRF/PIR VMS and blocked ascii formats, Genbank

flat-file format, EMBL flat-file format, and Intelligenetics

format. All of the files that you create should be of type (1)

or (2). Type (2) files (ones with a be used as query or library

sequence files by all of the programs.

I have included several sample test files, *.AA. The first

line may begin with a '>' or ';' followed by a comment. The

text after ';' in other lines will be ignored. Spaces and

tabs (and anything else that is not an amino-acid code) are

ignored.

Library files should have the form:

>Sequence name and identifier

A F A S Y T .... actual sequence.

F S S .... second line of sequence.

>Next sequence name and identifier

This is often referred to as "FASTA" or "Pearson" format. You

can build your own library by concatenating several sequence

files. Just be sure that each sequence is preceded by a line

beginning with a '>' with a sequence name.

The test file should not have lines longer than 120

characters, and sequences entered with word processors should use

a document mode, with normal carriage returns at the end of

lines.

Program Summary

3.3. Sequence search programs

FASTA universal sequence comparison. Defaults to comparing

protein sequences; if the sequences are > 85% A+C+G+T

or the -n option is used, a DNA sequence is assumed.

FASTX Search a protein sequence library using amino acid

sequence comparison to the forward three frames of a

translated DNA query sequence. (The reverse frames are

specified with the -i option.) Alignment scores allow

frameshifts; the final alignment uses a Smith-Waterman

type alignment routine (no limit on gaps) that allows

frameshifts.

TFASTA Search DNA library for a protein sequence by

translating the DNA sequence to protein in all six

frames (three forward frames with the -3 command line

option). TFASTA with ktup=2 is about as fast as a DNA

FASTA with ktup=4, and is substantially more sensitive.

(also reads the GENBANK library)

TFASTX Search DNA library for a protein sequence by

translating the DNA sequence to protein in all six

frames (three forward frames with the -3 command line

option) calculating similarity scores that allow

frameshifts. TFASTX produces an optimal Smith-Waterman

alignment of the query and translated-library sequence.

SSEARCH Universal sequence comparison using the Smith-Waterman

algorithm ( T. F. Smith and M. S. Waterman (1981) J.

Mol. Biol. 147:195-197). This program uses code

developed by Huang and Miller (X. Huang, R. C.

Hardison, W. Miller (1990) CABIOS 6:373-381) for

calculating the local similarity score and code from

the ALIGN program (see below) for calculating the local

alignment. SSEARCH is about 50-times slower than FASTA

with ktup=2 (for proteins).

ALIGN optimal global alignment of two sequences with no

short-cuts. This program is a slightly modified

version of one taken from E. Myers and W. Miller. The

algorithm is described in E. Myers and W. Miller,

"Optimal Alignments in Linear Space" (CABIOS (1988)

4:11-17).

3.4. Local similarity programs

LFASTA local similarity searches showing local alignments.

The algorithm used to calculate the local alignment in

a band has been improved (Chao, Pearson, and Miller,

submitted).

PLFASTA local similarity searches with plot output (on the IBM,

this program requires that the environment variable

BGIDIR be set).

PCLFASTA (unix only) local similarity searches with plot output

using pic commands.

LALIGN Calculates the N-best local alignments using a rigorous

algorithm. (N=10 by default.) The algorithm was

developed by Huang and Miller (X. Huang and W. Miller

(1991) Adv. Appl. Math. 12:337-357), which is a

linear-space version of an algorithm described by M. S.

Waterman and M. Eggert (J. Mol. Biol. 197:723-728).

Like SSEARCH, LALIGN is rigorous, but also very slow.

PLALIGN A version of LALIGN that plots its output to a screen

or to a Tektronix terminal emulator.

3.5. Statistical Significance

With version 2.0 of the FASTA program distribution, FASTA,

TFASTA, and SSEARCH now provide estimates of statistical

significance for library searches. Work by Altschul, Arratia,

Karlin, Mott, Waterman, and others (see Altschul et al. (1994)

Nature Genetics 6:119 for an excellent review) suggests that

local sequence similarity scores follow the extreme value

distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u)) where

u = ln(Kmn)/lambda and m,m are the lengths of the query and

library sequence. This formula can be rewritten as: 1 - exp(-Kmn

exp(-lambda x), which shows that the average score for an

unrelated library sequence increases with the logarithm of the

length of the library sequence. FASTA and SSEARCH use simple

linear regression against the the log of the library sequence

length to calculate a normalized "z-score" with mean 50,

regardless of library sequence length, and variance 10. These

z-scores can then be used with the extreme value distribution and

the poisson distribution (to account for the fact that each

library sequence comparison is an independent test) to calculate

the number of library sequences to obtain a score greater than or

equal to the score obtained in the search. The original idea and

routines to do the linear regression on library sequence length

were provided Phil Green, U. Washington. This version of FASTA

and SSEARCH uses a slightly different strategy for fitting the

data than those originally provided by Dr. Green.

The expected number of sequences is plotted in the histogram

using an "*". Since the parameters for the extreme value

distribution are not calculated directly from the distribution of

similarity scores, the pattern of "*'s" in the histogram gives a

qualitative view of how well the statistical theory fits the

similarity scores calculated by FASTA and SSEARCH. For FASTA, if

optimized scores are calculated for each sequence in the database

(the default), the agreement between the actual distribution of

"z-scores" and the expected distribution based on the length

dependence of the score and the extreme value distribution is

usually very good. Likewise, the distribution of SSEARCH Smith-

Waterman scores typically agrees closely with the actual

distribution of "z-scores." The agreement with unoptimized

scores, ktup=2, is often not very good, with too many high

scoring sequences and too few low scoring sequences compared with

the predicted relationship between sequence length and similarity

score. In those cases, the expectation values may be

overestimates.

The statistical routines assume that the library contains a

large sample of unrelated sequences. If this is not the case,

then the expectation values are meaningless. Likewise, if there

are fewer than 20 sequences in the library, the statistical

calculations are not done.

For protein searches, library sequences with E() values <

0.01 for searches of a 10,000 entry protein database are almost

always homologous. Frequently sequences with E()-values from 1 -

10 are related as well. Remember, however, that these E() values

also reflect differences between the amino acid composition of

the query sequence and that of the "average" library sequence.

Thus, when searches are done with query sequences with "biased"

amino-acid composition, unrelated sequences may have

"significant" scores because of sequence bias. The programs

below, PRDF and PRSS, can address this problem by calculating

similarity scores for random sequences with the same length and

amino acid composition.

If optimization is not used ("-o"), E-values for DNA

sequences overestimate the significance of the scores that are

obtained and unrelated sequences frequently have E()-values <

0.0005. With optimization, the agreement between E()-value

compares favorably with protein sequence comparison. This is in

part due to the use of more stringent gap penalties for DNA

sequence comparison, -16, -4 rather than -12, -2. With the

latter penalties, many unrelated sequences appear to have

significant similarity. Nevertheless, since protein sequence

comparison is much more sensitive, DNA sequence comparison should

not be used to identify sequences that encode protein. Even with

ktup=6, optimization rarely increases run-times more than 50%

with mRNA-size query sequences. Optimization should be used

whenever possible.

Similar comments apply to TFASTA, where higher gap

penalties (-16,-4) are required for accurate statistical

estimates. Because TFASTA produces so many artificial "coding"

sequences with atypical amino acid compositions, the statistical

estimates with TFASTA are often over estimates. With optimized

scores, ktup=1, and gap penalties of -16, -4, unrelated sequences

will sometimes have E() values of 0.1. If initn scores are used,

unrelated sequences may have have E() values < 0.01.

PRDF improved version of RDF program that includes accurate

probability estimates for all three scoring methods

(includes local or window shuffle routine)

PRSS A version of PRDF that uses the rigorous Smith-Waterman

calculation used by SSEARCH.

RANDSEQ produces a randomly shuffled sequence from a query

sequence.

RELATE significance program described by Dayhoff (Atlas of

Protein Sequence and Structure, Vol. 5, Supplement 3).

Each chunk of 25 residues in one sequence is compared

to every 25 residue fragment of the second sequence.

Sequences which are genuinely related will have a large

number of scores greater than 3 standard deviations

above the mean score of all of the comparisons.

3.6. Other analysis programs

AACOMP calculate the amino acid composition and molecular

weight of a sequence.

BESTSCOR calculate the best self-comparison score.

GREASE Kyte-Doolittle hydropathicity profile

TGREASE graphic plot of Kyte-Doolittle profile

FROMGB convert from GenBank LOCUS format (also used by the

IBI-Pustell programs) to Pearson/FASTA format.

GARNIER A secondary structure prediction program using the

method of Garnier, Osgusthorpe, and Robson, J. Mol.

Biol., (1978) 120:97-120.

3.7. Options

These programs have a number of output options, which are

invoked by the environment variables LINLEN, SHOWALL, and MARKX.

Alternatively, these values can be controlled by command line

options. The number of sequence residues per output line is now

adjustable by setting the environment variable LINLEN, or the

command line option -w. LINLEN is normally 60, to change it set

LINLEN=80 before running the program or add -w 80 to the command

line. LINLEN can be set up to 200. SHOWALL (-a) determines

whether all, or just a portion, of the aligned sequences are

displayed. Previously, FASTP would show the entire length of

both sequences in an alignment while FASTN would only show the

portions of the two sequences that overlapped. Now the default is

to show only the overlap between the two sequences, to show

complete sequences, set SHOWALL=1, or use the -a option on the

command line.

The differences between the two aligned sequences can be

highlighted in three different ways by changing the environment

variable MARKX or the -m option. Normally (MARKX=0) the program

uses ':' do denote identities and '.' to denote conservative

replacements. If MARKX=1, the program will not mark identities;

instead conservative replacements are denoted by a 'x' and non-

conservative substitutions by a 'X'. If MARKX=2, the residues in

the second sequence are only shown if they are different from the

first. MARKX=3 displays the aligned library sequences without the

query sequence; these can be used to build a primitive multiple

alignment. MARKX=4 provides a graphical display of the

boundaries of the alignments. Thus the five options are:

 

MARKX=0 MARKX=1 MARKX=2 MARKX=3 MARKX=4

MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT MWRTCGPPYT

::..:: ::: xx X ..KS..Y... MWKSCGYPYT ----------

MWKSCGYPYT MWKSCGYPYT

 

(fasta20u4, Feb. 1996) In addition MARKX=10 is a new, parseable

format for use with other programs. See the file"readme.v20u4"

for a more complete description.

3.8. Command line options

It is now possible to specify several options on the

command line, instead of using environment variables. The

command line options are preceded by a dash; the following

options are available:

-a same as showall=1

-A force Smith-Waterman alignments for DNA sequences and

TFASA. By default, only FASTA protein sequence

comparisons use Smith-Waterman alignments.

-b # Number of sequence scores to be shown on output. In

the absence of this option, fasta (and tfasta and

ssearch) display all library sequences obtaining

similarity scores with expectations less than 10.0 if

optimized score are used, or 2.0 if they are not. The

-b option can limit the display further, but it will

not cause additional sequences to be displayed.

-c # Threshold score for optimization (OPTCUT). Set "-c 1"

to optimize every sequence in a database. (This slows

the program down about 5-fold).

-E # Limit the number of scores and alignments shown based

on the expected number of scores. Used to override the

expectation value of 10.0 used by default. When used

with -Q, -E 2.0 will show all library sequences with

scores with an expectation value <= 2.0.

-d # Number of alignments to be reported by default. (Used

in conjunction with -Q). No longer necessary, see "-b"

above.

-f Penalty for the first residue in a gap (-12 by default

for proteins, -16 for DNA or for TFASTA).

-g Penalty for additional residues in a gap (-2 by default

for proteins, -4 for DNA and TFASTA ).

-h Penalty for frameshift (FASTX, TFASTX only).

-H Omit histogram.

-i Invert (reverse complement) the query sequence if it is

DNA. For TFASTX, search the reverse complement of the

library sequence only.

-k # Threshold for joining init1 segments to build an initn

score (GAPCUT).

-l file Location of library menu file (FASTLIBS).

-L Display more information about the library sequence in

the alignment.

-m # MARKX = # (0, 1, 2, 3, 4, 10)

-n Force the query sequence to be treated as a DNA

sequence. This is particularly useful for query

sequences that contain a large number of ambiguous

residues, e.g. transcription factor binding sites.

-O Send copy of results to "filename." Helpful for

environments without STDOUT.

-o Turn off default optimization of all scores greater

than OPTCUT. Sort results by "initn" scores.

-Q,-q Quiet - does not prompt for any input. Writes scores

and alignments to the terminal or standard output file.

-r file Save a results summary line for every sequence in the

sequence library. The summary line includes the

sequence identifier, superfamily number (if available)

position in the library, and the similarity scores

calculated. This option can be used to evaluate the

sensitivity and selectivity of different search

strategies (see W. R. Pearson (1991) Genomics 11:635-

650.)

-s file SMATRIX is read from file. Several SMATRIX files are

provided with the standard distribution. For protein

sequences: codaa.mat - based on minimum mutation

matrix; idnaa.mat - identity matrix; pam250.mat - the

PAM250 matrix developed by Dayhoff et al (Atlas of

Protein Sequence and Structure, vol. 5, suppl. 3,

1978); pam120.mat - a PAM120 matrix. The default

scoring matrix is BLOSUM50, PAM250 is available with

"-s 250", BLOSUM62 ("-s BL62") is also available.

-v (LINEVAL) values used for line styles in plfasta

-w # Line length (width) = number (<200)

-x Specifies offsets for the beginning of the query and

library sequence. For example, if you are comparing

upstream regions for two genes, and the first sequence

contains 500 nt of upstream sequence while the second

contains 300 nt of upstream sequence, you might try:

fasta -x "-500 -300" seq1.nt seq2.nt

If the -x option is not used, FASTA assumes numbering

starts with 1. This option will not work properly with

the translated library sequence with tfasta. (You

should double check to be certain the negative

numbering works properly.)

-y Set the width of the band used for calculating

"optimized" scores. For proteins and ktup=2, the width

is 16. For proteins with ktup=1, the width is 32 by

default. For DNA the width is 16.

-z Turn off statistical calculations.

-1 sort output by init1 score (as FASTP used to do).

-3 (TFASTA, TFASTX only) translate only three forward

frames

 

For example:

fasta -w 80 -a seq1.aa seq.aa

would compare the sequence in seq1.aa to that in seq2.aa and

display the results with 80 residues on an output line, showing

all of the residues in both sequences. Be sure to enter the

options before entering the file names, or just enter the options

on the command line, and the program will prompt for the file

names.

Not all of these options are appropriate for all of the

programs. The options above are used by FASTA and TFASTA. RELATE

uses the -s option, ALIGN uses the -w, -m, and -s options, and

the PRDF program uses -c, -f, -k, and -s.

4. Environment variable summary

Environment variables allow you to set search parameters

that will be used frequently when you run a program; for example,

if you prefer to use the PAM250 scoring matrix, you might "set

SMATRIX=250." Command line parameters, if used, always override

environment variable settings. The following environment

variables are used by this program:

AABANK the file name of the default sequence library.

FASTLIBS the location of the file which contains the list of

library files to be searched.

GAPCUT threshold used for joining init1 regions in the second

step of FASTA. Normally set based on sequence length

and ktup.

LIBTYPE used to specify the format of the library sequence for

FASTA and TFASTA.

LINLEN output line length - can go up to 200

LINEVAL used by plfasta to determine the relationship between

line style and similarity score (-v). This should be a

string of three numbers, e.g. "200 100 50"

MARKX symbol for denoting matches, mismatches. Note that this

symbol is only used across the optimized local region;

sequences that are outside this region are not marked.

OPTCUT Set the threshold to be used for optimization in a band

around the best initial region. Normally the OPTCUT

value is calculated from the length of the sequence and

the ktup value (for a 200 residue sequence, it is about

28). If OPTCUT=1, every sequence in the database will

be optimized. This is the most sensitive option.

PAMFACT This version of fasta uses a more sensitive method for

identifying initial regions. Instead of using a

constant factor (fact) for each match in a ktup, it

uses the scoring matrix (PAM) scores. While this works

well for protein sequences, it has not been as

carefully tested for DNA sequences, so by default, this

modification is used for proteins but not for DNA.

Setting the PAMFACT environment variable to 1 forces

the option on; PAMFACT=0 turns it off.

SHOWALL on output, show the complete sequence instead of just

the overlap of the two aligned sequences.

SMATRIX alternative scoring matrix file.

TEKPLOT (IBM-PC only, Unix and VMS versions generate Tektronix

graphics by default) Generate Tektronix output.

Normally, PLFASTA and TGREASE plot graphs using the

Turbo C graphics library. Unfortunately, often these

plots cannot be printed out without special programs.

However, if you set TEKPLOT=1, tektronix graphics

commands will be used. Tektronix commands can be used

together with the PLOTDEV program, available from

Microplot Systems. They no lonter sell this program,

but it can be downloaded from

http://iquest.com/~microplt/index1.html. PLOTDEV also

allows you to print out graphics on the screen.

As always, please inform me of bugs as soon as possible.

William R. Pearson

Department of Biochemistry

Box 440, Jordan Hall

U. of Virginia

Charlottesville, VA

wrp@virginia.EDU