Shell Scripts for use with
fastDNAml and DNArates
SUMMARY
UNIX shell scripts have proven quite useful in running the fastDNAml and
DNArates programs. They are used in two ways. First, many of the program
options can be invoked by simple editing of the input, and a script can do
that editing. Second, other scripts help run the programs and manage their
results.
bootstrap           add B (bootstrap) option (and optional seed) to input
categories          add C (rate categories) option and values to input
categories_file     add Y (categories file) option to input (DNArates)
clean_checkpoints   remove checkpoint files when there is a finished treefile
clean_jumbles       remove all but one optimal jumble for a given result
fastDNAml_boot      loop over bootstrap seeds, doing 1 or more jumbles each
fastDNAml_loop      do jumbles, stopping when same best tree found n times
frequencies         add F option and user-defined frequencies to input
global              add G (global) option (and optional region size) to input
jumble              add J (jumble) option (and optional seed) to input
min_info            add M (minimum information) option and value to input
n_categories        add C (categories) option (without rate values) to input
out.PID             append process ID to output file name of a program
outgroup            add O (outgroup) option and number to input
printdata           add 1 (print data) option to input
quickadd            add Q (quickadd) option to input
restart             add R (restart) option and checkpoint tree to input
scores              summarize and sort likelihoods from jumble output files
transition          add T (transition/transversion) option and value to input
treefile            add Y (treefile) option to input
trees2NEXUS         combine trees and add a NEXUS wrapper for PAUP and MacClade
trees2prolog        convert Newick format trees to prolog facts
userlengths         add L (userlengths) option to input
usertree            add U (usertree) option, tree count, and tree(s) to input
usertrees           add U (usertree) option, tree count, and tree(s) to input
weights             add W (userweight) option and values to the input
weights_categories  add W and C options and values to the input
SCRIPTS THAT INVOKE DNAML OPTIONS
GENERAL COMMENTS:
The program fastDNAml takes data from standard input. Thus, to run the
program with data in the file called "infile", the command would be
fastDNAml < infile > outfile
Because of the use of standard input, the input to fastDNAml can be
preprocessed by a script, and then piped to the program. For example,
bootstrap < infile | fastDNAml > outfile
or
bootstrap 137 < infile | fastDNAml > outfile
can be used to add the bootstrap option and a random number seed to the input,
and then pass it on to fastDNAml for analysis.
Many of the fastDNAml options are amenable to this arrangement. In
each case, the preprocessing can simply add options (and auxiliary data
lines, as necessary) to the input. In addition to avoiding the need to play
with UNIX text editors, there are several advantages to this approach:
1. The files remain relatively compatible with PHYLIP DNAML.
2. It reduces the chance of introducing errors into the data.
3. It is easier to try alternative options on the same data.
4. It avoids line truncation: if the data for each sequence are provided in
one long line (so that interleaved and non-interleaved formats are the
same), some text editors will truncate the lines.
Shell scripts are available for each of the above program options. The
corresponding formats and effects are described below.
THE SCRIPTS:
BOOTSTRAP (B)
Format: bootstrap [random_seed]
Example: bootstrap < infile | fastDNAml > outfile
Example: bootstrap 137 < infile | fastDNAml > outfile
Adds a bootstrap option and a random number seed to the input. If the random
seed is not supplied, then the process ID of the bootstrap shell is used. Thus,
repeated executions of the first example will tend to generate different random
samples (note that many systems only use about 32000 process IDs, so once you
get above 100 repetitions, reuse of the same number may become a significant
concern).
CATEGORIES (C)
Format: categories categories_data_file
Example: categories archae.rates archaea.out
Adds the categories option and the corresponding data to the input. The data
must have the format specified for PHYLIP dnaml 3.3. The first line must be
the letter C, followed by the number of categories (a number in the range 1
through 35), and then a blank-separated list of the rates for each category.
(The list can take more than one line; the program reads until it finds the
specified number of rate values.) The next line should be the word Categories
followed by one rate category character per sequence position. The categories
1 - 35 are represented by the series 1, 2, 3, ..., 8, 9, A, B, C, ..., Y, Z.
These latter data can be on one or more lines. For example,
C 12 0.0625 0.125 0.25 0.5 1 2 4 8 16 32 64 128
Categories 5111136343678975AAA8949995566778888889AAAAAA9239898629AAAAA9
633792246624457364222574877188898132984963499AA9899975
or, with more categories,
C 35 0.16529 0.29525 0.34482 0.40272 0.47035 0.54933 0.64157
0.74930 0.87512 1.02207 1.19369 1.39413 1.62823 1.90164
2.22096 2.59389 3.02945 3.53815 4.13227 4.82615 5.63654
6.58301 7.68841 8.97943 10.48723 12.24822 14.30490 16.70694
19.51232 22.78878 26.61541 31.08459 36.30423 42.40033 256.00000
Categories 4HHZ282111 21ED48H1HD Z1CD171411 1118F111EI IHI8ELBZZZ ZZZZZZZZZZ
ZZZZZZZZZZ 1MJZZMJLKL ZKL1ZZZZZZ ZZZZZZZZZZ ZZZZZZZZGH HHIGG43FOZ
Z2B9111324 1ZZZ171Z11 1184GH11ZZ IB1BBZ111J IB1ILKF4L1 21AEDE8111
111111ED9K 2219L3HGJ1 1Z1ZZMONMH ZZOMSQLM8Z 11411
(Notice that spaces are permitted in the categories data, and that the values
can extend across multiple lines. However, this means that extra values are
not permitted.)
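The mapping from category characters to category numbers, and a count of
the category characters supplied, can be sketched as follows. This is an
illustrative check only and is not part of the distributed scripts. The
function names category_number and count_category_positions are
hypothetical, and the second function assumes that it is given only the
Categories line and its continuation lines.

```shell
# Hypothetical helpers (illustration only).

# Print the category number (1-35) for one category character,
# using the series 1-9 followed by A-Z. Fails for any other input.
category_number() {
    awk -v c="$1" 'BEGIN {
        n = 0
        if (length(c) == 1)
            n = index("123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ", c)
        if (n == 0) exit 1
        print n
    }'
}

# Count the category characters on standard input, ignoring the leading
# word "Categories", blanks, and line breaks.
count_category_positions() {
    sed 's/^Categories//' | tr -d ' \t\n' | wc -c | tr -d ' '
}
```

The count can be compared with the number of sequence positions before
running the program, since extra values are not permitted.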
In order to generate output compatible with PHYLIP dnaml v3.3, this should be
the first option added (so that the categories data are inserted immediately
before the sequence data).
CATEGORIES_FILE (Y)
Format: categories_file
Adds the Y option to the input data for the DNArates program. Makes the
program write a file of weights and categories that can be directly added to
the input for the fastDNAml program (see weights_categories script).
Example: categories_file archaea.out
OUTGROUP (O)
Adds the outgroup option and appropriate auxiliary data line to the input. The
example will infer a tree for the archaea data, root it on sequence 5, and
write a tree to treefile.PID, where PID is a number (the process ID of
fastDNAml). The textual output from fastDNAml (a description of the analysis)
is written to archaea.out.
PRINTDATA (1)
Format: printdata
Example: printdata archaea.out
Adds a printdata option to the input. In the example, the file archaea.out
will include an echoing of the data in addition to the usual output.
QUICKADD (Q)
Format: quickadd
Example: quickadd archaea.out
Adds a quickadd option to the input. This greatly decreases the time spent
initially placing a new sequence in the growing tree (but does not change the
time required to subsequently test rearrangements). This will probably become
the default program behavior in the near future.
The one possible downside of the quickadd option is a decreased frequency of
finding the globally optimal tree. Since you should NEVER depend on a single
order of addition yielding the best tree, multiple jumble runs will still be
the best way to check the reproducibility of any presumptively optimal tree.
Quickadd should let you do this more quickly!
RESTART (R)
Format: restart checkpoint_file_name
Example: quickadd archaea.out
Example: transition 2.0