TransMem scans for likely transmembrane helices in one or more input protein sequences.
TransMem builds on the method of Sonnhammer et al. (Proc. of Sixth Int. Conf. on Intelligent Systems for Molecular Biology, 175-182 (1998)) to predict likely transmembrane helices in one or more input proteins. The method is based upon a Hidden Markov Model (HMM) that has been trained on a set of membrane proteins with helical membrane spanning regions.
Here is a session using TransMem to generate two predictions for the delta subunit of the mouse GABA(A) receptor.
% transmem sw:gad_mouse

 TransMem scans for likely transmembrane helices in one or more input
 protein sequences.

 Number of different annotations of each sequence (* 1 *) ? 2

 Proximity of feature boundaries to consider annotations equivalent (* 0 *) ?

 What should I call the output file (* gad_mouse.transmem *) ?

              Helix   Inside   Outside   Relative Score
 GAD_MOUSE0     4       2        3          0.0000
 GAD_MOUSE1     4       2        3          0.4922

 CPU time: 0.560000
 Sequences examined: 1
 Sequences written: 2
 Results written to "gad_mouse.transmem"

%
The output from TransMem is a list file, and is suitable for input to any GCG program that allows indirect file specifications. (For information about indirect file specification, see Chapter 2, Using Sequence Files and Databases of the User's Guide.)
!!SEQUENCE_LIST 1.0
TransMem of sw:gad_mouse
-MINHelix = 1
-MEthod=Nbest
-NBest = 2
-PROXimity = 0
August 13, 2001 16:36
Helix Inside Outside Relative Score
..
sw:gad_mouse ! 4 2 3 0.0000
sw:gad_mouse ! 4 2 3 0.4922
\\End of List
>>SW:GAD_MOUSE
P22933 mus musculus (mouse). gamma-aminobutyric-acid receptor delta subunit prec
Begin End
Outside 1 249
Helix 250 272
Inside 273 278
Helix 279 297
Outside 298 311
Helix 312 334
Inside 335 425
Helix 426 448
Outside 449 449
Outside 1 248
Helix 249 271
Inside 272 277
Helix 278 296
Outside 297 310
Helix 311 333
Inside 334 425
Helix 426 448
Outside 449 449
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
1 MDVLGWLLLP LLLLCTQPHH GARAMNDIGD YVGSNLEISW LPNLDGLMEG
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
51 YARNFRPGIG GAPVNVALAL EVASIDHISE ANMEYTMTVF LHQSWRDSRL
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
101 SYNHTNETLG LDSRFVDKLW LPDTFIVNAK SAWFHDVTVE NKLIRLQPDG
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO
151 VILYSIRITS TVACDMDLAK YPLDEQECML DLESYGYSSE DIVYYWSENQ
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOH
OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOOO OOOOOOOOHH
201 EQIHGLDRLQ LAQFTITSYR FTTELMNFKS AGQFPRLSLH FQLRRNRGVY
HHHHHHHHHH HHHHHHHHHH HHIIIIIIHH HHHHHHHHHH HHHHHHHOOO
HHHHHHHHHH HHHHHHHHHH HIIIIIIHHH HHHHHHHHHH HHHHHHOOOO
251 IIQSYMPSVL LVAMSWVSFW ISQAAVPARV SLGITTVLTM TTLMVSARSS
OOOOOOOOOO OHHHHHHHHH HHHHHHHHHH HHHHIIIIII IIIIIIIIII
OOOOOOOOOO HHHHHHHHHH HHHHHHHHHH HHHIIIIIII IIIIIIIIII
301 LPRASAIKAL DVYFWICYVF VFAALVEYAF AHFNADYRKK RKAKVKVTKP
IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII
IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII IIIIIIIIII
351 RAEMDVRNAI VLFSLSAAGV SQELAISRRQ GRVPGNLMGS YRSVEVEAKK
IIIIIIIIII IIIIIIIIII IIIIIHHHHH HHHHHHHHHH HHHHHHHHO
IIIIIIIIII IIIIIIIIII IIIIIHHHHH HHHHHHHHHH HHHHHHHHO
401 EGGSRPGGPG GIRSRLKPID ADTIDIYARA VFPAAFAAVN IIYWAAYTM
CPU time: 0.560000
Sequences examined: 1
Sequences written: 2
The first part of the output file contains a list of all the sequences searched and the predictions generated for a given sequence. When multiple predictions are generated for each sequence, the predictions are listed in order of prediction quality, with the best prediction on top and the sub-optimal predictions below.
Next to each sequence, the file contains the raw counts of how many transmembrane helices, inner loops, and outer loops were found. If you have generated more than one prediction per sequence, there is also a score reported for comparing the quality of the prediction with the best prediction for each sequence. This is a relative measure only and should not be used to compare the quality of predictions between different sequences. In general, a score of 10 or more indicates that the prediction is significantly different from the best prediction.
Following this list of sequences, TransMem displays a table listing the specific boundaries of each feature predicted, followed by the sequence aligned with the predicted labels.
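The relationship between the boundary table and the label rows can be checked with a short script. The following is a minimal sketch in Python, not part of TransMem: the helper name expand_labels and the tuple layout are assumptions made for illustration. It expands the first boundary table from the sample output into one label character per residue.

```python
# Hypothetical helper, not part of TransMem: expand a Begin/End feature
# table into the one-letter label string printed above the sequence.
LABEL_CODES = {"Outside": "O", "Helix": "H", "Inside": "I"}


def expand_labels(features):
    """features: list of (label, begin, end), 1-based and inclusive."""
    labels = []
    expected_begin = 1
    for name, begin, end in features:
        # Features must tile the sequence with no gaps or overlaps.
        if begin != expected_begin or end < begin:
            raise ValueError(f"{name} {begin}-{end} leaves a gap or overlap")
        labels.append(LABEL_CODES[name] * (end - begin + 1))
        expected_begin = end + 1
    return "".join(labels)


# First prediction for sw:gad_mouse, copied from the sample output.
best = [
    ("Outside", 1, 249), ("Helix", 250, 272), ("Inside", 273, 278),
    ("Helix", 279, 297), ("Outside", 298, 311), ("Helix", 312, 334),
    ("Inside", 335, 425), ("Helix", 426, 448), ("Outside", 449, 449),
]
label_string = expand_labels(best)
print(len(label_string))       # one label per residue
print(label_string[250:300])   # labels for residues 251-300
```

The slice for residues 251-300 should match the first label row printed above sequence line 251 in the sample output, once the spaces between blocks of ten are ignored.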
TransMem takes any valid GCG specification for one or more protein sequences.
SPScan scans protein sequences for the presence of secretory signal peptides (SPs).
HTHScan scans protein sequences for the presence of helix-turn-helix motifs, indicative of sequence-specific DNA-binding structures often associated with gene regulation.
HelicalWheel plots a peptide sequence as a helical wheel to help you recognize amphiphilic regions.
PeptideStructure makes secondary structure predictions for a peptide sequence. The predictions include (in addition to alpha, beta, coil, and turn) measures for antigenicity, flexibility, hydrophobicity, and surface probability. PlotStructure displays the predictions graphically.
PepPlot plots measures of protein secondary structure and hydrophobicity in parallel panels of the same plot.
CoilScan locates coiled-coil segments in protein sequences.
TransMem only works on protein sequences.
TransMem is based upon a Hidden Markov Model (HMM) architecture. The architecture is made up of 7 types of states: the core of the transmembrane helix, helix caps, cytoplasmic loops, short and long non-cytoplasmic loops, and globular domains that are part of each loop.
The states have a close relationship with the biology of membrane proteins: a loop state connects to another loop through a helix cap, a helix core, and another helix cap. Each state corresponds to one of three labels: Inside (cytoplasmic), Helix (membrane-spanning helix), or Outside (non-cytoplasmic).
TransMem predicts transmembrane helices by fitting the sequence to the model with the N-Best algorithm, which uses the model architecture to find the best labeling of the sequence, given the model.
Alternatively, you can run TransMem using the Viterbi algorithm, which finds the optimal alignment of the sequence with the model, then uses that alignment to read the labels. In general, the Viterbi algorithm will give the same results as the N-Best, but in some cases the predictions will differ.
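To make the idea of a single optimal alignment concrete, here is a minimal Viterbi sketch in Python. It is not TransMem's implementation: the two-state model (L for loop, H for helix), the two-letter alphabet (h for hydrophobic, p for polar), and all probabilities are invented for illustration, whereas the real model has 7 types of states trained on membrane proteins.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the single most probable state path for the observations."""
    # best[t][s]: best log-probability of any path that ends in state s at t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


def to_log(table):
    """Convert a dict (or dict of dicts) of probabilities to logs."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Toy parameters (assumptions, not TransMem's trained values).
states = ["L", "H"]
log_start = to_log({"L": 0.9, "H": 0.1})
log_trans = to_log({"L": {"L": 0.8, "H": 0.2}, "H": {"L": 0.2, "H": 0.8}})
log_emit = to_log({"L": {"p": 0.8, "h": 0.2}, "H": {"p": 0.2, "h": 0.8}})

best_path = viterbi("ppphhhhppp", states, log_start, log_trans, log_emit)
print(best_path)
```

Reading the labels off the returned path is the step described above as using the alignment to read the labels. Because only one path is scored, this approach yields one prediction per sequence.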
The output of the raw probabilities is based upon the forward-backward algorithm, in which TransMem finds the probability of each labeling (Inside, Outside, or Helix) summed over all the possible alignments of the sequence to the model. Because these values are based upon all possible alignments of the model instead of a single optimal alignment, occasionally the raw probabilities will contradict the final labeling.
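The per-residue probabilities can be illustrated with a small forward-backward sketch in Python. As before, this is an illustration under stated assumptions and not TransMem's code: the two-state model (L for loop, H for helix), the h/p alphabet, and every probability below are invented.

```python
def posterior_labels(observations, states, start, trans, emit):
    """Per-position state probabilities summed over all state paths."""
    n = len(observations)
    # Forward pass: fwd[t][s] = P(obs[0..t], state at t is s).
    fwd = [{s: start[s] * emit[s][observations[0]] for s in states}]
    for obs in observations[1:]:
        prev = fwd[-1]
        fwd.append({s: emit[s][obs] * sum(prev[p] * trans[p][s] for p in states)
                    for s in states})
    # Backward pass: bwd[t][s] = P(obs[t+1..] | state at t is s).
    bwd = [{s: 1.0 for s in states}]
    for obs in reversed(observations[1:]):
        nxt = bwd[0]
        bwd.insert(0, {s: sum(trans[s][q] * emit[q][obs] * nxt[q] for q in states)
                       for s in states})
    total = sum(fwd[-1][s] for s in states)
    return [{s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)]


# Toy parameters (assumptions, not TransMem's trained values).
states = ["L", "H"]
start = {"L": 0.9, "H": 0.1}
trans = {"L": {"L": 0.8, "H": 0.2}, "H": {"L": 0.2, "H": 0.8}}
emit = {"L": {"p": 0.8, "h": 0.2}, "H": {"p": 0.2, "h": 0.8}}

posteriors = posterior_labels("ppphhhhppp", states, start, trans, emit)
for position, probs in enumerate(posteriors, start=1):
    likeliest = max(probs, key=probs.get)
    print(position, likeliest, round(probs[likeliest], 3))
```

Each row sums to 1 because every possible path contributes to exactly one label at each position. Picking the likeliest label position by position does not guarantee a valid overall topology, which is one way the raw probabilities can disagree with the final labeling.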
When no transmembrane helices are predicted, it is not a good idea to treat the Inside/Outside prediction as an accurate measure of whether or not the peptide is secreted. The inner and outer labeling is only meaningful for integral membrane proteins.
When using the N-Best algorithm, you can also choose to merge predictions with a given overlap. The boundaries of transmembrane helices have an experimental error of a few residues, a fact which was incorporated into the training of the model architecture. By allowing a merging of overlapping predictions, TransMem allows you to blur edges of the predicted helices, which in turn will cause the N-Best algorithm to generate predictions with significant differences.
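The manual does not spell out the exact rule TransMem uses to decide that two predictions are equivalent. The Python sketch below assumes one plausible reading: two predictions are equivalent when they list the same features in the same order and every corresponding boundary differs by no more than the proximity value. The function names are hypothetical. The data are the two predictions for sw:gad_mouse from the sample output.

```python
def equivalent(pred_a, pred_b, proximity):
    """True if both predictions have the same label order and every
    corresponding boundary differs by at most `proximity` residues."""
    if [f[0] for f in pred_a] != [f[0] for f in pred_b]:
        return False
    return all(abs(ba - bb) <= proximity and abs(ea - eb) <= proximity
               for (_, ba, ea), (_, bb, eb) in zip(pred_a, pred_b))


def distinct_predictions(predictions, proximity):
    """Keep a prediction only if no better-ranked one is equivalent."""
    kept = []
    for pred in predictions:
        if not any(equivalent(pred, k, proximity) for k in kept):
            kept.append(pred)
    return kept


# The two predictions from the sample output, best first.
first = [
    ("Outside", 1, 249), ("Helix", 250, 272), ("Inside", 273, 278),
    ("Helix", 279, 297), ("Outside", 298, 311), ("Helix", 312, 334),
    ("Inside", 335, 425), ("Helix", 426, 448), ("Outside", 449, 449),
]
second = [
    ("Outside", 1, 248), ("Helix", 249, 271), ("Inside", 272, 277),
    ("Helix", 278, 296), ("Outside", 297, 310), ("Helix", 311, 333),
    ("Inside", 334, 425), ("Helix", 426, 448), ("Outside", 449, 449),
]

print(len(distinct_predictions([first, second], proximity=0)))
print(len(distinct_predictions([first, second], proximity=1)))
```

Under this assumed rule, the two sample predictions count as different at a proximity of 0, which matches the sample session, and collapse into one at a proximity of 1, since their boundaries never differ by more than one residue.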
The N-Best algorithm will always try to find the best labeling of your sequence that matches the parameters of minimum and maximum number of helices, even if this is not the best overall labeling. For example, if you have some other experimental evidence that suggests you are working with a 7 transmembrane protein, yet the algorithm gives you a prediction of 8 transmembrane helices, you can specify a minimum and maximum helix range of 7, which will force the algorithm to find this prediction. If the application is not able to find any matching predictions, try increasing the value of N-Best, which will increase the number of different predictions that the algorithm will consider.
Increasing the value for N-Best increases the number of different predictions that TransMem generates and analyzes. Consequently, you may see weakly predicted helices that would otherwise not be visible, as well as many more false predictions. A helix that appears in a large number of predictions is more likely to be an actual helix and not a false positive. Because more predictions are considered, computation time also increases dramatically with the value for N-Best.
Because the N-Best algorithm tries to find a prediction that matches the restrictions you specify, it may not be useful for screening protein sequences for a given number of transmembrane helices. For screening, we recommend the Viterbi algorithm, which is more discriminating and runs faster.
If you have a sequence for which you have experimental evidence of a particular number of transmembrane helices, yet the algorithm does not predict the correct number, specify this number with -MINHelix and -MAXHelix, then try increasing the value for N-Best and the tolerance for merging overlapping predictions. In some cases, this will allow the algorithm to find the helices.
If you are screening large amounts of data for 7-transmembrane proteins, for example, it probably isn't a good idea to limit the search to predictions with exactly 7 transmembrane regions. A more complete search results from accepting anything that contains 6-8 transmembrane regions.
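The effect of widening the helix range can be sketched by filtering list-file lines after a run. The Python below is an illustration only: the function names are hypothetical, the line layout is assumed from the sample output file, and every entry except sw:gad_mouse is invented to show the filter at work.

```python
def parse_list_line(line):
    """Parse 'name ! helices inside outside score' into a dict."""
    name, counts = line.split("!")
    helix, inside, outside, score = counts.split()
    return {"name": name.strip(), "helix": int(helix), "inside": int(inside),
            "outside": int(outside), "score": float(score)}


def within_helix_range(records, min_helix, max_helix):
    """Keep records whose helix count lies in [min_helix, max_helix]."""
    return [r for r in records if min_helix <= r["helix"] <= max_helix]


lines = [
    "sw:gad_mouse ! 4 2 3 0.0000",    # from the sample output
    "example:seq_a ! 7 4 4 0.0000",   # invented entry
    "example:seq_b ! 6 4 3 0.0000",   # invented entry
    "example:seq_c ! 8 4 5 0.0000",   # invented entry
]
records = [parse_list_line(line) for line in lines]

strict = within_helix_range(records, 7, 7)
relaxed = within_helix_range(records, 6, 8)
print([r["name"] for r in strict])
print([r["name"] for r in relaxed])
```

A range of 6-8 keeps candidates in which one helix was missed or one extra helix was predicted. A range of exactly 7 would drop them.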
TransMem only recognizes transmembrane alpha helices. All other types of membrane spanning regions are not recognized.
Because TransMem produces a self-consistent topology prediction, the topology will be wrong if it misses any transmembrane helices.
Predicted transmembrane helices in the N-terminal region sometimes turn out to be signal peptides.
A non-redundant data set of 148 sequences, composed of all known transmembrane proteins (Möller et al., 2001), was used to validate this program. The data set was run through the public server and through this implementation.
All except 8 sequences showed identical results (95% identical). NB: When the predictions differed, this program found the other prediction as the second best answer.
Of these 8 differences, 4 (COX2_BOVIN, IMMA_CITFR, RCEL_RHOVI, and TCR2_ECOLI) only differed in the exact positions of the helix boundaries. All predicted helices from the two implementations overlapped by at least 16 residues, and the topology predictions were identical.
There were 2 of the 8 proteins (COXH_BOVIN and CYB_RHOSH) where the topology predictions of the two implementations were reversed in addition to minor helix boundary differences. This implementation was correct for COXH_BOVIN and the public server was correct for CYB_RHOSH.
In the final 2 sequences (CITN_KLEPN and CYOB_ECOLI), the two predictions differed in the presence or absence of a given TM helix. This implementation correctly found an additional helix in CITN_KLEPN. For CYOB_ECOLI, the public server correctly found an additional TM helix that this implementation did not find.
In conclusion, the two implementations are scientifically comparable. Half of the differences could be attributed to minor variation in TM helix boundaries, which are not significant differences, due to the inherent uncertainty in experimental determination of the helix boundaries. When the different implementations gave significant differences, there was an even split between which answer was correct.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % transmem [-INfile=]sw:gad_mouse -Default

Prompted Parameters:

[-OUTfile=]gad_mouse.transmem  names the output file
-NBest=1                       Number of different annotations of each sequence
-PROXimity=0                   Proximity of feature boundaries to consider
                               annotations equivalent

Optional Parameters:

-RSF[=transmem.rsf]            save predicted domains as features in an RSF file
-MEthod=Nbest,Viterbi          selects which method to use to generate the
                               prediction. By default, Nbest is selected.
                               (selecting Viterbi suppresses -NBest and
                               -PROXimity)
-RAWProb                       writes out the raw probabilities of each label
                               for each sequence character
-MAXHelix=10                   only show proteins with at most this many
                               transmembrane helices (default is unlimited)
-MINHelix=1                    only show proteins with at least this many
                               transmembrane helices (default is 1)
-MONitor                       displays screen trace of progress
-NOSUMmary                     suppresses screen summary at the end of the
                               program
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Specify how many predictions you want to see. The most likely prediction is the first one listed. Note that larger numbers of predictions can greatly reduce program speed.
When using the N-Best algorithm, merge predictions that have this much overlap or less. This allows you to avoid lists of predictions that are not functionally different.
Use the Viterbi algorithm for predictions instead of the N-Best. This algorithm is faster than the N-Best, but can only generate a single prediction per sequence. If -MEthod=Viterbi is used, the values for -NBest and for -PROXimity are ignored.
Output the raw probabilities for observing each label at each sequence character. These values are based upon the forward-backward algorithm and may not agree with the final predicted label.
Limit the output to include only proteins with this many transmembrane helices or fewer. By default, the maximum number of helices is unlimited.
Limit the output to include only proteins with this many or more transmembrane helices. If this value is greater than the value specified with -MAXHelix, the value for -MAXHelix is used. By default, the output only includes proteins with one or more helices.
-SUMmary writes a summary of the program's work to the screen when you've used -Default to suppress all program interaction. A summary typically displays at the end of a program run interactively. You can suppress the summary for a program run interactively with -NOSUMmary.
You can also use this parameter to cause a summary of the program's work to be written in the log file of a program run in batch.
Technical Support: support-us@accelrys.com
or support-eu@accelrys.com
Copyright (c) 1982-2002 Accelrys Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.