17 Mar 2003
Biol 6312
Prediction of Protein Structure
1. Overview
Key References:
Protein structure prediction in the postgenomic era
[Review article]
David T Jones
Current Opinion in Structural Biology 2000, 10:371-379.
Schonbrun J, Wedemeyer WJ, Baker D.
Protein structure prediction in 2002.
Curr Opin Struct Biol. 2002 Jun;12(3):348-54.
There continues to be a large gap between the number of proteins of known amino acid sequence and the number of proteins of known 3-d structure. This gap may never be eliminated.
But protein structure is essential for understanding the function of the protein.
Can the prediction of protein structure from sequence be improved enough to eliminate the need to crystallize each protein?
Probably not in the near future, but predictions can generate useful and generally reliable information.
There are 3 levels of analysis in the overall prediction scheme:
- Motif recognition in the primary sequence
- Secondary structure prediction
- Tertiary structure/fold prediction
A starting point is often to search for proteins with sequences that are similar to the protein under study. This usually involves a BLAST search:
BLAST (Basic Local Alignment Search Tool)
Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25: 3389-3402.
From this analysis one might learn about the function of similar proteins, and whether a 3-d structure exists for any of them. Depending on the degree of amino acid identity of the closest "neighbors", one might surmise the function of the protein.
Protein Motif Recognition
Prosite is a database of sequence motifs for functions such as:
post-translational modifications of N- or C-termini, signal sequences for localization, sites of lipid attachment, sites of phosphorylation, or markers of particular types of enzymes
Example: phosphorylase kinase KRKQISVR Reference: Kemp & Pearson (1990) TiBS 15:342-346
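Motif scanning of this kind can be sketched in a few lines of Python. The sketch below assumes the common PROSITE pattern notation (elements joined by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats) and translates it into a regular expression. The function names `prosite_to_regex` and `scan`, and the test sequences, are hypothetical. The pattern N-{P}-[ST]-{P} is the familiar N-glycosylation site pattern and is used only as an illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Optional anchors: '<' marks the N-terminus, '>' the C-terminus.
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Split off a repeat count such as x(2) or x(2,4).
        repeat = ""
        if "(" in element:
            element, count = element.rstrip(")").split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                           # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"    # any residue except these
        else:
            core = element                       # one residue or an [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def scan(sequence, pattern):
    """Return (1-based start, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Hypothetical test sequences, for illustration only.
print(scan("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))      # [(3, 'NGSA')]
print(scan("AAKRKQISVRGG", "K-R-K-Q-I-S-V-R"))    # [(3, 'KRKQISVR')]
```

The second N in the first sequence is followed by Pro, so it is correctly rejected. Short motifs like these match by chance fairly often, so a hit is a hint about function rather than proof.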
PROSITE Website: a protein sequence can be scanned against the PROSITE database there.
There are other databases of sequence motifs. Their websites can generate alignments of the sequences that share a motif.
PRINTS a Protein Fingerprint database
Prediction of Secondary Structure
In some cases this is preliminary to prediction of tertiary structure.
a) Statistical methods, first developed by Peter Chou and Gerald Fasman
Based on the tendencies of particular amino acids to be found in the different types of secondary structure. They are considered to be about 60% accurate.
Server for Chou-Fasman Prediction
Original references are too early for on-line abstracts etc.
Chou PY, Fasman GD.
Empirical predictions of protein conformation.
Annu Rev Biochem. 1978;47:251-76. Review.
Chou PY, Fasman GD.
Prediction of protein conformation.
Biochemistry. 1974 Jan 15;13(2):222-45.
Chou PY, Fasman GD.
Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins.
Biochemistry. 1974 Jan 15;13(2):211-22.
Garnier (GOR1) Predictor of secondary structure
Garnier J, Osguthorpe DJ, Robson B.
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins.
J Mol Biol. 1978 Mar 25;120(1):97-120.
Garnier J, Gibrat JF, Robson B.
GOR method for predicting protein secondary structure from amino acid sequence.
Methods Enzymol. 1996;266:540-53.
Combination approach:
Kloczkowski A, Ting KL, Jernigan RL, Garnier J.
Combining the GOR V algorithm with evolutionary information for protein secondary structure prediction from amino acid sequence.
Proteins. 2002 Nov 1;49(2):154-66.
b) Neural net predictions of Secondary Structure:
They use a training set of known proteins, and usually utilize multiple sequence alignments during the prediction. They are considered to be 70-80% accurate.
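As a small illustration of what these accuracy figures mean, the sketch below assumes that accuracy is the percentage of residues whose predicted state (H = helix, E = sheet, C = coil/loop) agrees with the state observed in the known structure, the measure usually called Q3. The function name and the two example strings are hypothetical.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("prediction and observation must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Hypothetical 10-residue example: 8 of the 10 states agree.
predicted = "HHHHCCEEEE"
observed  = "HHHCCCEEEC"
print(three_state_accuracy(predicted, observed))   # 80.0
```

Because there are only three states, and coil is the most common, even a naive prediction scores well above zero, which is worth keeping in mind when comparing 60% with 70-80%.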
PredictProtein (Now at Columbia U.)
Rost B, Schneider R, Sander C.
Progress in protein structure prediction?
Trends Biochem Sci. 1993 Apr;18(4):120-3.
PsiPred David T. Jones
Jones DT:
Protein secondary structure prediction based on position-specific scoring matrices.
J Mol Biol 1999, 292: 195-202.
Membrane-spanning segments of integral membrane proteins can be predicted in similar ways. We will discuss this later.
Prediction of Tertiary Structure
a) ab initio methods are those based on physical principles, e.g. energetics. They assume that a protein will fold to the lowest available, or accessible, free energy state.
b) modeling by homology to a similar protein of known structure.
2. Prediction of Secondary Structure
A. Statistical methods for the prediction of Secondary Structure
Chou-Fasman Method
The tendencies of each type of amino acid to be found in each of the 3 types of secondary structure (helix, sheet, loop) are calculated from a database of high-resolution structures (2 Å):
Original database (1974) contained 15 proteins (2473 amino acids)
Revised database (1978) contained 29 proteins (4741 amino acids)
Simply increasing the database did not increase the accuracy of the predictions.
$P_{ij} = \dfrac{n_{ij} / n_i}{N_j / N}$ is the propensity for amino acid i to be found in secondary structure j (for example, for Ala to be found in an alpha-helix),
where the i's correspond to the amino acid type (20) and the j's are the secondary structures (3).
e.g., for i = Ala and j = alpha-helix:
$n_{ij}$ = the number of Ala residues found in alpha-helix (in the database)
$n_i$ = the number of Ala residues (in the database)
$N_j$ = the number of amino acids found in alpha-helix (in the database)
$N$ = the number of amino acid residues in the database
So, the top of the propensity expression represents the fraction of all Ala residues that are alpha-helical,
while the bottom represents the fraction of ALL residues that are alpha-helical. Therefore, this ratio represents the tendency of each amino acid relative to the average amino acid.
A propensity of >1 indicates more likely than chance, while <1 indicates less likely than chance. A value of 1.0 indicates no prediction.
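The propensity ratio can be computed directly from a database of residues annotated with their secondary structure. The following is a minimal sketch, assuming the database is given as two strings of equal length (one-letter residue codes and one-letter states). The function name and the 8-residue toy "database" are hypothetical and far too small to be meaningful.

```python
from collections import Counter

def propensities(residues, states):
    """Return {(amino acid, state): propensity} for every observed pair.

    propensity = (fraction of that amino acid found in the state)
               / (fraction of all residues found in the state)
    """
    if len(residues) != len(states):
        raise ValueError("residues and states must have the same length")
    pair_counts = Counter(zip(residues, states))   # e.g. Ala residues in helix
    residue_counts = Counter(residues)             # e.g. all Ala residues
    state_counts = Counter(states)                 # e.g. all residues in helix
    total = len(residues)                          # all residues in the database
    return {
        (aa, ss): (n / residue_counts[aa]) / (state_counts[ss] / total)
        for (aa, ss), n in pair_counts.items()
    }

# Toy data: 4 Ala (3 helical) and 4 Gly (1 helical); half of all residues are helical.
table = propensities("AAAAGGGG", "HHHCHCCC")
print(table[("A", "H")])   # 1.5  (0.75 / 0.5: more likely than chance)
print(table[("G", "H")])   # 0.5  (0.25 / 0.5: less likely than chance)
```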
Since secondary structure is usually formed by several consecutive residues, it is more meaningful to take running averages of 5 or more amino acids at a time. This is called the window. A prediction will tend to be most accurate when the window matches the size of the actual segment of secondary structure.
The entire length of the protein is analyzed for each set of secondary structure propensities (helix, sheet, turn). The final prediction is made by comparing the 3 sets of values.
These calculations can be done in a spreadsheet using a lookup function (for example, "LOOKUP" in Excel). This will be demonstrated later for transmembrane spans.
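The windowed calculation can also be sketched in Python instead of a spreadsheet. This is a simplified illustration, not the full Chou-Fasman rule set: it assumes a centered running mean (truncated at the chain ends) and assigns each residue the state with the highest averaged propensity. The propensity values in `TOY_TABLES` are made-up placeholders for three residue types, not the published parameters.

```python
def window_average(values, window=5):
    """Centered running mean; windows are truncated at the ends of the chain."""
    half = window // 2
    averaged = []
    for i in range(len(values)):
        segment = values[max(0, i - half): i + half + 1]
        averaged.append(sum(segment) / len(segment))
    return averaged

def predict(sequence, tables, window=5):
    """Assign each residue the state whose windowed propensity is highest."""
    profiles = {
        state: window_average([table[aa] for aa in sequence], window)
        for state, table in tables.items()
    }
    return "".join(
        max(profiles, key=lambda state: profiles[state][i])
        for i in range(len(sequence))
    )

# Placeholder propensities (H = helix, E = sheet, T = turn); illustrative only.
TOY_TABLES = {
    "H": {"A": 1.4, "V": 1.0, "G": 0.6},
    "E": {"A": 0.8, "V": 1.7, "G": 0.8},
    "T": {"A": 0.7, "V": 0.5, "G": 1.8},
}

print(predict("AAAAAVVVVVGGGGG", TOY_TABLES))   # HHHHHEEEEETTTTT
```

Changing `window` shows the effect described above: a window much longer than the real segment smears neighbouring segments together, while a very short one gives a noisy profile.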
Special applications:
1) Regions that are highly likely to be alpha helical or beta-sheet are candidates for conformational changes.
2) The effects of mutations can be predicted by changing the sequence.
3) Turns can be predicted (with fair accuracy) on a residue basis, because they are not extended structures.
Limitations: Accuracy of this method seems to be limited because of the limited range of propensities. Most types of amino acids can be found often in any secondary structure. Few are really excluded from certain secondary structures.
Extension of this approach: position-specific propensities. This works well with turns or with the termini of helices/sheets.
Protein Sci 1994 Dec;3(12):2207-16
A revised set of potentials for beta-turn formation in proteins.
Hutchinson EG, Thornton JM
B. Neural net predictions
Turns can be predicted this way:
Protein Sci 1999 May;8(5):1045-55
Prediction of the location and type of beta-turns in proteins using neural networks.
Shepherd AJ, Gorse D, Thornton JM
In general these predictions are obtained from servers through the world wide web.
PredictProtein from EMBL or Columbia University
PredictProtein (Now at Columbia U.)
Rost B, Schneider R, Sander C.
Progress in protein structure prediction?
Trends Biochem Sci. 1993 Apr;18(4):120-3.
Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O.
Prediction of protein secondary structure at 80% accuracy.
Proteins. 2000 Oct 1;41(1):17-20.
3. Predictions of Tertiary Structure
These sites offer www access to predictions of tertiary structure:
3D-PSSM Imperial Cancer Research Fund, London
Fugue University of Cambridge
SAM-T02 UC Santa Cruz
CAFASP: Critical Assessment of Fully Automated Structure Prediction
Meta Prediction Server
CASP4 analysis:
Samudrala R, Levitt M.
A comprehensive analysis of 40 blind protein structure predictions.
BMC Struct Biol. 2002 Aug 1;2(1):3.
Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D:
A combined algorithm for genome-wide prediction of protein function.
Nature 1999, 402: 83-86.
Bowie JU, Lüthy R, Eisenberg D:
A method to identify protein sequences that fold into a known three-dimensional structure.
Science 1991, 253: 164-170.
Jones DT:
GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences.
J Mol Biol 1999, 287: 797-815.
Moult J, Hubbard T, Fidelis K, Pedersen JT:
Critical assessment of methods of protein structure prediction (CASP): round III.
Proteins 1999, S3: 2-6.
Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus KJ, Kelley LA, MacCallum RM, Pawlowski K et al.:
CAFASP-1: critical assessment of fully automated structure prediction methods.
Proteins 1999, S3: 209-217.
In general there are two approaches:
a) Try to model an amino acid sequence by homology or by compatibility to known structures
Identification of topological fold is often the goal.
b) Try to fold an amino acid sequence based on physical principles
A) Modelling Approach
Look for sequences that have >30% sequence identity with a protein of known structure
(Sequences of 15-30% identity can be attempted)
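The identity thresholds above can be applied to a pairwise alignment as follows. This is a minimal sketch that assumes percent identity is counted over aligned positions where neither sequence has a gap (other conventions use a different denominator). The function names, the returned messages, and the two aligned sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over the aligned positions without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                     # skip positions with a gap
        compared += 1
        identical += (a == b)
    return 100.0 * identical / compared if compared else 0.0

def modelling_outlook(identity):
    """Rule of thumb from the text: >30% is the target, 15-30% can be attempted."""
    if identity > 30:
        return "good candidate for modelling by homology"
    if identity >= 15:
        return "modelling can be attempted"
    return "no usable homolog; consider threading or ab initio methods"

# Hypothetical alignment: 9 positions compared, 7 identical.
identity = percent_identity("MKT-AYIAKQR", "MKSLAYLA-QR")
print(round(identity, 1), modelling_outlook(identity))
```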
Basic principles:
1) Buried amino acid residues are hydrophobic
2) Surface amino acid residues are polar
3) Within a family of homologous proteins, buried and active site residues are conserved.
4) Within a family of homologous proteins, surface residues are variable.
5) Elements of secondary structure will be more highly conserved than amino acid sequence.
3 steps in the procedure:
1) Sequence alignment
2) Build sequence into secondary structure
3) Energy minimize to improve tertiary structure
If no homologous protein can be identified by sequence comparisons, the compatibility of the a.a. sequence of the target can be determined for representations of all known folds (templates).
This is called threading.
Example:
J Mol Biol 1997 Apr 11;267(4):1026-38
A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence.
Rice DW, Eisenberg D
They have derived a scoring matrix from a database of 119 pairs of proteins of known structure with the same fold, but with <30% sequence identity.
The matrix has 882 elements = 7 x 3 x 2 x 7 x 3. For each position in the known structure:
7 classes of amino acids (Cys; Trp; Arg,Lys; Tyr,Phe; Ile,Leu,Val,Met; Ala,Gly,Ser,Thr,Pro; Asp,Glu,Asn,Gln,His)
3 types of secondary structure (helix, sheet, turn)
2 locations (buried, exposed)
and for each position in the target sequence: 7 classes of a.a. and 3 types of secondary structure (predicted by PredictProtein)
First: obtain a secondary structure prediction from PredictProtein.
Second: calculate a score for each of the 119 folds.
Example:
The highest score is for a Trp, predicted to be in a helix, that matches a buried Trp in a helix: score = 4.5
A basic residue predicted to be in a sheet that matches an exposed basic residue in a sheet: score = 2.3
The same basic residue that matches an exposed basic residue in a helix would score -9 (lowest score)
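The layout of such a scoring matrix can be sketched as a lookup table keyed by (template residue class, template secondary structure, template location, target residue class, predicted target secondary structure). The sketch below only checks the element count and looks up the three scores quoted above. All other entries are missing here and default to 0.0, which is an assumption for illustration and not part of the published matrix. The names and the two-residue "alignment" are hypothetical.

```python
from itertools import product

RESIDUE_CLASSES = ["C", "W", "RK", "YF", "ILVM", "AGSTP", "DENQH"]
STRUCTURES = ["helix", "sheet", "turn"]
LOCATIONS = ["buried", "exposed"]

# Every possible key: 7 x 3 x 2 (template) x 7 x 3 (target) = 882 elements.
ALL_KEYS = list(product(RESIDUE_CLASSES, STRUCTURES, LOCATIONS,
                        RESIDUE_CLASSES, STRUCTURES))

# Only the three scores quoted in the text; the rest of the matrix is not shown.
EXAMPLE_SCORES = {
    ("W", "helix", "buried", "W", "helix"): 4.5,
    ("RK", "sheet", "exposed", "RK", "sheet"): 2.3,
    ("RK", "helix", "exposed", "RK", "sheet"): -9.0,
}

def residue_class(aa):
    """Map a one-letter amino acid code to one of the 7 classes."""
    for cls in RESIDUE_CLASSES:
        if aa in cls:
            return cls
    raise ValueError("unknown amino acid: " + aa)

def thread_score(template, target, scores, default=0.0):
    """Sum the matrix scores over an ungapped template/target alignment.

    template: list of (amino acid, secondary structure, location)
    target:   list of (amino acid, predicted secondary structure)
    """
    if len(template) != len(target):
        raise ValueError("template and target must have the same length")
    total = 0.0
    for (t_aa, t_ss, t_loc), (q_aa, q_ss) in zip(template, target):
        key = (residue_class(t_aa), t_ss, t_loc, residue_class(q_aa), q_ss)
        total += scores.get(key, default)
    return total

print(len(ALL_KEYS))   # 882
template = [("W", "helix", "buried"), ("K", "sheet", "exposed")]
target = [("W", "helix"), ("R", "sheet")]
print(round(thread_score(template, target, EXAMPLE_SCORES), 1))   # 6.8
```

A real fold-recognition run would also allow gaps in the alignment and would repeat the scoring against every template fold, keeping the best-scoring one.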
Swiss-Model: The ExPASy server will build a model of a protein that is at least 30% identical to a protein in the Protein Data Bank. Click the "First Approach Mode" at the left.
B) Ab initio approach (physical principles)
This approach can work even in the absence of homology to known structures, but overall the reliability is low.
LINUS is one example: Local Independently Nucleated Units of Structure
50 amino acids are folded at a time, in an overlapping fashion: 1-50, 26-75, ...
It is based on the idea that actual proteins fold by forming local secondary structure first.
Side chains are simplified. Only 3 interactions are used:
1 repulsive: steric
2 attractive: H-bonds and hydrophobic
The possible conformations are then evaluated in a search for the lowest free energy.
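The overlapping segments described above (1-50, 26-75, ...) can be enumerated with a short helper. This sketch only illustrates the windowing scheme, not LINUS itself. The step of 25 residues is inferred from the two segments given in the text, and the function name and the truncation of the last segment at the chain end are assumptions.

```python
def overlapping_segments(length, size=50, step=25):
    """Return 1-based (start, end) residue ranges covering a chain of given length."""
    if length < 1:
        raise ValueError("length must be at least 1")
    segments = []
    start = 1
    while True:
        end = min(start + size - 1, length)   # last segment is cut at the C-terminus
        segments.append((start, end))
        if end == length:
            break
        start += step
    return segments

print(overlapping_segments(120))
# [(1, 50), (26, 75), (51, 100), (76, 120)]
```

With this scheme every interior residue falls in two segments, so each local region is folded once near the start of a segment and once near the end.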
Proteins 1995 Jun;22(2):81-99
LINUS: a hierarchic procedure to predict the fold of a protein.
Srinivasan R, Rose GD
Proc. Natl. Acad. Sci. USA Vol. 96, Issue 25, 14258-14263, December 7, 1999
A physical basis for protein secondary structure
Rajgopal Srinivasan and George D. Rose
Srinivasan R, Rose GD.
Ab initio prediction of protein structure using LINUS.
Proteins. 2002 Jun 1;47(4):489-95.
Review:
Hardin C, Pogorelov TV, Luthey-Schulten Z.
Ab initio protein structure prediction.
Curr Opin Struct Biol. 2002 Apr;12(2):176-81. Review.
De novo prediction: Rosetta
Bonneau R, Strauss CE, Rohl CA, Chivian D, Bradley P, Malmstrom L, Robertson T, Baker D.
De novo prediction of three-dimensional structures for major protein families.
J Mol Biol. 2002 Sep 6;322(1):65-78.
Fully automated 3-d structure prediction:
Bystroff C, Shao Y.
Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA.
Bioinformatics. 2002 Jul;18 Suppl 1:S54-61
Server prediction:
Comments/questions: svik@mail.smu.edu
Copyright 2003, Steven B. Vik, Southern Methodist University