|
April 21, 2003
Bioinformatics experts gain ground in protein
sequence analysis
By Tim Stephens
Proteins, with their extraordinary diversity of structure
and function,
pose some of the toughest problems in the field of
bioinformatics, giving
rise to a growing arsenal of computational tools for
protein analysis.
|

|
| Most proteins have highly
complex three-dimensional
shapes. This is an image of human growth hormone.
Image:
Swiss Institute of Bioinformatics |
An array of computer-based strategies is now available to
help molecular
biologists who have found an unknown protein, determined
its sequence
of amino acid subunits, and want to know its
three-dimensional structure
and biological function.
Computational techniques alone may not provide all the answers, but
they are powerful enough to have earned a place in the
standard toolkit
for protein research. The Sequence Alignment and Modeling
System (SAM),
introduced in the early 1990s by UCSC researchers, has become one of
the most popular software packages for the analysis of
protein sequences.
SAM now faces stiff competition, but UCSC researchers keep improving
the software and are working on other software programs to complement
it. Both academic researchers and commercial companies are among the
users of the SAM software.
"We have licensed the SAM software to more than 200
academic research
groups and about 20 commercial companies. We also have a web server
that sees over 1,000 uses per week for protein structure
prediction,"
said Richard Hughey, professor and chair of computer engineering at
UCSC.
The list of companies that have licenced the SAM software from UCSC
reads like a Who's Who of the biotechnology industry:
Affymetrix, Celera,
Genentech, Novartis, Pfizer, and Pharmacia, among others.
While commercial
companies must pay a fee to use the software (as much as $125,000),
academic licenses are free, Hughey said.
Proteins carry out most of the crucial functions of living
cells. They
are typically large molecules with very complex shapes.
Their structural
and functional diversity surpasses that of any other kind
of molecule.
Enzymes, antibodies, hormones, muscle, tendons, cartilage, hair, and
feathers are all made of proteins.
At the simplest level, proteins are long chains of subunits called
amino acids. There are 20 different amino acids, and their sequence
in the linear chain of a protein molecule ultimately determines its
structure. Sections of the molecule may twist into coils or fold into
sheets, and the entire protein folds into a precise and often highly
complex three-dimensional structure.
Software programs such as SAM take advantage of the
structural similarities
of related proteins and the existence of large databases of
information
on known proteins. Proteins that share a common ancestor
have many similarities
in their amino acid sequences. These similarities make it possible to
create statistical models of families of related proteins. A software
program can compare an unknown protein's sequence with such
statistical
models and may be able to predict the protein's structure
based on its
similarity to known proteins.
SAM uses a statistical technique known as Hidden Markov
Models (HMMs),
first introduced to the field of bioinformatics by David
Haussler, holder
of the UC Presidential Chair in computer science and
director of UCSC's
Center for Biomolecular Science and Engineering. The SAM software was
initially developed by Haussler, postdoctoral researcher
Anders Krogh,
now at the University of Copenhagen, and others. Haussler
later focused
on DNA sequence analysis, and further development of the SAM software
was taken over by Hughey and Kevin Karplus, professor of
computer engineering.
SAM has a history of success in an unusual series of group
experiments
performed every two years to establish the state of the art
in protein
structure prediction. The Fifth
Community Wide Experiment on the Critical Assessment of
Techniques for
Protein Structure Prediction (CASP5) concluded in December 2002.
The top performers in one category of the CASP5 experiment
were "metaservers"
that combined several different servers, including SAM, and
looked for
agreement between different methods, Karplus said. "The success
of the metaservers was somewhat unexpected--these automatic methods
outperformed most human predictors," he said.
UCSC entered two versions of SAM (SAM-T99 and a newer
version, SAM-T02)
in CASP5, as well as a new program Karplus is developing
called Undertaker.
Undertaker is designed to predict protein folding based on
the tendency
for parts of a protein molecule that are
hydrophobic--literally "water-fearing"--to
be buried inside the structure where they won't come in contact with
water.
"The burial of hydrophobic residues is one of the main driving
forces in protein folding, and Undertaker is an attempt to use that
to predict new folds," Karplus said.
He found that the combination of Undertaker with SAM did not perform
as well as SAM alone on the easier targets, where there was
a good alignment
of the unknown sequence with a known template. The combined programs
did surprisingly well, however, on some of the hardest
problems, Karplus
said.
"Where our methods had started failing in the past,
that's where
we started succeeding," he said. "We still have a
lot of work
to do, but I think we can improve even more over the next
year."
Return to
Front Page
|