Computing the Grammar of Life by Robert Irion

UCSC Review Summer-Fall 1994

Biologists call upon computers to help manage and analyze reams of information about life's genetic code.

All written things in the English language, from great works of literature to sleazy tabloid articles, are just combinations of the 26 letters, A to Z. In a similar way, all varieties of human life spring from a special "language" in our cells--a genetic code built from only four "letters." Translating this language of life, however, is not as easy as it might seem.

The four letters are the four different chemical building blocks that join together to make strings of DNA and RNA. Biologists label DNA's units with the letters A, C, G, and T, while RNA (because of one chemical change) consists of A's, C's, G's, and U's. Quite literally, the letters of DNA compose an operating manual for life; when read in order, they instruct the RNA and other machinery in the cell how to manufacture the proteins we need to survive.
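For readers who think in code, the one-letter difference between the two alphabets can be shown in a few lines of Python. This is only a sketch of the spelling change (T becomes U); the function name `dna_to_rna` is invented for illustration, and the way a cell actually copies DNA into RNA involves far more machinery than a text substitution.

```python
def dna_to_rna(dna: str) -> str:
    """Respell a DNA string in the RNA alphabet by swapping T for U.

    Hypothetical helper: it models only the change of alphabet,
    not the biochemistry of transcription.
    """
    dna = dna.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not DNA letters: {sorted(unexpected)}")
    return dna.replace("T", "U")


print(dna_to_rna("GACATGTC"))  # GACAUGUC
```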

As recently as a decade ago, it was a chore to determine the sequence of letters for long strands of DNA or RNA. In the late 1970s, for example, biology professor Harry Noller's group at UCSC needed about two years to figure out all 1,542 letters of one RNA molecule and put them in the right order. Now, thanks to machines that zip through sequences automatically, a graduate student could do it in a day. But that sequence is a flyspeck compared to the 3 billion units of DNA in every human cell--our complete genetic code. That information would fill 200 copies of the Manhattan telephone directory. Researchers at the Human Genome Project hope to spell out all 3 billion letters by the year 2005, an ambitious effort that could shed light on thousands of genetically linked diseases.

Such projects crank out far too many letters for biologists to simply read or organize by hand. Instead, the researchers use powerful computer programs. Some programs act as fancy storage and retrieval tools. Others are more elaborate: They rifle through huge files of DNA or RNA letters at the speed of silicon, find bits of scientific interest, and even make predictions about the functions or shapes of the molecules that the letters represent.

"In biology, we have always had in vitro and in vivo experiments," says Saira Mian, a postdoctoral researcher under Noller. "Now, we have in silico."

At UCSC, Professor David Haussler and several faculty and students at the Baskin Center for Computer Engineering and Information Sciences are hip-deep in RNA sequences. The team creates programs to tackle a fascinating problem: Given a long list of letters from an RNA molecule, can one predict how the molecule will fold? That knowledge is vital, because a molecule's shape often dictates its specific task in the cell. Biologists can use lab techniques to unravel some of these folding patterns, but to do so for lots of molecules is impractical. That's where computers come in.

To calculate the twists and turns of RNA letters in the cell, programmers invoke the rules of language itself. "We use actual grammars to describe sequences of RNA," says Haussler, who has written reports on this work with Noller and Mian. "Computer models can find patterns in these sequences, and the patterns obey a number of linguistic rules."

Just as English is not a mishmash of its 26 letters, the 4 letters of DNA and RNA do not line up randomly. Only certain combinations are allowed, because cells must make precisely the right protein "words" with no spelling errors. That "grammar," if defined in a program, helps the computer scour the letters for signposts--such as RNA punctuation marks that tell the ribosome when to start making a protein and when to stop.
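A toy version of that search for punctuation can be sketched in Python. The sketch assumes the standard genetic code, in which AUG marks a start and UAA, UAG, or UGA marks a stop; the article does not name these triplets, and the function `find_reading_frame` is hypothetical. Real signpost-finding programs weigh much more evidence than this.

```python
START = "AUG"                  # standard "start" triplet (assumed, not from the article)
STOPS = {"UAA", "UAG", "UGA"}  # standard "stop" triplets (assumed, not from the article)


def find_reading_frame(rna: str):
    """Return (start, end) of the first AUG...stop stretch, or None.

    The scan steps three letters at a time from each AUG, so the
    stop it finds sits in the same reading frame as the start.
    """
    rna = rna.upper()
    start = rna.find(START)
    while start != -1:
        for i in range(start + 3, len(rna) - 2, 3):
            if rna[i:i + 3] in STOPS:
                return start, i + 3
        # No in-frame stop after this AUG; try the next one.
        start = rna.find(START, start + 1)
    return None


frame = find_reading_frame("GGAUGGCCUAAGG")
print(frame)                       # (2, 11)
print("GGAUGGCCUAAGG"[2:11])       # AUGGCCUAA
```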

The most useful patterns in strings of RNA letters have English relatives: palindromes. Word lovers know that a palindrome reads the same forward and backward, such as A MAN, A PLAN, A CANAL: PANAMA. An RNA palindrome is harder to recognize. It must follow this biological rule: When a piece of RNA folds, A's almost always hook up with U's, while C's pair with G's. Thus, GACA-UGUC is an RNA palindrome, because the letters in one direction complement those in the other. It can fold in half to make four pairs, like this: (graphic available from UC Santa Cruz's Public Information Office.)
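The pairing rule is simple enough to check by machine. The Python sketch below, with invented helper names, tests whether a string of RNA letters reads as the complement of itself backward, and lists the pairs that form when it folds in half. It assumes only the two pairings named above (A with U, C with G).

```python
PAIR = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(rna: str) -> str:
    """Read the letters backward, swapping each for its partner."""
    return "".join(PAIR[base] for base in reversed(rna))


def is_rna_palindrome(rna: str) -> bool:
    """True if the string equals its own reverse complement."""
    rna = rna.upper().replace("-", "")
    return rna == reverse_complement(rna)


def folded_pairs(rna: str):
    """Pair the first letter with the last, the second with the next-to-last, and so on."""
    rna = rna.upper().replace("-", "")
    n = len(rna)
    return [(rna[i], rna[n - 1 - i]) for i in range(n // 2)]


print(is_rna_palindrome("GACA-UGUC"))  # True
print(folded_pairs("GACA-UGUC"))       # [('G', 'C'), ('A', 'U'), ('C', 'G'), ('A', 'U')]
```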

A program that spots such palindromes could predict which parts of the molecule will fold up. The real task is a bit tougher than that, since RNA pairs do not always follow the simple rules and palindromes can nest within each other in tangled ways. Even so, Haussler's team has used programs to predict the correct two-dimensional folding patterns for bits of RNA about 100 letters long. For bigger strands, such as Noller's 1,542-letter monster, it still takes too much computer time. Haussler believes that faster computers of the future, with more sophisticated programs, may predict the full 3-D shapes of RNA, such as this small molecule: (graphic available from UC Santa Cruz's Public Information Office.)
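To give a feel for why nesting makes the job expensive, here is a sketch of a classic textbook approach, often credited to Nussinov, that counts the largest number of nested pairs a string of letters can form. It is not the grammar-based method Haussler's team uses; it is a simpler stand-in chosen for illustration. The function name and the `min_loop` parameter are assumptions, and the default of zero is chosen only so the GACA-UGUC example folds completely (real hairpins usually leave a few letters unpaired at the bend).

```python
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def max_nested_pairs(rna: str, min_loop: int = 0) -> int:
    """Largest number of nested A-U / C-G pairs the string can form.

    best[i][j] holds the answer for the stretch rna[i..j], built up
    from short stretches to long ones. min_loop is the number of
    letters that must sit between two paired letters.
    """
    rna = rna.upper().replace("-", "")
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # option 1: letter i stays unpaired
            for k in range(i + 1 + min_loop, j + 1):
                if (rna[i], rna[k]) in PAIRS:  # option 2: i pairs with k
                    inside = best[i + 1][k - 1] if k - 1 >= i + 1 else 0
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


print(max_nested_pairs("GACA-UGUC"))  # 4
```

The three nested loops mean the work grows roughly with the cube of the sequence length, which hints at why a 1,542-letter molecule is so much harder than a 100-letter one.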

No matter how good they become, Haussler notes, such programs will never measure up to billions of years of evolutionary fine-tuning. "The delicacy and complexity of what goes on in the cell every second is staggering," he says. "It makes our computer approaches look primitive."

--Robert Irion