Harvard researcher develops program to read any genome sequence, decipher its genetic code

Yekaterina “Kate” Shulgina was a first-year student in the Graduate School of Arts and Sciences, looking for a short computational biology project so she could check the requirement of her program in systems biology. She wondered how genetic code, once thought to be universal, could evolve and change.

That was 2016 and today Shulgina has come out the other end of that short-term project with a way to decipher this genetic mystery. She describes it in a new paper in the journal eLife with Harvard biologist Sean Eddy.

The report details a new computer program that can read the genome sequence of an organism and then determine its genetic code. The program, called Codetta, has the potential to help scientists expand their understanding of how the genetic code evolves and correctly interpret the genetic code of newly sequenced organisms.

“This is of itself is a very fundamental biology question,” said Shulgina, who does her graduate research in Eddy’s Lab.

The genetic code is the set of rules that tells the cells how to interpret the three-letter combinations of nucleotides into proteins, often referred to as the building blocks of life. Almost every organism, from E. coli to humans, uses the same genetic code. It’s why the code was once thought to be set in stone. But scientists have discovered a handful of outliers — organisms that use alternative genetic codes – exist where the set of instructions are different.

This is where Codetta can shine. The program can help to identify more organisms that use these alternative genetic codes, helping shed new light on how genetic codes can even change in the first place.

“Understanding how this happened would help us reconcile why we originally thought this was impossible… and how these really fundamental processes actually work,” Shulgina said.

Already, Codetta has analyzed the genome sequences of over 250,000 bacteria and other single-celled organisms called archaea for alternative genetic codes and has identified five that have never been seen. In all five cases, the code for the amino acid arginine was reassigned to a different amino acid. It’s believed to mark the first time scientists have seen this swap in bacteria and could hint at evolutionary forces that go into altering the genetic code.

The researchers say the study marks the largest screening for alternative genetic codes. Codetta essentially analyzed every genome that’s available for bacteria and archaea. The name of the program is a cross between the codons, the sequence of three nucleotides that form pieces of the genetic code, and the Rosetta Stone, a slab of rock inscribed with three languages.

The work marks a capstone moment for Shulgina, who spent the past five years developing the statistical theory behind Codetta, writing the program, testing it, and then analyzing the genomes. It works by reading the genome of an organism and then tapping into a database of known proteins to produce a likely genetic code. It differs from other similar methods because of the scale at which it can analyze genomes.

Shulgina joined Eddy’s lab, which specializes in comparing genomes, in 2016 after coming to him for advice on the algorithm she designed to interpret genetic codes.

Until now, no one has done such a broad survey for alternative genetic codes.

“It was great to see new codes because for all we knew, Kate would do all this work and there wouldn’t turn out to be any new ones to find,” said Eddy, who’s also a Howard Hughes Medical Investigator. He also noted the potential of the system to be used to ensure the accuracy of the many databases that house protein sequences.

“Many protein sequences in the databases these days are only conceptual translations of genomic DNA sequences,” Eddy said. “People mine these protein sequences for all sorts of useful stuff, like new enzymes or new gene-editing tools and whatnot. You’d like for those protein sequences to be accurate, but if the organism is using a nonstandard code, they’ll be erroneously translated.”

The researchers say the next step of the work is to use Codetta to search for alternative codes in viruses, eukaryotes, and organellar genomes like mitochondria and chloroplasts.

“There’s still a lot of diversity of life where we haven’t done this systematic screening yet,” Shulgina said.

Harvard researcher develops program to read any genome sequence, decipher its genetic code

Supercomputers power evolutionary insight: From house sparrows to conservation strategies

Supercomputing advances the quest to resolve the Hubble tension in cosmology

Supercomputing’s next frontier: NVIDIA, CoreWeave unite to build the AI factories of tomorrow

Supercomputing drives materials breakthrough for green computing: 3D graphene-like electronic behavior unlocks new low-energy electronics

Supercomputing reveals hidden galactic architecture around the Milky Way

ML, supercomputing unite to revolutionize high-power laser optics

Supercomputers unravel the mystery of missing Tatooine-like planets

Supercomputers illuminate deep Earth: How giant 'blobs' shape our magnetic shield

Glasgow sets its sights on 'cognitive' cities, where urban systems learn, predict, adapt

How big can a planet be? Supercomputing unlocks the secrets of giant worlds

UVA unveils the power of AI in accelerating new treatment discoveries

New study tracks pollution worldwide

AI meets DNA: Scientists create custom gene editors with machine learning

Darkening oceans: New study reveals alarming decline in marine light zones

At SC25, Phison pushes AI storage to Gen5 speeds, brings AI agents to everyday laptops

Supercomputers unlock the chemistry of gecko binding: Vienna team breaks new ground in modeling large molecules

WSU study pinpoints molecular weak spot in virus entry; supercomputing helps reveal the hidden dance

Harnessing the fury of plasma turbulence: Supercomputer simulations illuminate fusion’s next frontier

Big numbers, big bets: Dell scales up HPC for the AI era

HMCI, Rapt.ai deploy NVIDIA GB10 systems to power Rancho Cordova’s new AI & Robotics Ecosystem

Largest Computational Biology Simulation Mimics The Ribosome

Silicon 'neurons' may add a new dimension to chips

Linux Networx Accelerators Expected to Drive up to 4x Price/Performance

Complex Concepts That Really Add Up

Blue Sky Studios Donates Animation SuperComputer to Wesleyan

Humanities, HPC connect at NERSC

TeraGrid ’09 'Call for Participation'

Turbulence responsible for black holes' balancing act

Cray Wins $52 Million SuperComputer Contract

SDSC Researchers Accurately Predict Protein Docking

EMAIL NEWSLETTER SUBSCRIPTION

Harvard researcher develops program to read any genome sequence, decipher its genetic code