New computational methods allow for accurate determination of gene expression

Sarah Small
March 24, 2021

UNIVERSITY PARK, Pa. — A more accurate measurement and interpretation of gene activities, using large volumes of sequencing data, may be possible with a new computational framework and set of algorithms currently being developed by Penn State researchers.  A five-year, $1.85 million grant from the National Institutes of Health is funding the research led by Mingfu Shao, the Charles K. Etner Early Career Assistant Professor in the School of Electrical Engineering and Computer Science.

To understand how the machinery of a cell works, researchers frequently use RNA-sequencing (RNA-seq), which captures and measures the messenger RNA molecules (mRNAs), also called transcripts, in cells. Because mRNAs copy and carry the genetic information of genes, measuring mRNAs is an efficient and accurate way to quantitatively measure gene activity. As such, researchers often use RNA-seq to study gene functions and cell machinery.

However, RNA-seq can only gather fragments of mRNAs, rather than full-length molecules. 

“We need to computationally reconstruct the full-length sequences of the mRNAs from those short fragments,” said Shao. “This is called assembly. A challenge, while also an opportunity, is that currently hundreds of thousands of RNA-seq samples have been stored in various repositories. Can we assemble them together? This is called meta-assembly. It means we want to assemble many, many samples together, instead of only one. We try to make use of the shared information across all those samples to improve the assembly accuracy.”

While accurate meta-assembly has not yet been achieved with current computational models, Shao has been developing a new framework to allow for many samples to be assembled at once, which would give researchers a clear understanding of the entire story told by the RNA-sequencing data. 

The framework starts with multiple individual RNA-seq samples, each organized into a graph structure called a splice graph. Several of these splice graphs are then combined, and the combination yields another source of information called phasing paths.

“These phasing paths are very helpful in capturing the critical splicing information in individual samples,” Shao said. “After merging the graph and generating the phasing path, we then decompose this combined graph into a set of paths. And each path will represent a predicted mRNA. This is the novel framework.”
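The merge-then-decompose idea Shao describes can be illustrated with a toy sketch. This is not the group's algorithm: it leaves out phasing paths entirely, and it makes several simplifying assumptions for illustration only. Each splice graph is assumed to be a dictionary mapping an edge `(u, v)` to a read-coverage weight, nodes are assumed to be integers ordered along the genome so that every edge has `u < v`, and the decomposition is a simple greedy "widest path" heuristic. The function names are hypothetical.

```python
from collections import defaultdict


def merge_splice_graphs(graphs):
    """Combine per-sample graphs by summing coverage on shared edges.

    Each graph maps an edge (u, v) to a weight (assumed representation).
    """
    merged = defaultdict(int)
    for graph in graphs:
        for edge, weight in graph.items():
            merged[edge] += weight
    return dict(merged)


def widest_path(edges, source, sink):
    """Find the source-to-sink path whose smallest edge weight is largest.

    Assumes u < v for every edge, so sorting edges by (u, v) visits
    nodes in topological order. Returns (0, []) if no path remains.
    """
    best = {source: (float("inf"), None)}  # node -> (bottleneck, predecessor)
    for (u, v), weight in sorted(edges.items()):
        if weight <= 0 or u not in best:
            continue
        candidate = min(best[u][0], weight)
        if v not in best or candidate > best[v][0]:
            best[v] = (candidate, u)
    if sink not in best:
        return 0, []
    path, node = [], sink
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[sink][0], path[::-1]


def decompose(edges, source, sink):
    """Greedily peel off widest paths; each one is a predicted transcript."""
    remaining = dict(edges)
    paths = []
    while True:
        width, path = widest_path(remaining, source, sink)
        if width <= 0:
            break
        for u, v in zip(path, path[1:]):
            remaining[(u, v)] -= width  # use up the coverage this path explains
        paths.append((path, width))
    return paths


# Two made-up samples over the same gene: nodes 0..3 stand for exons.
sample_a = {(0, 1): 5, (1, 2): 5, (2, 3): 5}  # keeps exon 2
sample_b = {(0, 1): 3, (1, 3): 3}             # skips exon 2
merged = merge_splice_graphs([sample_a, sample_b])
transcripts = decompose(merged, source=0, sink=3)
```

In this made-up example, the merged graph contains both the exon-including and the exon-skipping route, and the decomposition recovers them as two separate paths with their respective coverage. Real data are far noisier, which is where extra evidence such as the phasing paths mentioned above would come in.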

According to Shao, a more complete, accurate and data-driven reconstruction of transcriptomes, the sets of transcripts in cells, could improve downstream RNA-seq analyses such as expression quantification and differential analysis. Researchers could also use the developed methods to compare normal and diseased tissues and identify RNAs specific to the disease samples, which could then serve as biomarkers to help with diagnosis.

Shao said that because of the widespread use of transcriptomes in biomedical and biological research, he is excited about the myriad potential uses.

“Large-scale RNA-seq data has been deposited, and most of them are made publicly available to researchers,” Shao said. “So, it’s exciting that with our framework, we’ll now have scalability. It will really enable the assembly of tens of thousands of samples at the same time. We expect that our developed methods together with existing data could have high impact on biological and biomedical research.”

In addition to meta-assembly, another direction Shao’s group is exploring is the development of allele-specific assembly, allowing researchers to determine which expressed variant comes from which parent and perhaps mitigate genetic diseases. 

“Current RNA-seq assembly methods don’t distinguish between alleles, but maybe we could produce allele-specific assembly, for example, to tell people, ‘Those mRNAs are from the maternal side, and those are from the paternal side,’” Shao said.

Shao’s research also includes the reconstruction of mRNAs expressed in a single cell, instead of in a tissue, using so-called single-cell RNA-sequencing data. This single-cell work is partly supported by a $400,000 grant from the National Science Foundation.

Last Updated March 29, 2021