Zhu’s research addresses the problems created by the vast increase in DNA sequencing, which has turned biology into a data-intensive science.
“Over the past decade, DNA and RNA sequencing has become quick, easy and inexpensive,” Zhu said. “Sequencing has become indispensable for basic biological research and is increasingly serving biomedical diagnostics, trait association studies, gene expression analysis, drug resistance and other areas. All of these fields use sequences in different ways and are drowning in sequence data.”
Zhu said database overloads make it difficult to compute and analyze genome sequence data efficiently.
“As the sizes of the genome sequence databases grow, their computational demands are outpacing existing computing capacity,” he said. “This makes it even more difficult to complete an analysis. Primary data analysis now costs significantly more than generating the data in the first place.”
The research project focuses on a variety of approaches that use fixed-length strings, or subsequences (called “k-mers”), taken from genome sequences. Although researchers have given considerable attention to efficient indexing, storage and retrieval for large-scale k-mer sets over the past decade, most existing techniques require a computer with a very large main memory, which is not readily available to many biology labs. In addition, most techniques are optimized for exact matches, which limits the range of sequence analysis applications they can serve efficiently.
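To make the term concrete, the short Python sketch below shows one simple way to list and count the k-mers in a sequence. It is only an illustration of the definition, not the method used in Zhu’s project; the function names `kmers` and `count_kmers` and the sample sequence are assumptions made for this example, and the counting is done entirely in memory.

```python
from collections import Counter


def kmers(sequence, k):
    """Yield every length-k substring (k-mer) of sequence, left to right.

    Hypothetical helper for illustration only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(sequence, k):
    """Count how many times each k-mer occurs in sequence (in memory)."""
    return Counter(kmers(sequence, k))


# Made-up sample sequence, not real genome data.
counts = count_kmers("ACGTACGT", 3)
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

A real genome produces billions of such strings, so a table like this quickly outgrows the main memory of an ordinary workstation, which is the limitation described above.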
“Most existing methods for storing k-mers do not support multiple word lengths,” Zhu said. “For many sequence analysis problems, including assembly, variant detection and error correction, the use of multiple word lengths would allow better sensitivity and provide for more accurate sequence analysis.”
To overcome these issues, Zhu and his colleagues are investigating techniques for storing and querying large k-mer data sets. They will develop new data structures, building strategies, search algorithms and performance models.
“We expect to produce efficient on-disk approaches for storing and querying large-scale genome sequence databases,” Zhu said. “The research results will also impact other popular application areas such as biometrics, image processing, social networks and e-commerce: any field where non-ordered, discrete, multi-dimensional data is crucial.”
Zhu has many years of research experience in the database field, including developing centralized/distributed database systems. Visit his website for additional information about his research experience and interests.