ReadDB Provides Efficient Storage for Mapped Short Reads
Background: The advent of high-throughput sequencing has enabled
sequencing-based measurements of cellular function, with an individual
measurement potentially consisting of more than 10^8 reads. While tools are
available for aligning sets of reads to genomes and interpreting the results,
fewer tools have been developed to address the storage and retrieval
requirements of large collections of aligned datasets.
We present ReadDB, a network-accessible column-store database system for
aligned high-throughput read datasets.
Results: ReadDB stores collections of aligned read positions and provides a
client interface to support visualization and analysis. ReadDB is implemented
as a network server that answers queries on genomic intervals in an experiment,
returning either the set of reads contained in the interval or a
histogram-based summary of the interval.
Tests on datasets ranging from 10^5 to 10^8 reads demonstrate that ReadDB
performance is generally within a factor of two of local-storage-based methods
and often three to five times better than other network-based methods.
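
The two query types described above (the reads contained in an interval, or a
histogram summary of it) can be illustrated with a minimal in-memory sketch.
This is not ReadDB's implementation or API: the class AlignedReadStore, its
method names, and the half-open interval convention are all assumptions made
for illustration, and the network and on-disk layers are omitted. It only
shows why keeping each chromosome's positions as one sorted column makes both
queries cheap.

```python
from bisect import bisect_left
from collections import defaultdict


class AlignedReadStore:
    """Hypothetical in-memory stand-in for one experiment's read positions."""

    def __init__(self):
        # One sorted "column" of read start positions per chromosome.
        self._positions = defaultdict(list)

    def load(self, hits):
        """Add (chromosome, position) pairs, then re-sort every column."""
        for chrom, pos in hits:
            self._positions[chrom].append(pos)
        for column in self._positions.values():
            column.sort()

    def reads_in(self, chrom, start, end):
        """Return the positions p with start <= p < end (assumed convention)."""
        column = self._positions.get(chrom, [])
        lo = bisect_left(column, start)  # first index with p >= start
        hi = bisect_left(column, end)    # first index with p >= end
        return column[lo:hi]

    def histogram(self, chrom, start, end, bin_width):
        """Return {bin_start: count} for the non-empty bins in [start, end)."""
        counts = {}
        for pos in self.reads_in(chrom, start, end):
            bin_start = start + ((pos - start) // bin_width) * bin_width
            counts[bin_start] = counts.get(bin_start, 0) + 1
        return counts


store = AlignedReadStore()
store.load([("chr1", 150), ("chr1", 100), ("chr1", 120),
            ("chr1", 400), ("chr2", 130)])
contained = store.reads_in("chr1", 100, 200)
summary = store.histogram("chr1", 100, 500, 100)
```

Because each column is sorted, an interval query needs only two binary
searches, and the histogram is a single pass over the reads that fall inside
the interval.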
Conclusions: ReadDB is a high-performance foundation for ChIP-Seq and RNA-Seq
analysis. The client-server model provides convenient access to compute cluster
nodes or desktop visualization software without requiring a shared network
filesystem or large amounts of local storage.
The client code provides a simple interface that gives visualization and
analysis software fast access to the data. ReadDB provides a new way to store
genome-aligned reads for use in applications where read sequence and alignment
mismatches are not needed.