Chinese genomics giant BGI releases latest bioinformatics software and datasets

BGI provides a wide range of bioinformatics software and cloud-based green computing solutions for massive data generation and computation; large-scale datasets hosted in GigaDB with citable DOIs promote extremely rapid data release and dissemination

BGI has announced several bioinformatics analysis pipelines and software, including assembly and binning tools, genetic variation software, as well as two cloud-based green solutions for genomic-based research. In addition, GigaScience, an upcoming research journal published by BGI, announces the launch of its new, freely accessible, large-scale database: GigaDB. The launch of GigaDB is heralded by today's release of numerous large datasets of different types and from a variety of organisms. GigaDB is unique because it is directly affiliated with a journal and all of its datasets are assigned a Digital Object Identifier (DOI), which allows these data to be directly cited in future publications.

New Software and Pipelines

Today, on the first day of the "6th International Conference of Genomics" (ICG-6) hosted by BGI, BGI researchers announced updated and newly available bioinformatics applications, pipelines, and tools. These include the Short Oligonucleotide Analysis Package (SOAP) series and cloud-based software (Hecate 2, Gaea 2, GAMA, GSNP, and Adam) for next-generation sequencing data analysis, among others.

According to BGI's researchers, the updated SOAP series released today includes SOAP3, a GPU-accelerated short read alignment tool; SOAPindel, an indel finder; SOAPfusion, a gene fusion detector; SOAPsplice, a splice-junction detector; SOAPdenovo-Trans, a de novo transcriptome assembler; and Metacluster 4.0, a binning tool for metagenomic data. The SOAP toolkit is freely available at http://soap.genomics.org.cn.

Dr. Zhiyu Peng, Vice President of Research & Cooperation Division at BGI, gave a detailed introduction to SOAPsplice and SOAPfusion, two RNA-Seq-based analytic tools designed specifically to detect splice junctions and gene fusions, respectively. Tests of SOAPsplice on both simulated and real datasets showed high sensitivity and high specificity, advantages that become more pronounced at low sequencing depth. Analyses using SOAPfusion showed that it has the highest sensitivity and lowest false discovery rate of all currently published gene-fusion detection tools.

In regard to these new tools, Dr. Peng stated that the "Emergence of the RNA-Seq technology provides unprecedented opportunities and accelerates the speed in the detection of fusion genes and splice junction sites. In particular, the gene fusion discovery performed by SOAPfusion provides an accurate and specific way which will greatly accelerate the study of genomic alternations in cancer as well as the therapeutic cancer studies."

SOAPdenovo-Trans is an assembler designed to handle alternative splicing and differing expression levels among transcripts for de novo transcriptome assembly using short RNA-Seq reads. Discussing this assembler, Dr. Yin Long Xie, Senior Bioinformatician of BGI, said, "We evaluated SOAPdenovo-Trans on samples of mouse and rice as the animal and plant models, and the results showed this assembler could provide a more accurate, complete and faster way to construct the transcript sets."

Metagenome studies are another area that requires extensive next-generation sequencing data analysis. Metagenomic data pose a fundamental computational problem known as binning: how to group together sequence reads that come from the same or similar species. At the release conference, Prof. Sim-Ming Yiu from the University of Hong Kong gave a presentation on some existing solutions and on Metacluster 4.0, the latest software tool for solving this binning problem. According to Prof. Yiu, the tool can handle 100 species at varying abundance ratios.
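To make the binning problem concrete, the following is a minimal sketch of one generic approach: grouping reads by the similarity of their k-mer composition. It is an illustration only and is not Metacluster's algorithm; the function names, the choice of k = 2, the greedy grouping strategy, and the distance threshold are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

K = 2  # assumption: a very short k keeps the toy example readable


def kmer_profile(read, k=K):
    """Return the normalized frequency of every possible DNA k-mer in a read."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def distance(a, b):
    """Euclidean distance between two k-mer profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def bin_reads(reads, threshold):
    """Greedy binning: a read joins the first bin whose representative
    profile lies within `threshold`; otherwise it starts a new bin."""
    bins = []  # each entry is (representative_profile, member_reads)
    for read in reads:
        profile = kmer_profile(read)
        for representative, members in bins:
            if distance(representative, profile) <= threshold:
                members.append(read)
                break
        else:
            bins.append((profile, [read]))
    return [members for _, members in bins]


reads = ["ATATATATAT", "TATATATATA", "GCGCGCGCGC", "CGCGCGCGCG"]
print(bin_reads(reads, threshold=0.5))
```

In this toy input the AT-rich reads and the GC-rich reads fall into two separate bins. Real metagenomic reads are far noisier, and species at low abundance are harder to separate, which is the difficulty the varying-abundance claim above refers to.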

Cloud-based Green Solutions

With the rapid development of high-throughput sequencing technology over the past ten years, genomic studies have gradually become a standard approach in a wide range of research areas. Given that such research creates huge amounts of data, cloud computing is becoming a favorable solution for large-scale bioinformatic analysis in terms of resource utilization, flexibility, and efficiency, as well as time and cost savings for massive data generation and computation.

Many IT companies and large genomics organizations have been gradually shifting their analytical methods to cloud-based green (more energy-efficient) solutions for processing enormous amounts of biological data. "With the cooperation with BGI, we have made many achievements in software development on green cloud computing," said Dr. Mian Lu from the Hong Kong University of Science and Technology. "A data processing pipeline has been re-implemented on GPU platform, and we have improved its efficiency: which could take only 6 hours to finish processing the data which needed 90 hours before."

One important green solution that cloud computing provides relies on the much shorter computation times achieved by software developed for specialized hardware. GSNP and GAMA are two discovery tools for genetic variation implemented on the GPU platform. GSNP is used to detect single-nucleotide polymorphisms, and GAMA is a software tool used to estimate allelic frequencies. Compared with its predecessor SOAPsnp, GSNP achieves higher performance through an improved sparse representation of base information and massive data parallelism on the GPU. Dr. Lu noted, "Within about 2 hours, a former three days process on human genome, can be done using GSNP." The original version of GAMA could take a year or more to compute the allele frequencies for a group of 1,000 individuals; however, Dr. Lu noted that the new version of "GAMA can generate the result in two days."
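For readers unfamiliar with the quantity GAMA estimates, the sketch below shows the textbook definition of an allele frequency computed from known diploid genotypes. It is not GAMA's method, which works from sequencing data at a much larger scale; the function name and the two-character genotype encoding are assumptions made for this illustration.

```python
def allele_frequency(genotypes, allele):
    """Fraction of all allele copies that match `allele`.

    `genotypes` is a list of diploid genotypes written as two-character
    strings such as "AG" (an assumed encoding for this example).
    """
    total_copies = 2 * len(genotypes)  # two allele copies per individual
    if total_copies == 0:
        raise ValueError("no genotypes given")
    matching = sum(genotype.count(allele) for genotype in genotypes)
    return matching / total_copies


# Four individuals: one AA, two AG, one GG
print(allele_frequency(["AA", "AG", "GG", "AG"], "A"))
```

Here the A allele accounts for 4 of the 8 copies, so its frequency is 0.5. The computational cost in practice comes from repeating an estimate of this kind across the whole genome for many individuals while accounting for sequencing uncertainty.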

Dr. Lu also described another tool, Adam, which was "developed by exploiting hardware features, which could sort and remove duplicate from massive data. Its performance has been improved by three times, handling 150GB data with a node of 25GB memory." For further information about the new software and pipelines, please visit http://jil.genomics.org.cn.
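Sorting and de-duplicating a dataset larger than available memory is commonly done with an external merge sort. The sketch below illustrates that general technique only; it is not Adam's implementation, and the function name, the line-based record format, and the `chunk_size` parameter are assumptions made for this example.

```python
import heapq
import os
import tempfile


def external_sort_unique(lines, chunk_size):
    """Yield sorted, de-duplicated single-line records.

    At most `chunk_size` records are held in memory while chunks are
    built; each sorted chunk is spilled to a temporary file, and the
    chunks are then merged in a streaming fashion.
    """
    chunk_paths = []
    chunk = []

    def flush():
        # Sort and de-duplicate the in-memory chunk, then write it to disk.
        if not chunk:
            return
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as handle:
            handle.writelines(item + "\n" for item in sorted(set(chunk)))
        chunk_paths.append(path)
        chunk.clear()

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) >= chunk_size:
            flush()
    flush()

    handles = [open(path) for path in chunk_paths]
    try:
        # Strip newlines before merging so comparisons use the record itself.
        streams = [(line.rstrip("\n") for line in handle) for handle in handles]
        previous = None
        for item in heapq.merge(*streams):
            if item != previous:  # duplicates are adjacent after merging
                yield item
                previous = item
    finally:
        for handle in handles:
            handle.close()
        for path in chunk_paths:
            os.remove(path)


print(list(external_sort_unique(["b", "a", "c", "a", "b", "d"], chunk_size=2)))
```

Because every chunk is already sorted, the merge step only needs to keep one record per chunk in memory, which is why a dataset several times larger than memory can be processed this way.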

In addition to announcing new software for specialized hardware, the BGI Bioinformatics Department also unveiled updated "flexible computing" solutions for de novo assembly and resequencing analyses: Hecate 2 and Gaea 2. The original versions, Hecate 1 and Gaea 1, were released in July of this year and drew significant attention worldwide from biological researchers and news reporters.

Compared with the original version, Hecate 2 offers greater scalability, especially in terms of cost and time. "Hecate 2 adopts more sophisticated models for solving massive scale constraint optimization problems in de novo assembly in a fine-grained manner, which enables data from different sequencing platform to be assembled simultaneously and leads a dramatic improvement of the assembly quality in terms of accuracy, length and coverage," said Evan Xiang, R&D Director at the Flexible Computing Center of BGI.

Xiang also commented on Gaea 2, saying that its processing speed increases linearly with cluster size, and added that "the performance of Gaea 2 could surpass current available alignment software by aggregating their advanced functionalities into a unified cloud based solution."

GigaDB launched with release of 17 new large-scale datasets

GigaDB hosts publicly available, large-scale datasets and provides every dataset with a unique DOI. A DOI enables researchers to reference these datasets specifically in independent publications that use the data. GigaDB is associated with GigaScience, an upcoming research journal published by BGI and BioMed Central.

Today's launch of GigaDB is accompanied by the release of seventeen large datasets on top of those already hosted, such as the genome of the recent deadly outbreak strain of E. coli O104. These datasets now span much of the tree of life, with data hosted from plants, animals (vertebrate and invertebrate), and microbes. The plant data include whole-genome data from the foxtail millet, the potato, the Chinese cabbage, the domestic cucumber, the pigeonpea, and sweet and grain sorghums. The animal data include whole-genome data from three species of ants, a roundworm (Ascaris suum), the naked mole rat, the domestic sheep, domestic and wild silkworms, the Tibetan antelope, and three different datasets (whole genome, transcriptome, and methylome) from a single Asian man.

These data are all freely accessible and will be of great use for analyses in a wide range of life-science fields. The DOI issued to each dataset allows researchers to cite the data directly, as a separate entity from the data analysis papers. This is a major step in promoting extremely rapid data release. Because data can now be cited directly, data producers can be properly acknowledged and recognized for their work and no longer need to wait to release the data until a more extensive analysis paper has been written, reviewed, revised, and published. Additionally, DOIs make these data permanently accessible, easy to find and use, and available for replicating previous work. Five of the newly released GigaDB datasets illustrate the future of early data release: they are made available with a DOI, allowing the data producers to receive citable credit, for rapid use by the community before the analysis papers are published. The analysis paper for the sorghum genome has recently been accepted by Genome Biology and is expected to be published later this month, demonstrating a new gold standard of placing a dataset citation in the references where it can be easily tracked.