Mr. Poltavskiy Yaroslav
9350 Wilshire Boulevard Suite 203
Beverly Hills, California 90212
As you know, my laboratory at the Mayo Clinic studies somatic mosaicism in human cells. For this work we continuously generate and analyze whole-genome sequencing data for individual cells, clonal colonies, and various tissues of the human body. Some results of our analyses have already been published in respected scientific journals (Abyzov et al., Nature, 2012, PMID:23160490; McConnell et al., Science, 2017, PMID:28450582; Abyzov et al., Genome Research, 2017, PMID:28235832), while others are still under review. To date we have acquired about 50 TB of raw data (i.e., reads in gzipped fastq files) and generated a similar amount of processed data (i.e., alignments in bam format). It may be stating the obvious, but we were eager to compress the raw data more efficiently in order to reduce our storage costs.
My lab explored several options for lossless data compression and chose ALAPY, as it offered the best combination of compression, performance, and ease of use. In particular, compression with ALAPY was almost 2 times and 2.5 times more efficient than with BZIP and GZIP, respectively, while being only slightly slower. Minor modifications to our data processing pipelines were needed to handle input from ALAPY files, but these were practically effortless. More importantly, we have not detected any slowdown in processing or any differences in the final processed files.
We now routinely use ALAPY in my lab and recommend it for large-scale sequencing projects and wherever efficient storage of sequencing reads in fastq format is needed with no effect on downstream data processing and analysis. I personally see a strong advantage in using ALAPY compression for data sharing and distribution.
Alexej Abyzov, Ph.D.
Senior Associate Consultant
Assistant Professor of Medical Informatics