Posts

Showing posts from November, 2012

Fetching data from Illumina BaseSpace

We are working to deploy Ray in Illumina BaseSpace. For our tests, we needed the data on our infrastructure in Québec City. First, I listed the objects for "2x150bp Human Genome in Record Time with the HiSeq 2500":

$ cat RawFiles.txt
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R1_001.fastq.gz?id=25033024&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R1_002.fastq.gz?id=25054488&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R2_001.fastq.gz?id=25081698&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L001_R2_002.fastq.gz?id=25123588&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R1_001.fastq.gz?id=25155266&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L002_R1_002.fastq.gz?id=25175449&appResultId=
https://basespace.illumina.com/sample/262264/files/raw/sorted_S1_L0
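As an illustration only, here is a minimal Python sketch of how each URL in RawFiles.txt could be split into a local file name and a file id before fetching. The function name describe_raw_file is hypothetical, and the sketch does not download anything, because the authentication step that BaseSpace requires is not described in this excerpt.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qs, urlsplit


def describe_raw_file(url):
    """Return (file name, file id) for one URL taken from RawFiles.txt.

    Hypothetical helper: it only parses the URL, it does not fetch it.
    """
    parts = urlsplit(url)
    # The last path component is the FASTQ file name.
    name = PurePosixPath(parts.path).name
    # keep_blank_values keeps the empty appResultId parameter in the result.
    query = parse_qs(parts.query, keep_blank_values=True)
    return name, query["id"][0]


example = (
    "https://basespace.illumina.com/sample/262264/files/raw/"
    "sorted_S1_L001_R1_001.fastq.gz?id=25033024&appResultId="
)
print(describe_raw_file(example))
```

The file name gives a natural local destination for each object, and the id is kept separately in case the fetching step needs it.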

The life cycle of a scientific manuscript

The manuscript:

Sebastien Boisvert, Frederic Raymond, Elenie Godzaridis, Francois Laviolette and Jacques Corbeil
Ray Meta: scalable de novo metagenome assembly and profiling
Genome Biology

For the authors' contributions, see the paper on the publisher's website. We started working on this project on 2012-02-25, according to "git log".

+------------+-----------------+------------------------------------------------------+
| Date       | Journal         | Event                                                |
+------------+-----------------+------------------------------------------------------+
| 2012-06-07 | Nature Genetics | Manuscript submission                                |
| 2012-06-12 | Nature Genetics | Editorial rejection                                  |
| 2012-06-15 | Nature Methods  | Manuscript submission                                |
| 2012-06-29 | Nature Methods  | Editorial rejection                                  |
| 2012-07-05 | Genome Research | Manuscript submission                                |
| 2012-07-10 | Genome Research | Editorial rejection                                  |
| 2012-07-11 | Genome Biology  | Presubmission enquiry                                |
| 2012-07-18 | Genome Biology  | Positive editorial response to presubmission enquiry |
| 2012-07-24 | Genome Biology  | Manuscript submission                                |
| 2012-09-18 | Genome Biology  |                                                      |
+------------+-----------------+------------------------------------------------------+

My accomplishments in 2012

Dear readers,

2012 was a busy year.

2012 accomplishments (starting with the most significant)

- Publication of a scientific paper in the really good journal Genome Biology (impact factor: 9.04)
- Ray Meta, free open-source software (a generalization of the Ray heuristics for metagenomes)
- The RayPlatform framework
- Invited participation as an expert at Next Generation Sequence Analysis 2012
- Invited seminar at Argonne National Laboratory
- Invited SciNet Developer Seminars (for computer scientists; for biologists and bioinformaticians) at SciNet in Toronto
- Technical guidance (for director Jacques Corbeil and codirector François Laviolette) for a 1 M$ Genome Canada application, 2012 Bioinformatics and Computational Biology Competition
- Creation of the mini-ranks hybrid programming model (with Fangfang Xia and Rick Stevens)
- Editorial rejection at Nature Genetics (2012-06-12)
- Editorial rejection at Nature Methods (2012-06-29)
- Editorial rejection at Genome Research (2012-07-10)

Table 1: Comparison of Ray instances with MPI ranks and mini-ranks.

+---------------------+-----------------------+--------------------------+
| Metric              | Ray with MPI ranks    | Ray with mini-ranks      |
+---------------------+-----------------------+--------------------------+
| Accession           | SRA010766             | SRA010766                |
| Description         | Jay T. Flatley genome | Jay T. Flatley genome    |
| Input files         | 164                   | 164                      |
| Input sequences     | 1593032322            | 1593032322               |
| Compression         | Bzip2                 | Bzip2                    |
| K-mer length        | 21                    | 21                       |
| MPI implementation  | Open-MPI 1.6.2        | Open-MPI 1.6.2           |
| Compiler            | GNU g++ 4.7.0         | GNU g++ 4.7.0            |
| Ray version         | 2.1.0-pre-release     | 2.1.1-devel (9cbf2a8277) |
| RayPlatform version | 1.1.0-pre-release     | 7.0.0-devel (7e38d17e0f) |
| Interconnect        | Mellanox MT26428      | Mellanox MT2

Commands for Debian packaging

# build the .deb
dpkg-buildpackage -rfakeroot

# check the produced .deb
lintian ray_2.1.0-1_amd64.deb

# check the .dsc
lintian ray_2.1.0-1.dsc

# check the changes
lintian ray_2.1.0-1_amd64.changes

# add an upstream tarball
pristine-tar commit

Cost Effectiveness Analysis (CEA) of running Ray on Amazon EC2

Sample: SRA001125
URL: http://trace.ddbj.nig.ac.jp/DRASearch/submission?acc=SRA001125
DNA reads: 34911784 (2 * 17455892)
Read length (nt): 36
Technology: Illumina Genome Analyzer

API name: m1.large
2 Ray processes
Running time: 05:28:46
Pricing: 0.260 $ / h
Cost: 1.560 $

API name: m3.xlarge
4 Ray processes
Running time: 02:31:34
Pricing: 0.580 $ / h
Cost: 1.730 $

API name: cc2.8xlarge
32 Ray processes
Running time: 00:54:06
Pricing: 2.400 $ / h
Cost: 2.400 $

Conclusions:

1. You get your results faster if you pay more.
2. For cc2.8xlarge, 33% (00:19:40) of the time was spent loading sequences from EBS. That's a lot!
3. The scalability on this problem is not that good because the problem size is not very large.
4. Amazon EC2 is really affordable for de novo assemblies of bacterial genomes.

If you want to try these tests yourself => http://github.com/sebhtml/Ray-in-Amazon-EC2-CLOUD
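The costs above look like the hourly price multiplied by the number of started hours. Here is a minimal Python sketch of that arithmetic, assuming every started hour is billed in full (an assumption, the post does not state the billing rule). The function names are hypothetical. Under this assumption the m1.large and cc2.8xlarge runs come to 1.560 $ and 2.400 $, and the m3.xlarge run comes to 1.740 $, one cent above the 1.730 $ listed.

```python
import math


def parse_hms(text):
    """Convert a running time written as HH:MM:SS into seconds."""
    hours, minutes, seconds = (int(field) for field in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def billed_cost(running_time, price_per_hour):
    """Cost of one run, assuming each started hour is billed in full."""
    billed_hours = math.ceil(parse_hms(running_time) / 3600)
    return billed_hours * price_per_hour


# (API name, running time, price in $ per hour), copied from the runs above.
runs = [
    ("m1.large", "05:28:46", 0.260),
    ("m3.xlarge", "02:31:34", 0.580),
    ("cc2.8xlarge", "00:54:06", 2.400),
]

for api_name, running_time, price in runs:
    print(f"{api_name}: {billed_cost(running_time, price):.3f} $")
```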

Comparing fastq compression with gzip, bzip2 and xz

Storage is expensive. Lossless compression reduces the storage requirements without discarding any data.

Sébastien Boisvert
2012-11-05

Table 1: Comparison of compression methods on SRR001665_1.fastq -- 10408224 DNA sequences of length 36. Tests were run on Fedora 17 x86_64 with an Intel Core i5-3360M processor and an Intel SSDSC2BW180A3 drive. Tests were not run in parallel. The time is the "real" entry from the time command. Each test was done twice.

+---------------------------------------------------------------+---------------------+-------------------+
| Compression                                                   | Time                | Size (bytes)      |
+---------------------------------------------------------------+---------------------+-------------------+
| none                                                          | 0                   | 2085533064 (100%) |
| time cat SRR001665_1.fastq | gzip -9 > SRR001665_1.fastq.gz   | 7m31.519s 7m20.340s | 376373619 (18%)   |
| time cat SRR001665_1.fastq | bzip2 -9 > SRR001665_1.fastq.bz2 | 3m12.601s 3m25.243s | 295000140 (14%)   |
| time cat SRR001665_1.fastq | xz -9 > SRR001665_1.fastq.xz     | 32m45.933s
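For readers who want to try the same kind of comparison without a large FASTQ file at hand, here is a minimal Python sketch using the gzip, bz2 and lzma modules of the standard library at their highest compression level. The synthetic reads produced by make_fastq are an assumption made for illustration. They are random, so the sizes will not match the table above, which was measured with the command-line tools on real data.

```python
import bz2
import gzip
import lzma
import random


def make_fastq(records, read_length=36, seed=0):
    """Build synthetic FASTQ content (illustrative data, not real reads)."""
    rng = random.Random(seed)
    entries = []
    for index in range(records):
        bases = "".join(rng.choice("ACGT") for _ in range(read_length))
        qualities = "".join(rng.choice("FGHI") for _ in range(read_length))
        entries.append(f"@read{index}\n{bases}\n+\n{qualities}\n")
    return "".join(entries).encode("ascii")


def compressed_sizes(data):
    """Return the compressed size in bytes for each method at level 9."""
    return {
        "gzip -9": len(gzip.compress(data, compresslevel=9)),
        "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
        "xz -9": len(lzma.compress(data, preset=9)),
    }


data = make_fastq(2000)
print(f"none: {len(data)} bytes (100%)")
for method, size in compressed_sizes(data).items():
    print(f"{method}: {size} bytes ({100 * size / len(data):.0f}%)")
```

The percentages are computed the same way as in the table: compressed size divided by the uncompressed size.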

Justification for access to a Microsoft(R) Windows(R) HPC(R) computer

** Problem

Current DNA sequencers (such as the Illumina(R) HiSeq(R) 2500) produce more than 6,000,000,000 digital DNA sequences, each between 100 and 200 letters long (A, T, C, G), in a single run. One of the possible types of analysis is "de novo genome assembly".

** Software system

My software is called Ray and is written in C++ 1998 (Microsoft(R) Visual Studio(R) 2010 fully supports this standard). An MPI message-passing library is also required. MPICH2 and Open-MPI are the two that are available as binary distributions for Microsoft(R) Windows(R). Ray is free software (GPLv3 license) and uses the RayPlatform parallel library (license: LGPLv3).

- http://www.ohloh.net/p/ray-assembler
- http://www.ohloh.net/p/rayplatform
- http://denovoassembler.sourceforge.net/

Ray performs de novo assembly of genomes or metagenomes for the life sciences industry (genomics). Ray scales very well for

New "mini-ranks" hybrid programming model.

Table 1: Comparison of MPI ranks with mini-ranks on the Colosse super-computer at Laval University.

+-------+---------------------------------------------------+
| Cores | Average round-trip latency (us)                   |
+-------+-----------------------+---------------------------+
|       | MPI ranks             | mini-ranks                |
|       | (pure MPI)            | (MPI + pthread)           |
+-------+-----------------------+---------------------------+
| 8     | 11.25 +/- 0           | 24.1429 +/- 0             |
| 16    | 35.875 +/- 6.92369    | 43.0179 +/- 8.76275       |
| 32    | 66.3125 +/- 6.76387   | 41.7143 +/- 1.23924       |
| 64    | 90 +/- 16.5265        | 37.75 +/- 6.41984         |
| 128   | 126.562 +/- 25.0116   | 43.0179 +/- 8.76275       |
| 256   | 203.637 +/- 67.4579   | 44.6429 +/- 6.11862       |
| 512   |                       |                           |
+-------+-----------------------+---------------------------+