Toil enables reproducible, open source, big biomedical data analyses

J Vivian, AA Rao, FA Nothaft, C Ketchum, J Armstrong, A Novak, J Pfeil, J Narkizian…
Nature Biotechnology, 2017 (nature.com)
… open sharing of protocols. With a precise ontology to describe standardized protocols, it may be possible to share methods widely and create community standards. We envisage that in future individual research laboratories, or clusters of colocated laboratories, will have in-house, low-cost automation work cells but will access DNA foundries via the cloud to carry out complex experimental workflows. Technologies enabling this from companies such as Emerald Cloud Lab (S. San Francisco, CA, USA), Synthace (London) and Transcriptic (Menlo Park, CA, USA) could, for example, send experimental designs to foundries and return output data to a researcher. This ‘mixed economy’ should accelerate the development and sharing of standardized protocols and metrology standards and shift a growing proportion of molecular, cellular and synthetic biology into a fully quantitative and reproducible era.
To the Editor: Contemporary genomic data sets contain tens of thousands of samples and petabytes of sequencing data [1–3]. Pipelines to process genomic data sets often comprise dozens of individual steps, each with its own set of parameters [4, 5]. Processing data at this scale and complexity is expensive, can take an unacceptably long time, and requires significant engineering effort. Furthermore, biomedical data sets are often siloed, both for organizational and security reasons and because they are physically difficult to transfer between systems, owing to bandwidth limitations. The solution to these big data problems is twofold: first, we need robust software capable of running analyses quickly and efficiently, and second, we need the software and pipelines to be portable, so that analyses can be reproduced in any suitable compute environment. Here, we present Toil, portable, open-source workflow software that can be used to run scientific workflows on a large scale in cloud or high-performance computing (HPC) environments. Toil was created to include a complete set of features necessary for rapid large-scale analyses across multiple environments. While several other scientific workflow software packages [6–8] offer some subset of fault tolerance, cloud support and …