Reproducibility through workflow sharing
Scientists often wish to replicate a previously published result, using either the original dataset or their own. However, replication studies are often so costly to set up that they are never attempted. Currently, the scientist has to understand the method described in the publication, then download, configure, and run all the appropriate software. The software must be installed in the scientist’s lab, and in some instances execution may require access to computing resources such as clusters. Some of the software tools used in the prior study may be proprietary, so either they must be purchased or open software with similar function must be identified. Finally, the analysis tools must be set up correctly, with all their parameters configured appropriately for the data at hand. Often, the authors must be contacted to clarify assumptions that were omitted from the publication. Ideally, a system would already be set up so that the end-to-end analysis software used in a previous study can be readily re-executed. Moreover, if any characteristics of the new dataset are incompatible with the analysis, the system should alert the user and assist them in adapting the method to the new dataset.
Reproducibility in Biomedical Research
We investigated the use of a web portal with shared workflows to replicate previously published studies. We collaborated with Dr. Chris Mason of Cornell University and with members of the NIH-funded Center for Genomic Studies on Mental Disorders (CGSMD) to develop a suite of workflows for genomic analysis, including population studies, inter- and intra-family studies, and next generation sequencing. Our initial results show that previously published findings can be replicated easily using workflows. The original published studies did not use a workflow system, and all the software components were executed by hand. We replicated the results of a previous genome-wide association study [Duerr et al 06] that found a significant association between the IL23R gene on chromosome 1p31 and Crohn’s disease. To reproduce the result, we used one of the workflows we had already created for association testing, which uses the Cochran-Mantel-Haenszel (CMH) association statistic to perform an association test conditional on the processing done in the population stratification step. The software components used in this workflow include codes from the widely used Plink and R software packages. We obtained the data of the original study from the authors. The run time of the workflow was 19.3 hours. Most workflow components took minutes to execute. We also replicated another previously published result for CNV association [Bayrakli et al., 2007] for early-onset Parkinson’s disease. Using one of our workflows for CNV detection, we saw a CNV at the expected locus over the PARK2 gene (chr6:146,350,000-146,520,00) that was found in the original study. Our workflow had 16 steps and ran in 34 minutes.
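As an illustration of the kind of statistic the association workflow computes, the following is a minimal sketch of the Cochran-Mantel-Haenszel statistic for a set of 2x2 tables, one per stratum. It is not the Plink implementation used in the workflow. The function name, the table layout (rows as case and control, columns as allele present and absent), and the omission of a continuity correction are assumptions made for this example. The result is usually compared against a chi-square distribution with one degree of freedom.

```python
def cmh_statistic(strata):
    """Cochran-Mantel-Haenszel statistic without continuity correction.

    strata: iterable of (a, b, c, d) counts, one 2x2 table per stratum
            (hypothetical layout: a, b = cases; c, d = controls).
    """
    observed_minus_expected = 0.0
    variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        # Expected value and variance of cell a under no association.
        observed_minus_expected += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        raise ValueError("no informative strata")
    return observed_minus_expected ** 2 / variance


# Hypothetical counts for two population clusters.
statistic = cmh_statistic([(10, 20, 20, 10), (12, 18, 18, 12)])
```

Pooling the strata in this way, rather than summing the raw counts into one table, is what makes the test conditional on the clusters produced by the stratification step.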
Some observations that we make from these replication studies are:
- Workflow systems enable efficient setup of analyses. The replication studies took seconds to set up. There was no overhead incurred in downloading or setting up software tools, reading documentation, or typing commands to execute each of the steps of the analysis.
- A library of carefully crafted workflows of select state-of-the-art methods can cover a very large range of genomic analyses. The workflows that we used to replicate the results were developed independently and used unchanged. They were designed with no knowledge of the original studies; instead, they targeted datasets on mental disorders.
- It is important to abstract the conceptual analysis being carried out away from the details of the execution environment. The software components used in the original studies were not the same as those in our workflows. In the original study for Crohn’s disease, the CMH statistic was computed with R, and the rest of the steps were done with R and the FBAT software, while our workflow used the CMH statistic and the association test from Plink and the plotting from R. Our workflows are described in an abstract fashion, independent of the specific software components executed. In the original Parkinson’s study, the Circular Binary Segmentation (CBS) algorithm for CNV detection was used, while our workflow used a state-of-the-art method that combines evidence from three newer algorithms. Our workflows thus contain more recent methods that can be readily applied.
- Semantic constraints can be added to workflows to avoid analysis errors. The first workflow that we submitted with the original Crohn’s disease study dataset failed. Examining the execution trace and the documentation of the association test software, we realized that the input cannot contain duplicate individuals. Upon manual examination, we discovered that there were three duplicated individuals in the dataset. We removed them by hand and the workflow executed with no problems. The workflow now includes a constraint that the input data for the association test cannot contain duplicate individuals, which results in the prior step having a parameter set to remove duplicates. The time savings to future users could be significant.
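The duplicate-individual constraint described in the last observation can be sketched as a simple input check. This is an illustrative sketch only, not the constraint mechanism of the workflow system. It assumes each record begins with a family ID and an individual ID, as in a Plink .fam file, and all function names are hypothetical.

```python
from collections import Counter


def find_duplicate_individuals(records):
    """Return the (family_id, individual_id) pairs that occur more than once."""
    counts = Counter((record[0], record[1]) for record in records)
    return sorted(key for key, count in counts.items() if count > 1)


def drop_duplicate_individuals(records):
    """Keep the first record for each individual, preserving input order."""
    seen = set()
    kept = []
    for record in records:
        key = (record[0], record[1])
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


def check_association_input(records):
    """Constraint on the association step: no duplicate individuals allowed."""
    duplicates = find_duplicate_individuals(records)
    if duplicates:
        raise ValueError(f"duplicate individuals in input: {duplicates}")
    return records


# Hypothetical records: (family_id, individual_id, phenotype).
samples = [("F1", "I1", 2), ("F1", "I2", 1), ("F1", "I1", 2), ("F2", "I1", 1)]
clean = check_association_input(drop_duplicate_individuals(samples))
```

Running the check before the association step turns a failure deep in the run into an immediate, explainable error, and the de-duplication function stands in for the parameter that the constraint sets on the prior step.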
Through this work, we found that access to medical datasets is challenging, which limits the wider reuse of the workflows in our library. Biology data raises fewer privacy concerns and is broadly accessible, so we turned our efforts in that direction.
Reproducibility in Biology Research
We have a collaboration with Dr. Philip Bourne of the Skaggs School of Pharmacy at the University of California San Diego, who has been interested in workflow reuse for some years and has written editorials about it in his journal, PLoS Computational Biology [Bourne 10]. His group has been focusing on systematic approaches to drug discovery. Currently, drug discovery is a slow, serendipitous process that could be significantly accelerated through systematic search and method reuse. They have developed a method to derive the drug-target network of an organism, i.e., its “drugome.” The method involves analyzing the proteome of a given organism against all FDA-approved drugs. The process uncovers protein receptors in the organism that could be targeted by drugs currently in use for other purposes. The development of the drugome for tuberculosis using this method took two years, and unveiled several drugs that were not previously known to have an effect on the disease. We have created workflows that capture this complex method and that can be easily run on other organisms, have published them as linked open data on the web [Garijo and Gil 11], and have preliminary results that quantify the effort saved by using this method [Garijo et al 11]. Elsevier complemented the funds under this NSF award with funds to support a student intern in our group to work in this area, demonstrating that scientific publishers are aware of the value of capturing the software apparatus of computational methods that are currently described informally in published articles.
As part of this research, we created abstract workflows that describe the method steps in a way that is independent of the specific software tools and implementations used. To export the abstract workflow as linked open data, we created an extension (profile) of OPM (described in Section 2.1 of this report) called OPMW. The abstract and executed workflows were published using the OPM standard as Linked Open Data, so that the entire workflow and its components can be accessed as web objects. This means that each dataset (initial, intermediate, and final), each workflow component (scripts, software tools, data conversions), and the entire workflow structure are accessible as individual and interlinked web objects.
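The separation between an abstract workflow and its executions can be sketched as follows. This is a simplified illustration, not the OPMW vocabulary or our publication pipeline. The example.org namespace, the class names, and the relation names are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass

BASE = "http://example.org/workflow/"  # hypothetical namespace


@dataclass(frozen=True)
class AbstractStep:
    """A method step described independently of any software tool."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class ExecutedStep:
    """One run of an abstract step, bound to a concrete tool."""
    template: AbstractStep
    tool: str


def uri(kind, name):
    return f"{BASE}{kind}/{name}"


def to_triples(executed_steps):
    """Describe runs as (subject, relation, object) triples of web identifiers."""
    triples = []
    for run in executed_steps:
        step_uri = uri("step", run.template.name)
        run_uri = uri("run", f"{run.template.name}-{run.tool}")
        triples.append((run_uri, "correspondsTo", step_uri))
        triples.append((run_uri, "executedWith", uri("tool", run.tool)))
        for dataset in run.template.inputs:
            triples.append((step_uri, "uses", uri("data", dataset)))
        for dataset in run.template.outputs:
            triples.append((uri("data", dataset), "isGeneratedBy", step_uri))
    return triples


# The same abstract association test, executed with two different tools.
association = AbstractStep("association_test", ("genotypes",), ("associations",))
runs = [ExecutedStep(association, "R"), ExecutedStep(association, "Plink")]
triples = to_triples(runs)
```

Because both runs point back to the same abstract step, a reader can follow the links from either execution to the general method, and from the method to every dataset it uses or generates.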
Our approach is significant and transformational. First, it makes workflows free of particular workflow systems or execution environments. Second, it makes data free of particular data catalogs or repositories, as datasets and their metadata become directly accessible as web objects. Third, it makes the scientific methods published in the literature free of particular software tools or implementations, as the abstract workflow separates the general method from the particulars of the execution environment that appear in the executed workflow.
Details about this project are described here.
We will continue to work in this area beyond this project, as the ideas are gaining traction in the workflow community, the provenance community, the scientific publishing community, and in discussions about the future of scientific scholarship. The potential to change scientific practice is enormous.
This work is reported in the following publications:
* “From Data to Knowledge to Discoveries: Scientific Workflows and Artificial Intelligence.” Yolanda Gil. Scientific Programming, 17(3), 2009.
* “A New Approach for Publishing Workflows: Abstractions, Standards, and Linked Data.” Daniel Garijo and Yolanda Gil. Proceedings of the Sixth Workshop on Workflows in Support of Large-Scale Science (WORKS'11), held in conjunction with Supercomputing, Seattle, Washington, November 2011.