How research computing empowers researchers, driving science and innovation


Posted November 2021

Kohane Lab Genome Analysis for UDN project

Written by: Gaby Rodriguez-Reillo and Amir Karger

The Undiagnosed Diseases Network (UDN) is a national network aiming to diagnose the most cryptic and unsolved genetic disorders in the country. Comprising 12 medical research sites - including the Department of Biomedical Informatics (DBMI) at Harvard Medical School - the UDN provides bioinformaticians with a large and unique repertoire of sequencing data. Shilpa Kobren from DBMI’s Kohane Lab had whole exome sequencing (WES) data from 1425 samples and whole genome sequencing (WGS) data from 3623 samples, totaling 398 tebibytes of data.

The data sets were generated at multiple sequencing centers, with varied technologies, and aligned to different human genome builds. To allow high-level (i.e., cohort-level) comparisons across samples, the data needed to be harmonized. The harmonization required significant amounts of computer memory to run, and it was arduous to shepherd thousands of samples through the complex analysis pipeline. Kobren was concerned that even the prodigious resources available through Amazon Web Services would not be sufficient to complete the project given time and budget constraints.

Kobren contacted HMS Research Computing (RC) to take advantage of the HMS computational resources and consulting available to the HMS community. Members of RC were able to help her in a number of areas:

  • Advice from the Research Data Management team on optimizing data workflows and identifying storage needs, so that appropriate storage could be provisioned to house the research data
  • Quick setup of licenses for the Sentieon analysis software on the O2 high performance compute (HPC) cluster
  • Easy access to a pre-installed set of bioinformatics tools such as bcftools and htslib
  • RC’s home-grown pipeline runner, rcbio, a tool that streamlines running complex pipelines over many samples by tracking failed jobs and rerunning them as needed
    • Kobren said, “I cannot rave enough about” the pipeline. “I was able to code these steps up in a single script... all tracking and job dependencies were taken care of and I received emails for successful and failed jobs. On AWS, the time to code and run these steps would have been significantly longer.”
  • Guidance from Research Computing consultants on how best to set up jobs and make optimal use of O2’s resources
  • Several weeks of increased resources (higher job priority on the O2 cluster and a doubling of “scratch” disk space), which allowed Kobren to run 600 jobs at a time
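
For readers curious what "tracking failed jobs and rerunning as needed" can look like, here is a minimal Python sketch of the general idea. It is not rcbio's code or interface: the function `run_pipeline`, its parameters, and the step names in the usage note are hypothetical, and the steps run as plain in-process functions rather than as jobs submitted to a cluster scheduler.

```python
from typing import Callable, Dict, List


def run_pipeline(
    steps: Dict[str, Callable[[], None]],
    depends_on: Dict[str, List[str]],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run each step after its prerequisites, retrying failed steps.

    Hypothetical illustration only. Assumes the dependency graph has no
    cycles. Returns a status per step: "ok", "failed", or "skipped".
    """
    status: Dict[str, str] = {}

    def run(name: str) -> None:
        if name in status:
            return  # already handled
        prerequisites = depends_on.get(name, [])
        for dep in prerequisites:
            run(dep)
        # A step whose prerequisite did not succeed is never started.
        if any(status[dep] != "ok" for dep in prerequisites):
            status[name] = "skipped"
            return
        # Rerun the step until it succeeds or attempts run out.
        for _ in range(max_attempts):
            try:
                steps[name]()
            except Exception:
                continue
            status[name] = "ok"
            return
        status[name] = "failed"

    for name in steps:
        run(name)
    return status
```

Calling `run_pipeline({"align": align, "call": call_variants}, {"call": ["align"]})` with two user-defined functions would run `align` first, retry it on failure, and start `call_variants` only once `align` has succeeded. A production runner would also need to submit jobs to the scheduler and send notifications, which this sketch leaves out.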

The combination of HPC resources, time-saving tools, and consulting allowed Kobren to complete the analysis in time and successfully present at the UDN conference:

“Thanks to O2 and RC’s ongoing and brilliant help, we were able to get 2924 WGS samples jointly-called, annotated, and quality controlled in time for our important July 14th presentation. ... Folks were very excited about the cohort search functionality and about having a uniform UDN dataset to work with.” - Shilpa Kobren