Minimises overhead for set up and processing of new projects.
We developed NGSANE, a Linux-based, HPC-enabled framework that minimises the overhead of setting up and processing new projects while maintaining the full flexibility of custom scripting and data provenance. NGSANE processes raw sequencing data either on a local cluster or on a third-party compute cloud (CSIRO, Amazon). It provides an interactive Web 2.0-based presentation layer for quality inspection as well as intuitive tools to interrogate the resulting data.
Academic tools will remain the methods of choice for cutting-edge data analysis. However, most do not comply with even very basic software-development practice (e.g. poor documentation, lack of legacy support), which makes setup and maintenance time-consuming. A similar issue applies to reference data sets, which need to be downloaded and often filtered and converted into a usable format.
Summarizing quality control and data yield in a meaningful way remains a labor-intensive expert task. Rather than battling these issues individually, it would be more efficient to set up a centralized system that is collectively maintained by the researchers who use it. Benefits would be:
Each project (Input) has a project-specific config file (A) holding the necessary customizations for the planned analysis tasks. Note that a project can have multiple config files, one for each analysis task.
Distinct from the project is the NGSANE core, which contains the pan-project configuration file (header.sh, B). This file contains general system variables, platform-specific parameters, and paths to the various software binaries installed on a system. It should be configured once upon initial installation, then modified whenever new software versions are required. Also in the core is the trigger.sh file (C), which is the main executable file in NGSANE. It processes the variables and tasks specified in the configuration files, ensuring that all dependencies are met and invoking the core job submission protocols. It allows the user to selectively launch a test or 'dry' run, launch a full high-performance computing run, or generate a summary report once the tasks have completed. The mod files (D) contain the generic analytic pipelines that are to be executed on the HPC cluster. Each mod corresponds to a specific analysis, a single task, or a series thereof. They include checkpoints to recover from previously failed executions, as well as comprehensive logging of each step. Advanced users can create customised mods and include them in the framework.
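The dry-run and checkpoint behaviour described above can be illustrated with a minimal sketch. NGSANE itself is driven by shell scripts, so the Python below is not its implementation. The function name run_tasks, the '<name>.done' marker-file convention, and the task names are assumptions made purely for illustration.

```python
import tempfile
from pathlib import Path


def run_tasks(tasks, checkpoint_dir, dry_run=False):
    """Run (name, function) pairs in order, skipping completed tasks.

    A task counts as completed when a marker file '<name>.done' exists in
    checkpoint_dir (a hypothetical convention chosen for this sketch).
    Returns a log of (task name, status) tuples.
    """
    checkpoint_dir = Path(checkpoint_dir)
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    log = []
    for name, func in tasks:
        marker = checkpoint_dir / f"{name}.done"
        if marker.exists():
            log.append((name, "skipped"))        # finished in an earlier run
        elif dry_run:
            log.append((name, "would run"))      # report only, execute nothing
        else:
            try:
                func()
            except Exception as err:
                log.append((name, f"failed: {err}"))
                break                            # later tasks depend on this one
            marker.touch()                       # record the checkpoint
            log.append((name, "done"))
    return log


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        steps = [("align", lambda: None), ("call_variants", lambda: None)]
        print(run_tasks(steps, tmp, dry_run=True))  # nothing is executed
        print(run_tasks(steps, tmp))                # both tasks run
        print(run_tasks(steps, tmp))                # both tasks are skipped
```

Because a marker is written only after a task succeeds, rerunning the same task list after a failure resumes at the failed task instead of repeating earlier work.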
After execution, a concise summary of the results and a project card (E, Figure 2) can be generated. This usually comprises general statistics of the results, graphs, potential errors, and an itemised log of the checkpoints for each task.
Processing genomic information at scale remains challenging due to the large investment associated with compute hardware and IT personnel, which is a barrier to entry for small laboratories and difficult to maintain at peak times for larger institutes. This hampers the creation of time-reliable production informatics environments for clinical genomics. Commercial cloud computing frameworks, like Amazon Web Services (AWS), provide an economical alternative to in-house compute clusters, as they allow outsourcing of computation to third-party providers while retaining the software and compute flexibility. To cater for this resource-hungry, fast-paced yet sensitive environment of personalized medicine, we have created an Amazon Machine Image (AMI) for NGSANE.
NGSANE has already been used to discover causal variants for ALS and Lynch syndrome, as well as for transcriptome analysis [5,6] and forward mutation experiments. NGSANE is available at https://github.com/BauerLab/ngsane/wiki
Genomic information is increasingly being used for medical research, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks [7,8]. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilized the recently developed Spark engine, which is less rigid in its parallelisation architecture and hence more suitable for bioinformatics tasks, along with its associated machine learning library, MLlib. We developed an interface from Spark and MLlib to the standard Variant Call Format (VCF), which opens up the use of advanced, efficient machine learning algorithms on genomic data.
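To give a sense of what such an interface has to do, the sketch below converts VCF genotype calls into per-sample numeric vectors, the kind of input a clustering algorithm expects. The source does not describe how genotypes are encoded. Counting non-reference alleles (0, 1, or 2 for a diploid call) is a common choice assumed here, and the function names are hypothetical. The sketch is plain Python and leaves out the Spark and MLlib steps entirely.

```python
def genotype_to_alt_count(gt):
    """Convert a VCF GT value such as '0/1' or '1|1' to the number of
    non-reference alleles; return None if any allele is missing ('.')."""
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def vcf_to_sample_vectors(lines):
    """Parse VCF text lines into {sample name: [alt-allele count per variant]}."""
    samples, vectors = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                              # meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]                  # sample names follow FORMAT
            vectors = {s: [] for s in samples}
            continue
        gt_index = fields[8].split(":").index("GT")   # position of GT in FORMAT
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_index]
            vectors[sample].append(genotype_to_alt_count(gt))
    return vectors


if __name__ == "__main__":
    example = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
        "1\t100\t.\tA\tG\t.\tPASS\t.\tGT:DP\t0/1:12\t1|1:9",
        "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t./.",
    ]
    print(vcf_to_sample_vectors(example))
```

In a distributed setting, each sample's vector would become one input row for the machine learning library. Missing calls (None here) would need an explicit policy, such as imputation or exclusion, before clustering.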
 Bauer DC, Gaff C, Dinger ME, Caramins M, Buske FA, Fenech M, Hansen D, Cobiac L “Genomics and personalised whole-of-life healthcare” Trends Mol Med. 2014 May 4. pii: S1471-4914(14)00062-8. PMID: 24801560.
 Buske FA, French HJ, Smith MA, Clark SJ, Bauer DC “NGSANE: Lightweight Production Informatics Framework for High Throughput Data Analysis” Bioinformatics (2014) 30 (10): 1471-1472 Jan 26. PMID: 24470576.
 Fifita JA, Williams KL, McCann EP, O’Brien A, Bauer DC, Nicholson GA, and Blair IP “Mutation analysis of Matrin 3 in Australian familial amyotrophic lateral sclerosis” Neurobiology of Aging 2014 Nov 20. pii: S0197-4580(14)00726-X. PMID: 25523636.
 Talseth-Palmer BA, Bauer DC, Sjursen W, Evans TJ, McPhillips M, Proietto A, Otton G, Spigelman AD and Scott RJ “Targeted next-generation sequencing identifies two Lynch syndrome families and a polygenic interaction that may cause cancer development in a third Lynch syndrome family” submitted
 Zong Hong Zhang, Dhanisha J. Jhaveri, Vikki M. Marshall, Denis C. Bauer, Janette Edson, Ramesh K. Narayanan, Gregory J. Robinson, Andreas E. Lundberg, Perry F. Bartlett, Naomi R. Wray, Qiongyi Zhao. (2014) “A comparative study of techniques for differential expression analysis on RNA-Seq data.” PLoS One, 2014 Aug 13;9(8):e103207. PMID: 25119138
 Barry G, Guennewig B, Fung S, Kaczorowski D, Weickert CS. Long Non-Coding RNA Expression during Aging in the Human Subependymal Zone. Front Neurol. 2015 Mar 9;6:45. doi: 10.3389/fneur.2015.00045. eCollection 2015. PubMed PMID: 25806019; PubMed Central PMCID: PMC4353253.
 Bauer DC, McMorran BJ, Foote SJ and Burgio G “Genome-wide analysis of chemically induced mutations in mouse in phenotype-driven screens” submitted
 Szul, P., Arzhaeva, Y., Domanski, L., Lagerstrom, R., Bauer, D., O'Brien, A., Wang, D., Nepal, S., Zic, J., Taylor, J., Bednarz, T., “Platform for Big Data and Visual Analytics” eResearch Australasia 2014, Melbourne.
 Aidan R O'Brien, Fabian A Buske, and Denis C Bauer “Scalable Clustering of Genotype Information using MapReduce” InCoB 2014, Sydney, Australia
 O’Brien A, Bauer D “VariantSpark: Applying Spark-based machine learning methods to genomic information” BigData 2015, Sydney
 O’Brien AR, Saunders N, Buske FA, Bauer DC “Population Scale Clustering of Genotype Information” submitted