AEHRC Cloud Based Technologies

Minimises overhead for set up and processing of new projects.


We developed NGSANE [2], a Linux-based, HPC-enabled framework that minimises the overhead of setting up and processing new projects while retaining the full flexibility of custom scripting and data provenance. NGSANE processes raw sequencing data either on a local cluster or on a third-party compute cloud (CSIRO, Amazon). It provides an interactive Web 2.0-based presentation layer for quality inspection as well as intuitive tools to interrogate the resulting data.
NGSANE Software overview diagram

Crowd-sourcing not Wheel-reinvention

Academic tools will remain the methods of choice for cutting-edge data analysis. However, most do not comply with even very basic software-development practice (e.g. they have poor documentation and lack legacy support), which makes setting them up and maintaining them time-consuming. A similar issue applies to reference data sets, which need to be downloaded and often filtered and converted into a usable format.

Summarizing quality control and data yield in a meaningful way remains a labor-intensive expert task. Rather than each group battling these issues individually, it would be more efficient to set up a centralized system that is collectively maintained by the researchers who use it. The benefits would be:

  • Sharing modular methods/scripts for data analysis and summary
  • Optimized task-packaging for efficient HPC resource utilization
  • Ensuring consistency and reproducibility by keeping scripts separate from data
  • Benchmarking quality amongst other datasets within your organization
  • Enabling collaborative knowledge gain
  • Making developers’ expert knowledge available to users by requiring every script to include a self-contained quality control stage

How it works

Each project (Input) has a project-specific config file (A) holding the necessary customizations for the planned analysis tasks. Note that a project can have multiple config files, one for each analysis task.

NGSANE software flowchart

Distinct from the project is the NGSANE core, which contains the pan-project configuration file (B). This file contains general system variables, platform-specific parameters, and paths to the various software binaries installed on the system. It should be configured once upon initial installation and then modified whenever new software versions are required.

Also in the core is the file (C), which is the main executable file in NGSANE. It processes the variables and tasks specified in the configuration files, ensures that all dependencies are met, and invokes the core job submission protocols. It allows the user to launch a test ('dry') run or a full high-performance computing run, or to generate a summary report once the tasks have completed.

The mod files (D) contain the generic analytic pipelines that are executed on the HPC cluster. Each mod corresponds to a specific analysis, a single task, or a series thereof. Mods include checkpoints for recovering from previously failed executions, as well as comprehensive logging of each step. Advanced users can create customised mods and include them in the framework.
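The dry-run and checkpoint behaviour described above can be illustrated with a minimal sketch. It is not NGSANE's implementation: the function `run_pipeline`, the JSON checkpoint file, and the step names `trim` and `align` are all hypothetical and chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(steps, checkpoint_file, dry_run=False):
    """Run (name, callable) steps in order, skipping steps already completed.

    Hypothetical helper, not part of NGSANE. Returns the names of the steps
    that were executed, or that would be executed in a dry run.
    """
    checkpoint_file = Path(checkpoint_file)
    done = set()
    if checkpoint_file.exists():
        # Steps recorded by an earlier (possibly failed) run.
        done = set(json.loads(checkpoint_file.read_text()))

    executed = []
    for name, action in steps:
        if name in done:
            continue  # checkpoint hit: this step already succeeded
        executed.append(name)
        if dry_run:
            continue  # only report what would run
        action()  # may raise; steps recorded so far stay in the checkpoint
        done.add(name)
        checkpoint_file.write_text(json.dumps(sorted(done)))
    return executed


# Usage: the "align" step fails on its first attempt and succeeds on the second.
attempts = {"align": 0}


def trim():
    pass


def align():
    attempts["align"] += 1
    if attempts["align"] == 1:
        raise RuntimeError("simulated node failure")


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    steps = [("trim", trim), ("align", align)]
    planned = run_pipeline(steps, ckpt, dry_run=True)  # nothing is executed
    try:
        run_pipeline(steps, ckpt)  # "trim" succeeds, "align" fails
    except RuntimeError:
        pass
    resumed = run_pipeline(steps, ckpt)  # only "align" is run again
```

Writing the checkpoint after every successful step, rather than once at the end, is what lets a rerun pick up at the failed step instead of repeating completed work.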

After execution, a concise summary of the results and a project-card (E, Figure 2) can be generated. This usually includes general statistics of the results, including graphs, potential errors, and an itemised log of the checkpoints for each task.

NGSANE in the cloud

Processing genomic information at scale remains challenging due to the large investment required for compute hardware and IT personnel, which is a barrier to entry for small laboratories and is difficult for larger institutes to sustain at peak times. This hampers the creation of time-reliable production informatics environments for clinical genomics. Commercial cloud computing frameworks, like Amazon Web Services (AWS), provide an economical alternative to in-house compute clusters because they allow computation to be outsourced to third-party providers while retaining software and compute flexibility. To cater for the resource-hungry, fast-paced yet sensitive environment of personalized medicine, we have created an Amazon Machine Image (AMI) for NGSANE.

NGSANE applications

NGSANE has already been used to discover causal variants for ALS [3] and Lynch syndrome [4], as well as for transcriptome analysis [5,6] and forward mutation experiments [7]. NGSANE is available at


Genomic information is increasingly being used for medical research, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks [7,8]. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilized the recently developed Spark engine, which has a less rigid parallelisation architecture and is hence more suitable for bioinformatics tasks, along with its associated machine learning library, MLlib [9]. We developed an interface from Spark and MLlib to the standard variant format, the Variant Call Format (VCF), which opens up advanced, efficient machine learning algorithms to genomic data [10].

VariantSpark Software Flowchart
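To make the idea of an interface between VCF and a machine learning library more concrete, the sketch below converts VCF genotype calls into one numeric vector per individual, the kind of input a clustering algorithm expects. It is an illustrative assumption, not VariantSpark's actual code: the function names `genotype_to_dosage` and `vcf_to_sample_vectors` are hypothetical, the encoding (count of non-reference alleles) is one common choice among several, and plain Python is used in place of Spark to keep the example self-contained.

```python
def genotype_to_dosage(gt):
    """Count non-reference alleles in a VCF GT value such as '0/1' or '1|1'.

    Missing calls such as './.' are returned as None.
    """
    alleles = gt.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def vcf_to_sample_vectors(lines):
    """Turn VCF text lines into {sample name: [dosage per variant]}."""
    samples = []
    vectors = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the FORMAT column
            vectors = {s: [] for s in samples}
            continue
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[gt_index]
            vectors[sample].append(genotype_to_dosage(gt))
    return vectors


# Usage with a tiny made-up VCF: two variants, two individuals (S1, S2).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t1|1:12\t./.:0",
]
vectors = vcf_to_sample_vectors(example_vcf)
print(vectors)  # {'S1': [0, 2], 'S2': [1, None]}
```

Note the transposition: a VCF file stores one variant per row, whereas clustering individuals needs one vector per sample, so the conversion has to gather each sample's calls across all rows.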

[1] Bauer DC, Gaff C, Dinger ME, Caramins M, Buske FA, Fenech M, Hansen D, Cobiac L “Genomics and personalised whole-of-life healthcare” Trends Mol Med. 2014 May 4. pii: S1471-4914(14)00062-8 PMID: 24801560.
[2] Buske FA, French HJ, Smith MA, Clark SJ, Bauer DC “NGSANE: Lightweight Production Informatics Framework for High Throughput Data Analysis” Bioinformatics (2014) 30 (10): 1471-1472 Jan 26. PMID: 24470576.
[3] Fifita JA, Williams KL, McCann EP, O’Brien A, Bauer DC, Nicholson GA, and Blair IP “Mutation analysis of Matrin 3 in Australian familial amyotrophic lateral sclerosis” Neurobiology of Aging 2014 Nov 20. pii: S0197-4580(14)00726-X. PMID:25523636.
[4] Talseth-Palmer BA, Bauer DC, Sjursen W, Evans TJ, McPhillips M, Proietto A, Otton G, Spigelman AD and Scott RJ, “Targeted next-generation sequencing identifies two Lynch syndrome families and a polygenic interaction that may cause cancer development in a third Lynch syndrome family” submitted
[5] Zong Hong Zhang, Dhanisha J. Jhaveri, Vikki M. Marshall, Denis C. Bauer, Janette Edson, Ramesh K. Narayanan, Gregory J. Robinson, Andreas E. Lundberg, Perry F. Bartlett, Naomi R. Wray, Qiongyi Zhao. (2014) “A comparative study of techniques for differential expression analysis on RNA-Seq data.” PLoS One, 2014 Aug 13;9(8):e103207. PMID: 25119138
[6] Barry G, Guennewig B, Fung S, Kaczorowski D, Weickert CS. Long Non-Coding RNA Expression during Aging in the Human Subependymal Zone. Front Neurol. 2015 Mar 9;6:45. doi: 10.3389/fneur.2015.00045. eCollection 2015. PubMed PMID: 25806019; PubMed Central PMCID: PMC4353253.
[6] Bauer DC, McMorran BJ, Foote SJ and Burgio G “Genome-wide analysis of chemically induced mutations in mouse in phenotype-driven screens” submitted
[7] Szul, P., Arzhaeva, Y., Domanski, L., Lagerstrom, R., Bauer, D., O'Brien, A., Wang, D., Nepal, S., Zic, J., Taylor, J., Bednarz, T., “Platform for Big Data and Visual Analytics” eResearch Australasia 2014, Melbourne.
[8] O'Brien AR, Buske FA, Bauer DC “Scalable Clustering of Genotype Information using MapReduce”, InCoB 2014, Sydney, Australia
[9] O’Brien A, Bauer D “VariantSpark: Applying Spark-based machine learning methods to genomic information”, BigData 2015, Sydney
[10] O’Brien AR, Saunders N, Buske FA, Bauer DC, “Population Scale Clustering of Genotype Information” submitted