Genomic information is increasingly being used for medical research, giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks [1,2]. However, many genomic analyses do not fit the MapReduce paradigm. We therefore utilised the recently developed Spark engine, which is less rigid in its parallelisation architecture and hence more suitable for bioinformatics tasks, along with its associated machine learning library, MLlib [3]. We developed an interface from Spark and MLlib to the standard variant call format (VCF), which makes advanced, efficient machine learning algorithms available for genomic data [4].
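To illustrate the kind of Spark/MLlib interface to VCF data described above, the sketch below (in Scala, Spark's native language) reads a VCF file as text, converts the genotype calls of each variant site into an allele-dosage vector, and clusters the vectors with MLlib's KMeans. The input path, the dosage encoding, the cluster count, and the choice to cluster variant sites rather than individuals are illustrative assumptions and not the actual VariantSpark implementation.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Minimal sketch: VCF -> allele-dosage vectors -> MLlib KMeans.
object VcfKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("vcf-kmeans-sketch"))

    // Hypothetical input path; a VCF has 9 fixed columns followed by
    // one genotype column per sample.
    val vcf = sc.textFile("hdfs:///data/cohort.vcf")

    val vectors = vcf
      .filter(line => !line.startsWith("#"))   // skip header lines
      .map { line =>
        val fields = line.split("\t")
        val dosages = fields.drop(9).map { sample =>
          val gt = sample.split(":")(0)         // GT is the first FORMAT field
          if (gt.contains(".")) 0.0             // treat missing calls as 0 (simplification)
          else gt.split("[/|]").count(_ != "0").toDouble  // count non-reference alleles
        }
        Vectors.dense(dosages)                  // one feature vector per variant site
      }
      .cache()

    // Cluster the dosage vectors; k and the iteration count are illustrative.
    val model = KMeans.train(vectors, 3, 20)
    model.clusterCenters.foreach(println)

    sc.stop()
  }
}

Clustering individuals rather than variants, as in [2,4], would additionally require transposing the genotype matrix so that each sample becomes one feature vector; the same KMeans call then applies unchanged.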

[1] Szul P, Arzhaeva Y, Domanski L, Lagerstrom R, Bauer DC, O'Brien A, Wang D, Nepal S, Zic J, Taylor J, Bednarz T, "Platform for Big Data and Visual Analytics", eResearch Australasia 2014, Melbourne, Australia.
[2] O'Brien AR, Buske FA, Bauer DC, "Scalable Clustering of Genotype Information using MapReduce", InCoB 2014, Sydney, Australia.
[3] O'Brien A, Bauer DC, "VariantSpark: Applying Spark-based Machine Learning Methods to Genomic Information", BigData 2015, Sydney, Australia.
[4] O'Brien AR, Saunders N, Buske FA, Bauer DC, "Population Scale Clustering of Genotype Information", submitted.