Genomic information is increasingly being used in medical research, creating a need for efficient analysis methods that can cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and its associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks [1,2]. However, many genomic analyses do not fit the MapReduce paradigm. We therefore used the recently developed Spark engine, which is less rigid in its parallelisation architecture and hence more suitable for bioinformatics tasks, along with its associated machine learning library, MLlib. We developed an interface from Spark and MLlib to the standard variant call format (VCF), which opens up advanced, efficient machine learning algorithms to genomic data.
Szul, P., Arzhaeva, Y., Domanski, L., Lagerstrom, R., Bauer, D., O’Brien, A., Wang, D., Nepal, S., Zic, J., Taylor, J., Bednarz, T., “Platform for Big Data and Visual Analytics”, eResearch Australasia 2014, Melbourne.
O’Brien AR, Buske FA, Bauer DC, “Scalable Clustering of Genotype Information using MapReduce”, InCoB 2014, Sydney, Australia.
O’Brien A, Bauer D, “VariantSpark: Applying Spark-based machine learning methods to genomic information”, BigData 2015, Sydney.
O’Brien AR, Saunders N, Buske FA, Bauer DC, “Population Scale Clustering of Genotype Information”, submitted.