Machine learning for population-scale whole-genome data

Genomic information is increasingly used in medical research, creating a need for efficient analysis methods that can cope with thousands of individuals and millions of genomic variants. We developed VariantSpark, a machine learning framework for genomic data that uses the Apache Spark big-data engine to enable real-time analysis.

VariantSpark is designed for data that is both ‘big’ (many samples, n) and ‘wide’ (a large feature vector per sample, p). It was tested on datasets with n = 3,000 samples, each containing p = 80 million features, using either unsupervised clustering (e.g. k-means) or supervised learning where the target/truth values are categorical (classification) or continuous (regression). Although VariantSpark was originally developed for genomic variant data, where each feature takes a value in {0, 1, 2}, it can handle any feature-based dataset, e.g. methylation, transcription, and even non-biological applications (IoT, customer data,…).
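To make the {0, 1, 2} feature values concrete, the sketch below turns VCF-style diploid genotype strings into alternate-allele counts (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate). This is an illustration only, not VariantSpark's own loader: the function names `encode_genotype` and `encode_sample` and the toy samples are hypothetical, and missing calls are rejected rather than imputed.

```python
def encode_genotype(gt: str) -> int:
    """Count alternate alleles in a diploid genotype such as '0/1' or '1|1'.

    Hypothetical helper for illustration; not part of VariantSpark.
    """
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2:
        raise ValueError(f"expected a diploid genotype, got {gt!r}")
    if "." in alleles:
        raise ValueError(f"missing call in genotype {gt!r}")
    # Any allele index other than '0' counts as an alternate allele.
    return sum(1 for allele in alleles if allele != "0")


def encode_sample(genotypes):
    """Encode one sample's genotype calls as a feature vector of 0/1/2 values."""
    return [encode_genotype(gt) for gt in genotypes]


# Toy input: two samples, three variants each (assumed data).
samples = {
    "sample_a": ["0/0", "0/1", "1/1"],
    "sample_b": ["0|1", "0|0", "1|1"],
}

# One row per sample (n), one column per variant (p).
matrix = {name: encode_sample(gts) for name, gts in samples.items()}
print(matrix)
```

Each row of `matrix` is one sample and each column one variant, which is the ‘wide’ layout described above, only at toy scale.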


VariantSpark is available as a stand-alone Spark-based framework (GitHub) that can be run on an on-premises Spark cluster. We also provide instructions for how to spin up and run VariantSpark on an AWS Elastic MapReduce (EMR) cluster. VariantSpark can also be run as a notebook on Databricks through AWS or Azure.


  • Supervised Learning, specifically Random Forest, which allows
    • Feature selection: GWAS with multi-variate analysis for whole genome rare variants
    • Classification: Disease prediction from genomic profiles
  • Unsupervised Learning, specifically k-means clustering, which allows
    • Phenotype-genotype clustering
    • Patients-like-mine approaches using whole genome information
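As a rough illustration of the clustering use case listed above, the following sketch runs a plain k-means on a handful of 0/1/2 genotype vectors in pure Python. It is a toy, single-machine version under stated assumptions: the function names, the deterministic "first k points" initialisation, and the four-sample dataset are all hypothetical, and it does not reflect how VariantSpark distributes the computation on Spark.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Toy k-means: returns (cluster index per point, final centroids).

    Assumption: the first k points seed the centroids, which keeps the
    example deterministic; real implementations use smarter initialisation.
    """
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each sample to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Four samples, four variants each, encoded as alternate-allele counts.
genotypes = [
    [0, 0, 1, 0],
    [2, 2, 1, 2],
    [0, 1, 0, 0],
    [2, 1, 2, 2],
]

assignments, centroids = kmeans(genotypes, k=2)
print(assignments)
```

Samples that share a cluster index have similar genotype profiles, which is the basic idea behind a "patients-like-mine" lookup: find the cluster a new individual falls into and inspect its other members.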

Publications and Media

  • Publications:
    • Aidan R. O’Brien, Neil F. W. Saunders, Yi Guo, Fabian A. Buske, Rodney J. Scott and Denis C. Bauer. VariantSpark: population scale clustering of genotype information. BMC Genomics 2015, 16:1052.