Kepler Community Spotlight: Accelerating Evolutionary Genomics with Subrata Mishra

Tracing evolutionary paths of resistance across bacterial lifestyles in large population genomic data: when is de novo evolution important?


A Case Study in AI-Accelerated Genomics:

Filtering high-quality, meaningful mutations from large population genomic datasets is a painstaking process — often taking weeks of manual curation and generating numerous intermediate files. From fine-tuning filtering parameters to pinpointing the specific mutations linked to observed phenotypes, each step demands time and precision.

With Kepler AI, this process has been transformed. What once took weeks now takes just minutes. Here’s a case study showcasing how Kepler accelerates and simplifies the analysis of complex genomic data — turning data overload into clear biological insight.

The Challenge:

Microbial populations display two common lifestyles: free-living (planktonic) cells and surface-attached or aggregated communities (biofilms). Biofilm formation is a stress response, and the planktonic-to-biofilm transition is triggered by stressors such as environmental nutrient limitation and fluctuating energy availability. While we know that both lifestyles can allow bacterial populations to follow different evolutionary trajectories, it is unclear whether the upregulation of stress response genes in biofilm populations imposes an "evolutionary reserve," i.e., a reduced opportunity for further adaptive mutations compared with planktonic populations exposed to the same fluctuating starvation stress. To test this scenario, we exposed three types of bacterial communities (purely planktonic cells, a mix of planktonic and biofilm cells, and purely biofilm-forming cells) to slowly fluctuating starvation. The evolved populations were then sequenced, generating large population genomic datasets. I have used two studies as references for this case study (Study 1) and (Study 2), to illustrate the types of comparative data analysis that can be achieved.

Working with large population genomics datasets poses several significant challenges. One of the foremost difficulties lies in selecting appropriate tools for variant calling — different software pipelines (e.g., GATK, FreeBayes, bcftools) vary in accuracy, computational demands, and compatibility, making the choice both technical and strategic. The workflow also generates a vast number of intermediate files, often consuming considerable storage space and complicating file management. Running multiple FASTQ files simultaneously, whether on local systems or high-performance computing (HPC) clusters, can be restricted by memory limits, job scheduling constraints, or software dependencies, further slowing progress. Additionally, visualizing and analyzing such massive datasets in R or other statistical tools is computationally intensive and time-consuming, often requiring multiple optimization steps. Finally, deriving biologically meaningful inferences from the data adds another layer of complexity, as distinguishing genuine evolutionary signals from noise demands rigorous statistical validation and careful interpretation.

Kepler’s 5-Phase Solution:

Phase 1: Quality control and generation of BAM files for the three types of bacterial communities

The workflow begins with nothing more than the input FASTQ files: Kepler recommends suitable tools for population genomic datasets and carries out the entire pipeline through to a list of variants. This eliminates both the storage of huge intermediate files and weeks of planning and execution.

Input: Raw FASTQ files

Steps: Quality control and alignment to the reference genome

Key output: BAM files for all three bacterial community types
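The FASTQ-to-BAM steps in this phase can be sketched as an ordered list of shell commands. The specific tools (fastp for trimming, bwa mem for alignment, samtools for sorting and indexing) and the file-naming scheme are illustrative assumptions, not a description of Kepler's internal pipeline:

```python
# Sketch of a FASTQ -> sorted BAM pipeline as a list of shell commands.
# Tool choices (fastp, bwa, samtools) are assumptions for illustration;
# Kepler selects and runs appropriate tools automatically.

def fastq_to_bam_commands(sample: str, reference: str) -> list[str]:
    """Return the commands that take paired FASTQ files to a sorted,
    indexed BAM for one sample (e.g. one bacterial community type)."""
    r1, r2 = f"{sample}_R1.fastq.gz", f"{sample}_R2.fastq.gz"
    trim1, trim2 = f"{sample}_R1.trim.fastq.gz", f"{sample}_R2.trim.fastq.gz"
    bam = f"{sample}.sorted.bam"
    return [
        # Quality control: adapter/quality trimming plus an HTML QC report
        f"fastp -i {r1} -I {r2} -o {trim1} -O {trim2} --html {sample}_qc.html",
        # Alignment to the reference genome, piped straight into sorting
        f"bwa mem {reference} {trim1} {trim2} | samtools sort -o {bam} -",
        # Index the BAM so downstream variant callers can random-access it
        f"samtools index {bam}",
    ]

for cmd in fastq_to_bam_commands("planktonic", "ref.fasta"):
    print(cmd)
```

Running the same command generator over the three community types yields one sorted, indexed BAM per population, which is exactly the input Phase 2 expects.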

Phase 2: VCF files for the three types of bacterial communities

Input: BAM files from Phase 1

Steps: Variant calling by multiple tools for large genomic datasets

Key output: VCF files for all three bacterial community types
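One reason to call variants with multiple tools (e.g., GATK, FreeBayes, bcftools) is that keeping only calls supported by more than one caller reduces false positives. A minimal sketch of that consensus logic, assuming each caller's VCF output has already been reduced to (position, ref, alt) tuples; the data below are hypothetical:

```python
from collections import Counter

def consensus_variants(calls_by_tool: dict, min_tools: int = 2) -> set:
    """Keep variants reported by at least `min_tools` of the callers."""
    counts = Counter(v for calls in calls_by_tool.values() for v in calls)
    return {v for v, n in counts.items() if n >= min_tools}

# Hypothetical calls from three callers on the same sample
calls = {
    "gatk":      {(1042, "A", "G"), (2210, "C", "T")},
    "freebayes": {(1042, "A", "G"), (5170, "G", "A")},
    "bcftools":  {(1042, "A", "G"), (2210, "C", "T"), (9001, "T", "C")},
}
print(sorted(consensus_variants(calls)))  # variants seen by >= 2 callers
```

Raising `min_tools` to 3 retains only variants every caller agrees on, trading sensitivity for precision.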

Phase 3: Narrowing down to high quality, relevant mutations

Usually, choosing filtering parameters and generating the subsequent files is perplexing and time-consuming. Kepler sorted and filtered thousands of mutations in the VCF files across multiple parameters in a matter of minutes.

Input: VCF files

Steps: Quality and read-depth filtering

Key output: Filtered VCF files for all three bacterial community types
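The kind of filtering applied here can be illustrated with a minimal pure-Python pass over a VCF. The QUAL and depth thresholds below (30 and 10) are placeholder values for illustration, not the parameters Kepler actually tunes:

```python
def filter_vcf_lines(lines, min_qual: float = 30.0, min_depth: int = 10):
    """Yield header lines plus variant records passing QUAL and DP filters.

    A VCF record's 6th column is QUAL; read depth is taken from the DP key
    in the 8th (INFO) column. Thresholds here are illustrative placeholders.
    """
    for line in lines:
        if line.startswith("#"):   # keep all header lines unchanged
            yield line
            continue
        fields = line.split("\t")
        qual = float(fields[5])
        info = dict(kv.split("=") for kv in fields[7].split(";") if "=" in kv)
        depth = int(info.get("DP", 0))
        if qual >= min_qual and depth >= min_depth:
            yield line

# Tiny hand-made VCF: one record passes, one fails QUAL, one fails depth
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1042\t.\tA\tG\t88.0\t.\tDP=35;AF=0.62",
    "chr1\t2210\t.\tC\tT\t12.0\t.\tDP=40;AF=0.10",
    "chr1\t5170\t.\tG\tA\t60.0\t.\tDP=4;AF=0.90",
]
kept = [l for l in filter_vcf_lines(vcf) if not l.startswith("#")]
print(kept)  # only the chr1:1042 record survives both filters
```

In practice one would combine several such criteria (strand bias, mapping quality, allele frequency) and apply them per population; the structure of the pass stays the same.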

Phase 4: Data analysis of the filtered VCF files

Kepler uses the filtered VCF files for extensive data analysis. From the number of mutations, allele frequencies, overlap among mutations, and the nature of the mutations and pathways affected, through to clustering of the three populations, Kepler provides an in-depth data analysis.
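As a small illustration of the kind of comparison performed in this phase, the sketch below summarises per-population mutation counts, mean allele frequency, and the mutations shared by all three community types. The position-to-frequency maps are hypothetical; real values would come from the filtered VCFs of Phase 3:

```python
from statistics import mean

# Hypothetical {position: allele frequency} maps for the three community
# types; real values would be parsed from the filtered VCF files.
populations = {
    "planktonic": {1042: 0.62, 2210: 0.15, 7731: 0.88},
    "mixed":      {1042: 0.40, 5170: 0.55, 7731: 0.91},
    "biofilm":    {1042: 0.12, 5170: 0.70},
}

# Mutation count and mean allele frequency per population
for name, muts in populations.items():
    print(f"{name}: {len(muts)} mutations, mean AF = {mean(muts.values()):.2f}")

# Mutations present in every population: candidates for parallel evolution
shared = set.intersection(*(set(m) for m in populations.values()))
print("shared across all populations:", sorted(shared))
```

Mutations recurring across independently evolved populations are the natural starting point for the pathway-level and clustering analyses mentioned above.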

Phase 5: Significance and interpretation of mutation signatures

Finally, the AI integrates all of the results to draw biologically meaningful inferences and conclusions for the study.

Key Achievements

This case study demonstrates several breakthrough achievements:

  • Speed: Complete workflow in minutes versus traditional weeks-long timelines
  • Scientific Insight: Fundamental understanding of bacterial resistance and evolutionary trajectories
  • End-to-End Automation: From raw data directly to final data files
  • Full Transparency: Every step auditable through Kepler's replay functionality

Connect the dots with Kepler

Try out Kepler today, or book a call with us about your organization's use case
