Bachelor's thesis

The concept of bioinformatics represents the convergence of explosive growth in biotechnology and information technology [Boguski 1998].

In recent decades, based on these two technological revolutions, a new paradigm for genome representation and analysis has emerged: the gene-oriented approach.

Pangenomic content relies on homology relationships, grouping genes into families across multiple genomes. This approach has found clinical applications such as drug and vaccine target discovery [Serruto et al. 2009] and the analysis of pathogenic behavior in epidemics [Holt et al. 2008].

The goal of this thesis was to develop a scalable and parallel methodology for computing gene sequence homology within bacterial pangenomes.

The work is based on PanDelos, one of the most accurate tools for pangenome content discovery [Bonnici et al. 2018]. A new high-performance implementation was developed in C++, using alignment-free techniques based on k-mer frequency analysis and AI-based clustering.
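As a rough illustration of alignment-free comparison based on k-mer frequencies, the sketch below counts the overlapping k-mers of two sequences and scores them with a generalized Jaccard similarity (sum of minimum counts over sum of maximum counts). The function names `count_kmers` and `generalized_jaccard` are hypothetical and are not taken from the thesis code; the choice of similarity measure and the use of string-keyed hash maps are assumptions made for readability, not a description of the actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: count every overlapping k-mer of `seq`.
std::unordered_map<std::string, int> count_kmers(const std::string& seq, std::size_t k) {
    std::unordered_map<std::string, int> counts;
    if (k == 0 || seq.size() < k) return counts;  // no k-mers to count
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        ++counts[seq.substr(i, k)];
    }
    return counts;
}

// Hypothetical helper: generalized Jaccard similarity of two k-mer multisets,
// i.e. sum(min(count_a, count_b)) / sum(max(count_a, count_b)) over all k-mers.
double generalized_jaccard(const std::unordered_map<std::string, int>& a,
                           const std::unordered_map<std::string, int>& b) {
    long long num = 0;
    long long den = 0;
    for (const auto& [kmer, ca] : a) {
        auto it = b.find(kmer);
        const int cb = (it == b.end()) ? 0 : it->second;
        num += std::min(ca, cb);
        den += std::max(ca, cb);
    }
    // k-mers present only in b contribute to the denominator alone.
    for (const auto& [kmer, cb] : b) {
        if (a.find(kmer) == a.end()) den += cb;
    }
    return den == 0 ? 0.0 : static_cast<double>(num) / static_cast<double>(den);
}
```

Calling `generalized_jaccard(count_kmers(x, k), count_kmers(y, k))` gives a score in [0, 1] for a pair of sequences `x` and `y`; a clustering step would then group sequences whose pairwise scores are high enough. The code requires C++17 for structured bindings.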

The methodology maximizes computational efficiency by redesigning data structures for parallel processing and optimizing memory usage. Experimental tests confirm that the new approach significantly improves performance on large-scale datasets.
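One common way to make k-mer data structures more memory-friendly is to pack each k-mer into a machine integer instead of storing it as a string. The sketch below is only an example of that general idea and is not drawn from the thesis: it assumes nucleotide sequences over the alphabet ACGT (protein sequences would need more bits per symbol), a k of at most 32, and hypothetical function names `encode_base` and `pack_kmers`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: map a nucleotide to 2 bits; returns -1 for any other character.
int encode_base(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Hypothetical helper: pack every overlapping k-mer of `seq` (1 <= k <= 32)
// into a 64-bit integer with a rolling update. Windows that contain a
// non-ACGT character are skipped.
std::vector<std::uint64_t> pack_kmers(const std::string& seq, std::size_t k) {
    std::vector<std::uint64_t> out;
    if (k == 0 || k > 32 || seq.size() < k) return out;
    const std::uint64_t mask = (k == 32) ? ~0ULL : ((1ULL << (2 * k)) - 1);
    std::uint64_t code = 0;
    std::size_t valid = 0;  // consecutive valid bases ending at the current position
    for (char c : seq) {
        const int b = encode_base(c);
        if (b < 0) {  // invalid symbol: restart the window
            valid = 0;
            code = 0;
            continue;
        }
        code = ((code << 2) | static_cast<std::uint64_t>(b)) & mask;
        if (++valid >= k) out.push_back(code);
    }
    return out;
}
```

Under these assumptions each k-mer occupies 8 bytes in a contiguous vector, which can be sorted or split into independent ranges for worker threads without per-string allocations.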

In the best cases, the new implementation reduced memory usage by over 90% (e.g., 5.7 GB vs 60 GB) while maintaining result compatibility in over 90% of tested cases. Execution times also improved, with speedups of up to 3× observed on 96-core AWS instances.

When analyzing up to 128 synthetic genomes, the system produced distributions that matched the true pangenomic model generated by PANPROVA more closely, especially in the ratio of core to singleton genes.

This work demonstrates how low-level parallelism and memory-aware algorithm design can make pangenome analysis more scalable, even for computationally demanding species such as Escherichia coli.

Short description

Feb. 2022 - Oct. 2022
Git repository

High-performance C++ implementation for scalable pangenome analysis via k-mer-based gene clustering.

Resources