Mar 21, 2026

Science

The pangenome

Reframing the Genome

My recent engagement with the concept of the pangenome has prompted a fundamental reassessment of assumptions I had long taken for granted in genomics. The notion of a single, authoritative human reference genome—treated as a stable coordinate system for biological interpretation—now appears less like a neutral scaffold and more like a historically contingent construct. While undeniably useful, it embodies a simplification that obscures the true extent of human genetic diversity. What is increasingly evident is that our analytical frameworks have been shaped as much by technical convenience as by biological reality, and that this mismatch is beginning to constrain both discovery and interpretation.

The Bias of the Reference Genome


The GRCh38 reference genome, often positioned as a universal standard, is in fact derived from a narrow sampling of individuals, with a disproportionate share of its sequence coming from a single donor. This introduces a structural bias that is not merely demographic but computational. Alignment-based methods implicitly assume that the reference contains the sequence space against which variation can be measured. When a sample carries sequence absent from the reference, particularly sequence prevalent in underrepresented populations, the system fails not gracefully but silently: the affected reads are left unmapped or forced onto the nearest similar region, and nothing in the output flags that anything is missing. The analogy that comes to mind is not simply an incomplete dictionary, but a linguistic model that lacks the grammar to even recognize certain expressions as valid. In this sense, absence from the reference is too often conflated with absence in biology.
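To make the silent failure concrete, here is a deliberately tiny sketch. It is not how a real aligner works: it assumes exact substring matching stands in for alignment, and the sequences and the `map_read` helper are invented for illustration. The point is only that a read drawn from sequence the reference lacks comes back as "no hit", with no indication that the sequence is real.

```python
from typing import Optional


def map_read(read: str, reference: str) -> Optional[int]:
    """Return the 0-based position of an exact match, or None if the read is unmapped.

    Exact matching is a stand-in for alignment here (an assumption for illustration).
    """
    pos = reference.find(read)
    return pos if pos != -1 else None


# Hypothetical toy sequences, invented for this example.
reference = "ACGTACGTTTGACCA"
sample = "ACGTACGT" + "GGGCCC" + "TTGACCA"  # same as the reference plus a 6-base insertion

reads = {
    "shared": "CGTACG",          # present in both the reference and the sample
    "insertion_only": "GGGCCC",  # present in the sample, absent from the reference
}

# Map every read against the linear reference only.
results = {name: map_read(seq, reference) for name, seq in reads.items()}

for name, pos in results.items():
    status = f"mapped at {pos}" if pos is not None else "unmapped"
    print(f"{name}: {status}")
```

The insertion read is perfectly valid biology in this toy sample, yet from the reference's point of view it simply does not exist.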

Unmapped Reads as a Signal


What I find particularly striking is the treatment of unmapped reads within standard sequencing pipelines. The process of fragmenting DNA, aligning reads, and cataloguing variation is well understood, yet the epistemological implications of what is discarded are less frequently interrogated. Reads that fail to align are routinely filtered out as noise or contamination, even though some of them represent structurally novel or population-specific sequences. From a systems perspective, this represents a profound loss of information. It suggests that our current pipelines are optimized not for completeness, but for compatibility with a predefined model. I would argue that unmapped data should be reinterpreted as a potential high-value signal, pointing to precisely those genomic regions that fall outside the canonical reference and therefore warrant closer scrutiny.
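As a rough sketch of what keeping that signal could look like in practice: in the SAM alignment format, bit 0x4 of the FLAG field marks a read as unmapped, so the reads a pipeline would normally drop can be set aside for inspection instead. The function name `split_by_mapping` and the example records below are my own, invented for illustration, and the parsing is plain text handling rather than any particular library's API.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def split_by_mapping(sam_lines):
    """Partition SAM alignment records into (mapped, unmapped) lists of read names."""
    mapped, unmapped = [], []
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            unmapped.append(name)
        else:
            mapped.append(name)
    return mapped, unmapped


# Hypothetical records, invented for this example.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t60\t6M\t*\t0\t0\tACGTAC\tIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGCCC\tIIIIII",
    "read3\t16\tchr1\t250\t60\t6M\t*\t0\t0\tTTGACC\tIIIIII",
]

mapped_names, unmapped_names = split_by_mapping(example_sam)
print("mapped:", mapped_names)
print("kept for follow-up:", unmapped_names)
```

Separating unmapped reads is only the first step; telling contamination apart from genuinely novel sequence would need further analysis that this sketch does not attempt.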

Toward a Graph-Based Genomic Paradigm


The pangenome, particularly in its graph-based representation, offers a conceptual and technical alternative that aligns more closely with biological complexity. Rather than imposing a linear coordinate system, it encodes genomic variation as a network of possible paths, allowing multiple sequences to coexist within a unified framework. The oft-cited “subway map” analogy is useful, but I think the deeper implication lies in how this model reshapes our analytical assumptions. Variation is no longer an exception to be catalogued relative to a norm, but an intrinsic feature of the system itself. From my perspective, the real innovation is not simply the inclusion of previously missing sequences, but the shift toward a representation that treats diversity as foundational rather than peripheral. This has far-reaching implications—not only for improving representation across populations, but for redefining how we detect, interpret, and ultimately understand genomic variation.
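A minimal sketch of the idea, assuming a toy model in which nodes hold sequence, edges connect nodes, and each genome is a named path through the graph. The class `SequenceGraph`, its methods, and the sequences are hypothetical and written for this post; real pangenome tools and formats are far richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    nodes: dict = field(default_factory=dict)  # node id -> DNA sequence
    edges: set = field(default_factory=set)    # (from_node, to_node) pairs
    paths: dict = field(default_factory=dict)  # path name -> ordered list of node ids

    def add_path(self, name, node_ids):
        """Register a genome as a walk through the graph, adding the edges it uses."""
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = list(node_ids)

    def spell(self, name):
        """Concatenate node sequences along a path to recover that genome's sequence."""
        return "".join(self.nodes[n] for n in self.paths[name])

    def paths_containing(self, query):
        """Return the names of all paths whose spelled sequence contains the query."""
        return sorted(name for name in self.paths if query in self.spell(name))


# Hypothetical toy data: one shared backbone and one optional insertion node.
graph = SequenceGraph(nodes={1: "ACGTACGT", 2: "GGGCCC", 3: "TTGACCA"})
graph.add_path("linear_reference", [1, 3])             # skips the insertion
graph.add_path("haplotype_with_insertion", [1, 2, 3])  # passes through it

print(graph.spell("linear_reference"))
print(graph.spell("haplotype_with_insertion"))
print(graph.paths_containing("GGGCCC"))
```

Neither path is privileged in this structure: the former reference is just one walk among several, which is the shift in perspective I find most compelling.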