Chromosomal instability and its impact on selection during cancer development

Mathematical and computational methods to extract the occurrence rates and selection coefficients of chromosomal instability from DNA-sequencing data

Cancer evolution has been studied extensively as a process of accumulating point mutations, which affect one or a few nucleotides at specific locations in the genome. However, some cancers are known to be driven mostly by copy number aberrations (CNAs), which can be inferred by examining the total and allele-specific copy numbers across the cancer genome. Chromosomal instability (CIN) is said to occur when a tumor exhibits a high number of CNAs.

Recent developments in single-cell DNA-sequencing have provided an unparalleled view into the diversity and ongoing evolution within a tumor. Direct Library Preparation+ (DLP+), developed by the Sohrab Shah Lab, is capable of producing genomic information for tens of thousands of cells, without bias due to amplification or non-uniform coverage (Laks et al., 2019). Using DLP+, researchers have uncovered varying levels of CIN in different cancers, both between patients with the same tumor characteristics and across different cancer types.

DLP+ (Laks et al., 2019) provides copy number (CN) information at the cell level. Left: Overview of DLP+ experimental and computational pipeline. Right: Total copy numbers (top) and minor allele fractions (bottom) across the genome in one cell from an ovarian tumor.

Grouping cells with similar copy numbers into clones and tracking them over time with DLP+ provides a picture of the competition between clones. In some experiments, certain clones expanded over time while others became extinct. This indicates that selection plays an important role during tumor growth, and that certain CNAs are favored over others as cancer progresses (Salehi et al., 2021).

Evolution of clones with distinct copy numbers from a triple-negative breast tumor over time (Salehi et al., 2021). Left: Phylogeny of cells based on their copy number profiles. Cells with similar copy numbers are grouped into a clone. Right: Temporal trajectory of clonal fractions in the cell population (= number of cells in the clone, divided by number of cells in total).
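The clonal fraction defined in the caption above can be computed directly from per-cell clone labels. The following is a minimal sketch in Python; the clone labels, time points and the function name `clonal_fractions` are hypothetical and are not taken from the DLP+ data or pipeline.

```python
from collections import Counter

def clonal_fractions(clone_labels):
    """Return {clone: fraction of cells} for the cells sampled at one time point."""
    counts = Counter(clone_labels)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}

# Hypothetical clone assignments for cells sampled at two time points.
samples = {
    "t1": ["A", "A", "B", "B", "C"],
    "t2": ["A", "A", "A", "A", "B"],
}

# Temporal trajectory: one dictionary of clonal fractions per time point.
trajectory = {t: clonal_fractions(cells) for t, cells in samples.items()}
```

In this made-up example, clone A expands between the two time points while clone C is no longer sampled at the second one, which is the kind of pattern read as expansion and extinction in the figure.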

Furthermore, applications of DLP+ have led to the realization that knock-outs of certain important genes, such as TP53 and BRCA1/2 in the mammary epithelium, are associated with a significant increase in CIN (Funnell et al., 2022). This manifests as a higher number of polyploid cells (resulting from whole-genome duplications) and more chromosome missegregations. These results help explain the high numbers of CNAs that have been observed in ovarian and breast cancers.

Copy number changes associated with key genes being turned off in mammary epithelial cell lines (Funnell et al., 2022). Top: Total and allele-specific copy numbers in cells with TP53 knocked out (left) and both TP53 and BRCA1 knocked out (right). Bottom: Statistics of cell populations, depending on whether TP53, BRCA1 or BRCA2 are knocked out.

We seek to systematically study how CIN arises and affects the selection landscape during tumor growth. This requires a mathematical framework that incorporates the different CNA mechanisms that occur during CIN. To this end, we developed CINner, a simulation algorithm for cell populations undergoing CIN (Dinh et al., 2025). CINner, available as an R package on GitHub, is designed to efficiently model a cell population undergoing different types of CNAs and mutations, which change the cell karyotypes and increase the clonal diversity.

Mathematical model underlying CINner.

CINner currently supports different CNA mechanisms, including whole-genome duplications, whole-chromosome and chromosome-arm missegregations, focal amplifications and deletions. We included three different selection models, designed to quantify the fitness of chromosome-arm level CNAs or driver mutations, or both.

Schematics for the selection model of chromosome arms (left) and driver mutations (right).

When applied to whole genome sequencing data across all cancers in The Cancer Genome Atlas (TCGA), CINner inferred selection parameters for individual chromosome arms. These selection parameters strongly correlate with the gene imbalance on each arm from Davoli et al., indicating that selection rates inferred from CINner are estimates for the combined effects of genes located on different genomic regions.

Top: Schematic for the inference and analysis of cancer type-specific chromosome-arm selection parameters. Bottom: Comparison of selection rates inferred by CINner (x axis) from pan-cancer TCGA data, against gene balance scores (y axis).

The same inference routine can be applied to individual cancer types, for which CINner finds selection parameters that faithfully recreate the copy number landscapes observed in data. These parameters are inferred using samples without whole-genome duplication (WGD), from which chromosome arms can be classified as GAIN (if the selection rate > 1, hence gains of these arms are advantageous) or LOSS (if the selection rate < 1, in which case losses increase cell fitness). The count and mean selection rate across these inferred GAIN and LOSS arms predict cancer-specific WGD prevalence well. This confirms that WGD is an important event that reshapes the selection landscape during cancer development, as has been observed experimentally. It also demonstrates CINner’s ability to uncover biologically relevant parameters from sample cohorts of relatively small sizes.
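The GAIN/LOSS classification and the per-class summaries described above can be sketched as follows. The selection rates in this example are made-up placeholders, not values inferred by CINner, and the function name `classify_arms` is hypothetical. Arms with a rate of exactly 1 are left unclassified here, which is an assumption of this sketch.

```python
from statistics import mean

def classify_arms(selection_rates):
    """Split chromosome arms into GAIN (rate > 1) and LOSS (rate < 1)."""
    gain = {arm: s for arm, s in selection_rates.items() if s > 1}
    loss = {arm: s for arm, s in selection_rates.items() if s < 1}
    return gain, loss

# Hypothetical selection rates per chromosome arm (illustrative values only).
rates = {"1q": 1.12, "8q": 1.08, "8p": 0.93, "17p": 0.90, "2p": 1.0}

gain, loss = classify_arms(rates)

# Count and mean selection rate of each class, the summaries that the text
# relates to cancer-specific WGD prevalence.
summary = {
    "n_gain": len(gain),
    "mean_gain": mean(gain.values()),
    "n_loss": len(loss),
    "mean_loss": mean(loss.values()),
}
```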

Top: Comparison between gain/loss frequencies from data and CINner with inferred selection rates, for ovary adenocarcinoma. Bottom: Correlation between cancer-specific WGD proportion and count of GAIN arms (left) and mean selection rate of LOSS arms (right).

With CINner, we now have a framework to uncover selection parameters for individual genomic regions. However, we are also interested in the rates at which CNA mechanisms, such as chromosome missegregations, occur. This requires more information than bulk genomic data can offer, because a CNA that is frequently observed in data can be explained by a spectrum of parameters in CINner. At one extreme, the CNA occurrence rate could be high, so that the event takes place often across different patients even if it does not significantly impact cancer fitness. At the other extreme, the selection rate associated with the CNA could be high, in which case the event takes a long time to emerge but, once it does, always expands across the entire tumor. The occurrence rates and selection parameters are therefore confounded in bulk data alone.
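This confounding can be illustrated with a toy calculation. The sketch below is not CINner: it assumes a hypothetical deterministic model in which, in each generation, a fraction `p` of non-carrier cells acquires the CNA and carriers are then reweighted by a relative fitness `s`. All names and parameter values are illustrative assumptions. It shows that a neutral CNA with a high occurrence rate and a selected CNA with a lower occurrence rate can produce the same frequency at a single observation time.

```python
def final_frequency(p, s, generations=50):
    """Frequency of CNA carriers after `generations` steps of the toy model."""
    x = 0.0
    for _ in range(generations):
        x = x + p * (1.0 - x)              # new occurrences among non-carriers
        x = s * x / (s * x + (1.0 - x))    # selection with relative fitness s
    return x

def matching_rate(target, s, upper, generations=50, steps=60):
    """Bisect for the occurrence rate in [0, upper] that reproduces `target`."""
    lo, hi = 0.0, upper
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if final_frequency(mid, s, generations) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Scenario 1: neutral CNA (s = 1) with a high occurrence rate.
p_neutral = 0.01
freq_neutral = final_frequency(p_neutral, s=1.0)

# Scenario 2: advantageous CNA (s = 1.2); find the lower occurrence rate
# that yields the same final frequency.
s_selected = 1.2
p_selected = matching_rate(freq_neutral, s_selected, upper=p_neutral)
freq_selected = final_frequency(p_selected, s_selected)
```

A single snapshot of the frequency cannot distinguish the pairs (`p_neutral`, 1.0) and (`p_selected`, 1.2), which is why additional information, such as single-cell data, is needed to separate the two kinds of parameters.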

In Xiang et al. (2024), we developed a parameter inference routine that employs both bulk data and information from single-cell (sc) sequencing, such as DLP+. The algorithm uses ABC-rf, an Approximate Bayesian Computation (ABC) method based on random forests that produces reliable posterior distributions with high tolerance for noise. The ABC framework summarizes both data and model simulations with statistics; we considered a wide range of measurements based on bulk CN, single-cell CN profiles, and the phylogeny of cells in the single-cell samples.

Top: The inference method relies on ABC-rf to find parameter posterior distributions. Bottom: Overview of some statistics based on single-cell phylogeny (left) and distance between observed and simulated CN profiles (right).

Using this framework, we are able to recover both the missegregation probability and the selection parameters for individual chromosomes. We verified this with synthetic tests, in which the true parameter values are known and can be compared against the posterior distributions. The posterior distributions are unimodal, which indicates identifiability, and center around the ground truth values.

Inference of missegregation probability (top left) and selection parameters for individual chromosomes in synthetic testing. For each parameter, the posterior distribution (dark blue; broken lines indicate mean, median and mode) is inferred from a uniform prior distribution (light blue) and compared to the ground truth value (black line). The inference is accurate if the posterior distribution centers around the ground truth value.

Importantly, the error and uncertainty of the results (quantified as the root mean square error (RMSE) and the standard deviation of the posteriors, respectively) do not increase significantly for small sample sizes of either single-cell or bulk cohorts. Minimum error can be achieved with as few as 40 sc and 50 bulk samples, while minimum uncertainty is reached with 20 sc and 30 bulk samples. These requirements are already satisfied by currently available data for some cancer types. We expect that as DNA sequencing becomes an essential part of cancer diagnosis, our simulation and inference framework will become a pivotal tool for analyzing the data toward a better understanding of cancer evolution and adaptation.

RMSE and standard deviation of the posterior distributions, as a function of sample size of the single-cell cohort (left) and the bulk data (right).
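One common way to compute these two quantities from posterior draws is sketched below. It assumes that the RMSE is taken between the posterior draws and the ground truth value, which may differ in detail from the definition used in the paper. The draws and the ground truth value are made up for illustration.

```python
from math import sqrt
from statistics import fmean, pstdev

def posterior_rmse(draws, truth):
    """Root mean square error of posterior draws relative to the true value."""
    return sqrt(fmean([(x - truth) ** 2 for x in draws]))

# Hypothetical posterior draws for one parameter and its ground truth value.
draws = [0.9, 1.0, 1.1, 1.2]
truth = 1.0

rmse = posterior_rmse(draws, truth)   # accuracy: distance from the truth
sd = pstdev(draws)                    # uncertainty: spread of the posterior
bias = fmean(draws) - truth           # offset of the posterior mean
```

With these definitions the squared RMSE decomposes into squared bias plus posterior variance, so a posterior can have a small standard deviation and still a large RMSE if it is centered away from the ground truth.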

ABC-rf’s insensitivity to noise is essential for incorporating a large number of statistics in our inference method. It therefore offers a substantial advantage over traditional ABC methods, which require selecting hyperparameters that strongly affect the results and are therefore not flexible enough for problems and models as expansive as ours. However, ABC-rf often requires a large training set and therefore a long running time. To improve its performance, we developed Approximate Bayesian Computation sequential Monte Carlo with (distributional) random forests (ABC-SMC-RF) (Dinh et al., 2024).

A tree in the random forest grows from a root node, composed of a subsample or bootstrap sample of the training set. The algorithm then repeatedly divides each node, each time by choosing a particular statistic and segregating the node’s simulations into those whose values are lower or higher than a threshold. The choice of statistic and threshold at each node is made such that the simulations in each child node are most “similar” to each other, with respect to the given statistic. Finally, the algorithm traverses each tree with the statistics measured from the data. Simulations in the same leaf that the data ends up in are deemed closer to the data, hence the parameters of those simulations form an approximation for the posterior distribution.
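The tree construction and traversal described above can be sketched in plain Python. This is an illustrative toy, not the ABC-rf or ABC-SMC-RF implementation: the model (`simulate`), the prior range and all function names are assumptions, and similarity within a child node is measured here by the variance of the parameter values, the usual regression-tree criterion.

```python
import random

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(sims, min_leaf):
    """Choose the (statistic index, threshold) whose two children have the
    lowest total parameter variance; return None if no valid split exists."""
    best = None
    for j in range(len(sims[0][0])):
        values = sorted({stats[j] for stats, _ in sims})
        for low, high in zip(values, values[1:]):
            threshold = 0.5 * (low + high)
            left = [theta for stats, theta in sims if stats[j] <= threshold]
            right = [theta for stats, theta in sims if stats[j] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = len(left) * variance(left) + len(right) * variance(right)
            if best is None or cost < best[0]:
                best = (cost, j, threshold)
    return None if best is None else best[1:]

def grow_tree(sims, min_leaf=5):
    """Recursively divide the simulations until no valid split remains."""
    split = best_split(sims, min_leaf)
    if split is None:
        return {"leaf": [theta for _, theta in sims]}
    j, threshold = split
    return {
        "stat": j,
        "threshold": threshold,
        "left": grow_tree([s for s in sims if s[0][j] <= threshold], min_leaf),
        "right": grow_tree([s for s in sims if s[0][j] > threshold], min_leaf),
    }

def leaf_parameters(tree, observed_stats):
    """Traverse the tree with the data's statistics; return the leaf's parameters."""
    while "leaf" not in tree:
        go_left = observed_stats[tree["stat"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

rng = random.Random(0)

def simulate(theta):
    # Toy model: one informative statistic and one pure-noise statistic.
    return (theta + rng.gauss(0.0, 0.05), rng.gauss(0.0, 1.0))

# Training set: parameters drawn from a uniform prior, each with its statistics.
training = []
for _ in range(200):
    theta = rng.uniform(0.0, 10.0)
    training.append((simulate(theta), theta))

observed = (3.0, 0.0)  # statistics of the "data" (parameter value near 3)
posterior_sample = []
for _ in range(10):  # a small forest: each tree grows from a bootstrap sample
    bootstrap = rng.choices(training, k=len(training))
    posterior_sample.extend(leaf_parameters(grow_tree(bootstrap), observed))
posterior_mean = sum(posterior_sample) / len(posterior_sample)
```

Pooling the leaf parameters across trees gives the approximate posterior sample. Because splits on the noise statistic rarely reduce the parameter variance, the trees mostly ignore it, which illustrates the tolerance to uninformative statistics mentioned earlier.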

ABC-SMC-RF incorporates the random forest within the framework of sequential Monte Carlo (SMC), and it is available as an R package on GitHub. In SMC, the posterior distribution is improved through successive iterations, starting from the prior distribution. Each iteration in ABC-SMC-RF samples the parameters from the previous iteration’s posterior, perturbs the parameters to maintain parameter diversity, then infers the next posterior distribution with a random forest.
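The iteration structure (sample, perturb, simulate, infer) can be sketched as a loop. To keep the example short, the random forest step is replaced by a simple stand-in that keeps the simulations whose statistic lies closest to the data, so this is not ABC-SMC-RF itself. The toy model, the prior range, the perturbation kernel and all names are assumptions of the sketch, and without the weighting of a full method the spread of the final sample should not be read as a calibrated posterior.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(1)
PRIOR = (0.0, 10.0)  # assumed uniform prior range

def simulate(theta):
    # Toy model: a single noisy statistic centered on the parameter.
    return theta + rng.gauss(0.0, 0.1)

def infer_posterior(params, stats, observed, keep):
    """Stand-in for the random forest: keep the parameters whose simulated
    statistic is closest to the observed one."""
    ranked = sorted(zip(stats, params), key=lambda pair: abs(pair[0] - observed))
    return [theta for _, theta in ranked[:keep]]

def perturb(theta, scale):
    """Gaussian perturbation, redrawn until it falls inside the prior range."""
    while True:
        candidate = theta + rng.gauss(0.0, scale)
        if PRIOR[0] <= candidate <= PRIOR[1]:
            return candidate

def abc_smc(observed, iterations=4, n_sims=200, keep=50):
    params = [rng.uniform(*PRIOR) for _ in range(n_sims)]  # first iteration: prior
    history = []
    for _ in range(iterations):
        stats = [simulate(theta) for theta in params]
        posterior = infer_posterior(params, stats, observed, keep)
        history.append(posterior)
        # Next iteration: sample from the current posterior, then perturb.
        scale = pstdev(posterior)
        params = [perturb(rng.choice(posterior), scale) for _ in range(n_sims)]
    return history

history = abc_smc(observed=3.0)
final_mean = fmean(history[-1])
```

Each successive entry of `history` concentrates on a narrower parameter region, so later iterations spend their simulations where they are most relevant to the data.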

Left: Schematic for the combination of ABC-SMC-RF and CINner toward inferring parameter posterior distributions. Middle: Schematic for constructing a tree in the random forest. Right: Schematic for inferring a parameter estimate from the tree.

In test studies, ABC-SMC-RF performs well across a wide range of mathematical models, including both deterministic and stochastic models with varying complexity levels and parameter counts. Its results are on par with traditional ABC methods with carefully calibrated hyperparameters. Moreover, ABC-SMC-RF tends to perform better than previous random forest methods, which are also calibration-free. This is because in those algorithms, most simulations with parameters sampled directly from the prior distribution are significantly different from the data, so a large training set is required to select the most relevant parameter regions. By comparison, ABC-SMC-RF iteratively updates the parameter distributions, so more of the simulations are relevant to the data. Another practical advantage is that the random forest in each iteration can be trained on smaller training sets, which reduces computational runtime and storage requirements.

Comparison of posterior distributions from ABC-SMC-RF (red) with a previous ABC-RF method (blue) and ABC-SMC with calibrated hyperparameters (yellow) for the Lotka-Volterra model.

References

2025

CINner: modeling and simulation of chromosomal instability in cancer at single-cell resolution

Khanh N. Dinh, Ignacio Vázquez-García, Andrew Chan, and 4 more authors

PLoS Computational Biology, 2025

Cancer development is characterized by chromosomal instability, manifesting in frequent occurrences of different genomic alteration mechanisms ranging in extent and impact. Mathematical modeling can help evaluate the role of each mutational process during tumor progression, however existing frameworks can only capture certain aspects of chromosomal instability (CIN). We present CINner, a mathematical framework for modeling genomic diversity and selection during tumor evolution. The main advantage of CINner is its flexibility to incorporate many genomic events that directly impact cellular fitness, from driver gene mutations to copy number alterations (CNAs), including focal amplifications and deletions, missegregations and whole-genome duplication (WGD). We apply CINner to find chromosome-arm selection parameters that drive tumorigenesis in the absence of WGD in chromosomally unstable cancer types from the Pan-Cancer Analysis of Whole Genomes (PCAWG, n=718). We found that the selection parameters predict WGD prevalence among different chromosomally unstable tumors, hinting that the selective advantage of WGD cells hinges on their tolerance for aneuploidy and escape from nullisomy. Analysis of inference results using CINner across cancer types in The Cancer Genome Atlas (n=8207) further reveals that the inferred selection parameters reflect the bias between tumor suppressor genes and oncogenes on specific genomic regions. Direct application of CINner to model the WGD proportion and fraction of genome altered (FGA) in PCAWG uncovers the increase in CNA probabilities associated with WGD in each cancer type. CINner can also be utilized to study chromosomally stable cancer types, by applying a selection model based on driver gene mutations and focal amplifications or deletions (chronic lymphocytic leukemia in PCAWG, n=95). 
Finally, we used CINner to analyze the impact of CNA probabilities, chromosome selection parameters, tumor growth dynamics and population size on cancer fitness and heterogeneity. We expect that CINner will provide a powerful modeling tool for the oncology community to quantify the impact of newly uncovered genomic alteration mechanisms on shaping tumor progression and adaptation.
@article{dinh2024cinner, dimensions = {true}, title = {CINner: modeling and simulation of chromosomal instability in cancer at single-cell resolution}, author = {Dinh, Khanh N. and V{\'a}zquez-Garc{\'\i}a, Ignacio and Chan, Andrew and Malhotra, Rhea and Weiner, Adam and McPherson, Andrew W. and Tavar{\'e}, Simon}, journal = {PLoS Computational Biology}, year = {2025}, doi = {10.1371/journal.pcbi.1012902}, }

2024

Inference of chromosome selection parameters and missegregation rate in cancer from DNA-sequencing data

Zijin Xiang, Zhihan Liu, and Khanh N. Dinh

Scientific Reports, 2024

Aneuploidy is frequently observed in cancers and has been linked to poor patient outcome. Analysis of aneuploidy in DNA-sequencing (DNA-seq) data necessitates untangling the effects of the Copy Number Aberration (CNA) occurrence rates and the selection coefficients that act upon the resulting karyotypes. We introduce a parameter inference algorithm that takes advantage of both bulk and single-cell DNA-seq cohorts. The method is based on Approximate Bayesian Computation (ABC) and utilizes CINner, our recently introduced simulation algorithm of chromosomal instability in cancer. We examine three groups of statistics to summarize the data in the ABC routine: (A) Copy Number-based measures, (B) phylogeny tip statistics, and (C) phylogeny balance indices. Using these statistics, our method can recover both the CNA probabilities and selection parameters from ground truth data, and performs well even for data cohorts of relatively small sizes. We find that only statistics in groups A and C are well-suited for identifying CNA probabilities, and only group A carries the signals for estimating selection parameters. Moreover, the low number of CNA events at large scale compared to cell counts in single-cell samples means that statistics in group B cannot be estimated accurately using phylogeny reconstruction algorithms at the chromosome level. As data from both bulk and single-cell DNA-sequencing techniques becomes increasingly available, our inference framework promises to facilitate the analysis of distinct cancer types, differentiation between selection and neutral drift, and prediction of cancer clonal dynamics.
@article{xiang2024inference, dimensions = {true}, title = {Inference of chromosome selection parameters and missegregation rate in cancer from DNA-sequencing data}, author = {Xiang, Zijin and Liu, Zhihan and Dinh, Khanh N.}, journal = {Scientific Reports}, volume = {14}, number = {1}, pages = {17699}, year = {2024}, publisher = {Nature Publishing Group UK London}, doi = {10.1038/s41598}, }

Approximate Bayesian Computation sequential Monte Carlo via random forests

Khanh N. Dinh, Zijin Xiang, Zhihan Liu, and 1 more author

arXiv, 2024

Approximate Bayesian Computation (ABC) is a popular inference method when likelihoods are hard to come by. Practical bottlenecks of ABC applications include selecting statistics that summarize the data without losing too much information or introducing uncertainty, and choosing distance functions and tolerance thresholds that balance accuracy and computational efficiency. Recent studies have shown that ABC methods using random forest (RF) methodology perform well while circumventing many of ABC’s drawbacks. However, RF construction is computationally expensive for large numbers of trees and model simulations, and there can be high uncertainty in the posterior if the prior distribution is uninformative. Here we adapt distributional random forests to the ABC setting, and introduce Approximate Bayesian Computation sequential Monte Carlo with random forests (ABC-SMC-(D)RF). This updates the prior distribution iteratively to focus on the most likely regions in the parameter space. We show that ABC-SMC-(D)RF can accurately infer posterior distributions for a wide range of deterministic and stochastic models in different scientific areas.
@article{dinh2024approximate, dimensions = {true}, title = {Approximate Bayesian Computation sequential Monte Carlo via random forests}, author = {Dinh, Khanh N. and Xiang, Zijin and Liu, Zhihan and Tavar{\'e}, Simon}, journal = {arXiv}, year = {2024}, doi = {10.48550/arXiv.2406.15865}, }

2022

Single-cell genomic variation induced by mutational processes in cancer

Tyler Funnell, Ciara H O’Flanagan, Marc J Williams, and 51 more authors

Nature, 2022

How cell-to-cell copy number alterations that underpin genomic instability in human cancers drive genomic and phenotypic variation, and consequently the evolution of cancer, remains understudied. Here, by applying scaled single-cell whole-genome sequencing to wild-type, TP53-deficient and TP53-deficient; BRCA1-deficient or TP53-deficient; BRCA2-deficient mammary epithelial cells (13,818 genomes), and to primary triple-negative breast cancer (TNBC) and high-grade serous ovarian cancer (HGSC) cells (22,057 genomes), we identify three distinct ’foreground’ mutational patterns that are defined by cell-to-cell structural variation. Cell- and clone-specific high-level amplifications, parallel haplotype-specific copy number alterations and copy number segment length variation (serrate structural variations) had measurable phenotypic and evolutionary consequences. In TNBC and HGSC, clone-specific high-level amplifications in known oncogenes were highly prevalent in tumours bearing fold-back inversions, relative to tumours with homologous recombination deficiency, and were associated with increased clone-to-clone phenotypic variation. Parallel haplotype-specific alterations were also commonly observed, leading to phylogenetic evolutionary diversity and clone-specific mono-allelic expression. Serrate variants were increased in tumours with fold-back inversions and were highly correlated with increased genomic diversity of cellular populations. Together, our findings show that cell-to-cell structural variation contributes to the origins of phenotypic and evolutionary diversity in TNBC and HGSC, and provide insight into the genomic and mutational states of individual cancer cells.
@article{funnell2022single, dimensions = {true}, title = {Single-cell genomic variation induced by mutational processes in cancer}, author = {Funnell, Tyler and O'Flanagan, Ciara H and Williams, Marc J and McPherson, Andrew and McKinney, Steven and Kabeer, Farhia and Lee, Hakwoo and Salehi, Sohrab and V{\'a}zquez-Garc{\'\i}a, Ignacio and Shi, Hongyu and Leventhal, Emily and Masud, Tehmina and Eirew, Peter and Yap, Damian and Zhang, Allen W. and Lim, Jamie L. P. and Wang, Beixi and Brimhall, Jazmine and Biele, Justina and Ting, Jerome and Au, Vinci and Van Vliet, Michael and Liu, Yi Fei and Beatty, Sean and Lai, Daniel and Pham, Jenifer and Grewal, Diljot and Abrams, Douglas and Havasov, Eliyahu and Leung, Samantha and Bojilova, Viktoria and Moore, Richard A. and Rusk, Nicole and Uhlitz, Florian and Ceglia, Nicholas and Weiner, Adam C. and Zaikova, Elena and Douglas, J. Maxwell and Zamarin, Dmitriy and Weigelt, Britta and Kim, Sarah H. and Da Cruz Paula, Arnaud and Reis-Filho, Jorge S. and Martin, Spencer D. and Li, Yangguang and Xu, Hong and de Algara, Teresa Ruiz and Lee, So Ra and Llanos, Viviana Cerda and Huntsman, David G. and McAlpine, Jessica N. and Consortium, IMAXT and Shah, Sohrab P. and Aparicio, Samuel}, journal = {Nature}, volume = {612}, number = {7938}, pages = {106--115}, year = {2022}, publisher = {Nature Publishing Group UK London}, doi = {10.1038/s41586-022-05249-0}, }

2021

Clonal fitness inferred from time-series modelling of single-cell cancer genomes

Sohrab Salehi, Farhia Kabeer, Nicholas Ceglia, and 31 more authors

Nature, 2021

Progress in defining genomic fitness landscapes in cancer, especially those defined by copy number alterations (CNAs), has been impeded by lack of time-series single-cell sampling of polyclonal populations and temporal statistical models. Here we generated 42,000 genomes from multi-year time-series single-cell whole-genome sequencing of breast epithelium and primary triple-negative breast cancer (TNBC) patient-derived xenografts (PDXs), revealing the nature of CNA-defined clonal fitness dynamics induced by TP53 mutation and cisplatin chemotherapy. Using a new Wright-Fisher population genetics model to infer clonal fitness, we found that TP53 mutation alters the fitness landscape, reproducibly distributing fitness over a larger number of clones associated with distinct CNAs. Furthermore, in TNBC PDX models with mutated TP53, inferred fitness coefficients from CNA-based genotypes accurately forecast experimentally enforced clonal competition dynamics. Drug treatment in three long-term serially passaged TNBC PDXs resulted in cisplatin-resistant clones emerging from low-fitness phylogenetic lineages in the untreated setting. Conversely, high-fitness clones from treatment-naive controls were eradicated, signalling an inversion of the fitness landscape. Finally, upon release of drug, selection pressure dynamics were reversed, indicating a fitness cost of treatment resistance. Together, our findings define clonal fitness linked to both CNA and therapeutic resistance in polyclonal tumours.
@article{salehi2021clonal, dimensions = {true}, title = {Clonal fitness inferred from time-series modelling of single-cell cancer genomes}, author = {Salehi, Sohrab and Kabeer, Farhia and Ceglia, Nicholas and Andronescu, Mirela and Williams, Marc J and Campbell, Kieran R and Masud, Tehmina and Wang, Beixi and Biele, Justina and Brimhall, Jazmine and Gee, David and Lee, Hakwoo and Ting, Jerome and Zhang, Allen W. and Tran, Hoa and O’Flanagan, Ciara and Dorri, Fatemeh and Rusk, Nicole and de Algara, Teresa Ruiz and Lee, So Ra and Cheng, Brian Yu Chieh and Eirew, Peter and Kono, Takako and Pham, Jenifer and Grewal, Diljot and Lai, Daniel and Moore, Richard and Mungall, Andrew J. and Marra, Marco A. and Consortium, IMAXT and McPherson, Andrew and Bouchard-C{\o}t{\'e}, Alexandre and Aparicio, Samuel and Shah, Sohrab P.}, journal = {Nature}, volume = {595}, number = {7868}, pages = {585--590}, year = {2021}, publisher = {Nature Publishing Group UK London}, doi = {10.1038/s41586-021-03648-3}, }

2019

Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing

Emma Laks, Andrew McPherson, Hans Zahn, and 48 more authors

Cell, 2019

Accurate measurement of clonal genotypes, mutational processes, and replication states from individual tumor-cell genomes will facilitate improved understanding of tumor evolution. We have developed DLP+, a scalable single-cell whole-genome sequencing platform implemented using commodity instruments, image-based object recognition, and open source computational methods. Using DLP+, we have generated a resource of 51,926 single-cell genomes and matched cell images from diverse cell types including cell lines, xenografts, and diagnostic samples with limited material. From this resource we have defined variation in mitotic mis-segregation rates across tissue types and genotypes. Analysis of matched genomic and image measurements revealed correlations between cellular morphology and genome ploidy states. Aggregation of cells sharing copy number profiles allowed for calculation of single-nucleotide resolution clonal genotypes and inference of clonal phylogenies and avoided the limitations of bulk deconvolution. Finally, joint analysis over the above features defined clone-specific chromosomal aneuploidy in polyclonal populations.

@article{laks2019clonal,
  dimensions = {true},
  title = {Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing},
  author = {Laks, Emma and McPherson, Andrew and Zahn, Hans and Lai, Daniel and Steif, Adi and Brimhall, Jazmine and Biele, Justina and Wang, Beixi and Masud, Tehmina and Ting, Jerome and Grewal, Diljot and Nielsen, Cydney and Leung, Samantha and Bojilova, Viktoria and Smith, Maia and Golovko, Oleg and Poon, Steven and Eirew, Peter and Kabeer, Farhia and Ruiz de Algara, Teresa and Lee, So Ra and Taghiyar, M. Jafar and Huebner, Curtis and Ngo, Jessica and Chan, Tim and Vatrt-Watts, Spencer and Walters, Pascale and Abrar, Nafis and Chan, Sophia and Wiens, Matt and Martin, Lauren and Scott, R. Wilder and Underhill, T. Michael and Chavez, Elizabeth and Steidl, Christian and Da Costa, Daniel and Ma, Yussanne and Coope, Robin J.N. and Corbett, Richard and Pleasance, Stephen and Moore, Richard and Mungall, Andrew J. and Mar, Colin and Cafferty, Fergus and Gelmon, Karen and Chia, Stephen and Consortium, IMAXT and Marra, Marco A. and Hansen, Carl and Shah, Sohrab P. and Aparicio, Samuel},
  journal = {Cell},
  volume = {179},
  number = {5},
  pages = {1207--1221},
  year = {2019},
  publisher = {Elsevier},
  doi = {10.1016/j.cell.2019.10.026},
}