Chromosomal instability and its impact on selection during cancer development

Mathematical and computational methods to extract the occurrence rates and selection coefficients of chromosomal instability from DNA-sequencing data


Cancer evolution has been studied extensively as a process of accumulating point mutations, which affect one or a few nucleotides at specific locations in the genome. However, some cancers have been known to be driven mostly by copy number aberrations (CNAs). These events can be inferred by examining the total and allele-specific copy numbers across the cancer genome. Chromosomal instability (CIN) is said to occur when a tumor exhibits a high number of CNAs.

Recent developments in single-cell DNA-sequencing have provided an unparalleled view into the diversity and ongoing evolution within a tumor. Direct Library Development+ (DLP+), developed by the Sohrab Shah Lab, is capable of producing genomic information for tens of thousands of cells, without bias due to amplification or non-uniform coverage (Laks et al., 2019). Using DLP+, researchers have uncovered varying levels of CIN in different cancers, both between patients with the same tumor characteristic and across different cancer types.

DLP+ (Laks et al., 2019) provides copy number (CN) information at the cell level. Left: Overview of DLP+ experimental and computational pipeline. Right: Total copy numbers (top) and minor allele fractions (bottom) across the genome in one cell from an ovarian tumor.

By grouping cells with similar copy numbers into clones and tracking them over time, DLP+ data provides a picture of competition between different clones. In some experiments, some clones were observed expanding through time, while others became extinct. This indicates that selection plays an important role during tumor growth, and certain CNAs are preferred over others as cancer progresses (Salehi et al., 2021).

Evolution of clones with distinct copy numbers from a triple-negative breast tumor over time (Salehi et al., 2021). Left: Phylogeny of cells based on their copy number profiles. Cells with similar copy numbers are grouped into a clone. Right: Temporal trajectory of clonal fractions in the cell population (= number of cells in the clone, divided by number of cells in total).

Furthermore, applications of DLP+ have led to the realization that knock-outs of certain important genes, such as TP53 and BRCA1/2 in the mammary epithelium, are associated with a significant increase in CIN (Funnell et al., 2022). This manifests in higher number of polyploid cells (resulting from whole-genome duplications) and more chromosome missegregations. These results explain the high numbers of CNAs that have been observed in ovarian and breast cancers.

Copy number changes associated with key genes being turned off in mammary epithelial cell lines (Funnell et al., 2022). Top: Total and allele-specific copy numbers in cells with TP53 knocked out (left) and both TP53 and BRCA1 knocked out (right). Bottom: Statistics of cell populations, depending on whether TP53, BRCA1 or BRCA2 are knocked out.

We seek to systematically study how CIN arises and affects the selection landscape during tumor growth. This requires a mathematical framework that incorporates different CNA mechanisms that occur during CIN. We developed CINner, an algorithm to simulate how CIN arises and affects the selection landscape during cancer growth (Dinh et al., 2024). CINner, available as an R package on Github, is designed to efficiently model a cell population undergoing different types of CNAs and mutations, which change the cell karyotypes and increase the clonal diversity.

Mathematical model underlying CINner.

CINner currently supports different CNA mechanisms, including whole-genome duplications, whole-chromosome and chromosome-arm missegregations, focal amplifications and deletions. We included three different selection models, designed to quantify the fitness of chromosome-arm level CNAs or driver mutations, or both.

Schematics for the selection model of chromosome arms (left) and driver mutations (right).

When applied to whole genome sequencing data across all cancers in The Cancer Genome Atlas (TCGA), CINner inferred selection parameters for individual chromosome arms. These selection parameters strongly correlate with the gene imbalance on each arm from Davoli et al., indicating that selection rates inferred from CINner are estimates for the combined effects of genes located on different genomic regions.

Top: Schematic for the inference and analysis of cancer type-specific chromosome-arm selection parameters. Bottom: Comparison of selection rates inferred by CINner (x axis) from pan-cancer TCGA data, against gene balance scores (y axis).

The same inference routine can be applied for individual cancer types, for which CINner finds selection parameters that faithfully recreate the copy number landscapes observed in data. These parameters are inferred using samples without whole-genome duplication (WGD), from which chromosome arms can be classified as GAIN (if the selection rate > 1, hence gains of these arms are advantageous) or LOSS (if the selection rate < 1, in which case losses increase cell fitness). The count and mean selection rate across these inferred GAIN and LOSS arms predict cancer-specific WGD prevalence well. On the one hand, this confirms that WGD is an important event that remolds the selection landscape during cancer development, as has been observed experimentally. On the other hand, this reconfirms CINner’s ability to uncover biologically relevant parameters from sample cohorts of relatively small sizes.

Top: Comparison between gain/loss frequencies from data and CINner with inferred selection rates, for ovary adenocarcinoma. Bottom: Correlation between cancer-specific WGD proportion and count of GAIN arms (left) and mean selection rate of LOSS arms (right).

With CINner, we now have a framework to uncover selection parameters for individual genomic regions. However, we are also interested in finding the rates at which CNA mechanisms occur, such as chromosome missegregations. This requires more information than what bulk genomic data can offer. This is because if a particular CNA is frequently observed in data, it can be explained by a spectrum of parameters in CINner. On one extreme, it could be because the CNA occurrence is high, therefore the event takes place often across different patients even if it does not significantly impact cancer fitness. On the other extreme, it could be because the selection rate associated with the CNA event is high, in which case it takes a long time to emerge but always expands across the entire tumor when it does. The parameters therefore are confounded in bulk data alone.

In (Xiang et al., 2024), we developed a parameter inference routine that employs both bulk data and information from single-cell (sc) sequencing, such as DLP+. The algorithm uses ABC-rf, an Approximate Bayesian Computation (ABC) method with random forests that produces reliable posterior distributions with high tolerance for noise. The ABC framework summarizes both data and model simulations with statistics, for which we considered a wide range of measurements based on bulk CN, single-cell CN profiles, and the phylogeny for cells in the single-cell samples.

Top: The inference method relies on ABC-rf to find parameter posterior distributions. Bottom: Overview of some statistics based on single-cell phylogeny (left) and distance between observed and simulated CN profiles (right).

Using this framework, we are able to uncover the true values of both the missegregation probability and the selection parameters for individual chromosomes. This comparison was performed with synthetic tests, where we know the true values against which the parameter posterior distributions can be compared. The posterior distributions are unimodal, which indicates identifiability, and center around the ground truth values.

Inference of missegregation probability (top left) and selection parameters for individual chromosomes in synthetic testing. For each parameter, the posterior distribution (dark blue; broken lines indicate mean, median and mode) is inferred from a uniform prior distribution (light blue), compared to the ground truth value (black line). The inference is accurate if the posterior distribution centers around ground truth value.

Importantly, the accuracy and uncertainty of the results (quantified as Root mean square error (RMSE) and standard deviation of the posteriors, respectively) do not increase significantly for small sample sizes of either single-cell or bulk cohorts. Minimum error can be achieved with as few as 40 sc and 50 bulk samples, while minimum uncertainty is reached with 20 and 30 bulk samples. These requirements are already satisfied with currently available data for some cancer types. We expect that as DNA sequencing becomes an essential part of cancer diagnosis, our simulation and inference framework will become a pivotal tool to analyze the data toward better understanding of cancer evolution and adaptation.

RMSE and standard deviation of the posterior distributions, as a function of sample size of the single-cell cohort (left) and the bulk data (right)

ABC-rf’s insensitivity to noise is essential for incorporating a large number of statistics in our inference method. It therefore offers an incredible advantage over traditional ABC methods, which require selecting substantially important hyperparameters and therefore are not flexible for such expansive problems and models such as ours. However, ABC-rf often requires a large training set and therefore long running time. To improve its performance, we developed Approximate Bayesian Computation sequential Monte Carlo with (distributional) random forests (ABC-SMC-RF) (Dinh et al., 2024).

A tree in the random forest grows from a root node, composed of a subsample or bootstrap sample of the training set. The algorithm then repeatedly divides each node, each time by choosing a particular statistic and segregating the node’s simulations into those whose values are lower or higher than a threshold. The choice of statistic and threshold at each node is made such that the simulations in each child node are most “similar” to each other, with respect to the given statistic. Finally, the algorithm traverses each tree with the statistics measured from the data. Simulations in the same leaf that the data ends up in are deemed closer to the data, hence the parameters of those simulations form an approximation for the posterior distribution.

ABC-SMC-RF incorporates the random forest within the framework of sequential Monte Carlo (SMC), and it is available as an R package on Github. In SMC, the posterior distribution is improved through successive iterations from the prior distribution. Each iteration in ABC-SMC-RF samples the parameters from the previous iteration’s posterior, perturbs the parameters to maintain parameter diversity, then infers the next posterior distribution with random forest.

Left: Schematic for the combination of ABC-SMC-RF and CINner toward inferring parameter posterior distributions. Middle: Schematic for constructing a tree in the random forest. Right: Schematic for inferring a parameter estimate from the tree.

In test studies, ABC-SMC-RF performs well across a wide range of mathematical models, including both deterministic and stochastic models with varying complexity levels and parameter counts. Its results are on par with traditional ABC methods with carefully calibrated hyperparameters. Moreover, ABC-SMC-RF tends to perfom better than previous random forest methods, which are also calibration-free. This is because in those algorithms, most simulations with parameters sampled directly from the prior distribution are significantly different from the data. They therefore require a large training set to select the most relevant parameter regions. By comparison, ABC-SMC-RF iteratively updates the parameter distributions, therefore more of the simulations are relevant to the data. Another practical advantage is that the random forest in each iteration can be trained on smaller training sets, therefore reducing computational runtime and storage requirement.

Comparison of posterior distributions from ABC-SMC-RF (red) with a previous ABC-RF method (blue) and ABC-SMC with calibrated hyperparameters (yellow) for the Lotka-Volterra model.

References

2024

  1. dinh2024cinner.jpg
    CINner: modeling and simulation of chromosomal instability in cancer at single-cell resolution
    Khanh N. Dinh, Ignacio Vázquez-Garcı́a, Andrew Chan, and 4 more authors
    bioRxiv, 2024
  2. xiang2024inference.jpg
    Inference of chromosome selection parameters and missegregation rate in cancer from DNA-sequencing data
    Zijin Xiang, Zhihan Liu, and Khanh N. Dinh
    Scientific Reports, 2024
  3. dinh2024approximate.jpg
    Approximate Bayesian Computation sequential Monte Carlo via random forests
    Khanh N. Dinh, Zijin Xiang, Zhihan Liu, and 1 more author
    arXiv, 2024

2022

  1. funnell2022single.jpg
    Single-cell genomic variation induced by mutational processes in cancer
    Tyler Funnell, Ciara H O’Flanagan, Marc J Williams, and 8 more authors
    Nature, 2022

2021

  1. salehi2021clonal.jpg
    Clonal fitness inferred from time-series modelling of single-cell cancer genomes
    Sohrab Salehi, Farhia Kabeer, Nicholas Ceglia, and 8 more authors
    Nature, 2021

2019

  1. laks2019clonal.jpg
    Clonal decomposition and DNA replication states defined by scaled single-cell genome sequencing
    Emma Laks, Andrew McPherson, Hans Zahn, and 8 more authors
    Cell, 2019