Khanh N. Dinh Lab

Office 601B

Schermerhorn Hall

Columbia University

1190 Amsterdam Avenue

New York, NY 10027

Our long-term research interests are in developing mathematical models and bioinformatic algorithms to characterize the formation and selection of biological processes during tumorigenesis. The methods are applicable to a wide range of cancer DNA-sequencing data, from the bulk level down to single-cell resolution, and help elucidate the evolutionary histories of mutations and different copy number aberration (CNA) mechanisms. Two major projects that we are working on are:

Inference of chromosomal instability rates and selection coefficients of genomic regions in ovarian and breast cancers
Detection of clonality and estimation of clonal ages and growth rates from bulk DNA-sequencing data

news

Nov 26, 2025	Two of our 2025 papers, which introduced the CINner algorithm for simulating chromosomal instability and the ABC-SMC-DRF algorithm for parameter inference, are included in IICD’s annual highlights.
Nov 05, 2025	Our new paper introduces a novel noise-insensitive and efficient parameter inference algorithm.
Nov 04, 2025	Our work on analyzing genomic, proteomic and metabolomic progression of bladder cancer at the whole-organ scale is highlighted in Nature Reviews Urology.
Jul 31, 2025	We are showcased in a video featuring IICD at the Joint Statistical Meetings 2025.
Jul 28, 2025	Sara El Baghdadi and Ethan Cohen, our 2025 Alliance interns, offer insights into their time in our lab.
Apr 04, 2025	CINner, our computational framework for chromosomal instability, is published.
Feb 17, 2025	Read the profile of Zijin Xiang, a former intern in our lab.

selected publications

Understanding the finite state projection and related methods for solving the chemical master equation

Khanh N. Dinh and Roger B. Sidje

Physical Biology, 2016

The finite state projection (FSP) method has enabled us to solve the chemical master equation of some biological models that were considered out of reach not long ago. Since the original FSP method, much effort has gone into transforming it into an adaptive time-stepping algorithm as well as studying its accuracy. Some of the improvements include the multiple time interval FSP, the sliding windows, and most notably the Krylov-FSP approach. Our goal in this tutorial is to give the reader an overview of the current methods that build on the FSP.
@article{dinh2016understanding,
  dimensions = {true},
  title = {Understanding the finite state projection and related methods for solving the chemical master equation},
  author = {Dinh, Khanh N. and Sidje, Roger B.},
  journal = {Physical Biology},
  volume = {13},
  number = {3},
  pages = {035003},
  year = {2016},
  publisher = {IOP Publishing},
  doi = {10.1088/1478-3975/13/3/035003},
}
Statistical inference for the evolutionary history of cancer genomes

Khanh N. Dinh, Roman Jaksik, Marek Kimmel, Amaury Lambert, and Simon Tavaré

Statistical Science, 2020

Recent years have seen considerable work on inference about cancer evolution from mutations identified in cancer samples. Much of the modeling work has been based on classical models of population genetics, generalized to accommodate time-varying cell population size. Reverse-time, genealogical views of such models, commonly known as coalescents, have been used to infer aspects of the past of growing populations. Another approach is to use branching processes, the simplest scenario being the classical linear birth-death process. Inference from evolutionary models of DNA often exploits summary statistics of the sequence data, a common one being the so-called Site Frequency Spectrum (SFS). In a bulk tumor sequencing experiment, we can estimate for each site at which a novel somatic point mutation has arisen, the proportion of cells that carry that mutation. These numbers are then grouped into collections of sites which have similar mutant fractions. We examine how the SFS based on birth-death processes differs from those based on the coalescent model. This may stem from the different sampling mechanisms in the two approaches. However, we also show that despite this, they are quantitatively comparable for the range of parameters typical for tumor cell populations. We also present a model of tumor evolution with selective sweeps, and demonstrate how it may help in understanding the history of a tumor as well as the influence of data pre-processing. We illustrate the theory with applications to several examples from The Cancer Genome Atlas tumors.
@article{dinh2020statistical,
  dimensions = {true},
  title = {Statistical inference for the evolutionary history of cancer genomes},
  author = {Dinh, Khanh N. and Jaksik, Roman and Kimmel, Marek and Lambert, Amaury and Tavar{\'e}, Simon},
  journal = {Statistical Science},
  volume = {35},
  number = {1},
  pages = {129--144},
  year = {2020},
  publisher = {JSTOR},
  doi = {10.1214/19-sts7561},
}
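As a rough illustration of the Site Frequency Spectrum described in the abstract above, the sketch below groups per-site mutant fractions into equal-width bins and counts the sites in each bin. The function name `site_frequency_spectrum`, the choice of equal-width bins, and the sample values are assumptions made for this example; they are not taken from the paper, which may define the SFS differently.

```python
def site_frequency_spectrum(mutant_fractions, n_bins=10):
    """Count sites per mutant-fraction bin.

    Bin k covers [k/n_bins, (k+1)/n_bins); a fraction of exactly 1.0
    is placed in the last bin. The equal-width binning is an assumption
    of this sketch, not a rule from the paper.
    """
    if n_bins <= 0:
        raise ValueError("n_bins must be positive")
    counts = [0] * n_bins
    for fraction in mutant_fractions:
        if not 0.0 < fraction <= 1.0:
            raise ValueError("mutant fractions must lie in (0, 1]")
        # Clamp so that fraction == 1.0 falls into the last bin.
        index = min(int(fraction * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical mutant fractions for five mutated sites.
    fractions = [0.05, 0.12, 0.18, 0.55, 1.0]
    print(site_frequency_spectrum(fractions, n_bins=5))
```

In practice the mutant fractions would first have to be estimated from read counts, a step this sketch leaves out.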
Inference of chromosome selection parameters and missegregation rate in cancer from DNA-sequencing data

Zijin Xiang, Zhihan Liu, and Khanh N. Dinh

Scientific Reports, 2024

Aneuploidy is frequently observed in cancers and has been linked to poor patient outcome. Analysis of aneuploidy in DNA-sequencing (DNA-seq) data necessitates untangling the effects of the Copy Number Aberration (CNA) occurrence rates and the selection coefficients that act upon the resulting karyotypes. We introduce a parameter inference algorithm that takes advantage of both bulk and single-cell DNA-seq cohorts. The method is based on Approximate Bayesian Computation (ABC) and utilizes CINner, our recently introduced simulation algorithm of chromosomal instability in cancer. We examine three groups of statistics to summarize the data in the ABC routine: (A) Copy Number-based measures, (B) phylogeny tip statistics, and (C) phylogeny balance indices. Using these statistics, our method can recover both the CNA probabilities and selection parameters from ground truth data, and performs well even for data cohorts of relatively small sizes. We find that only statistics in groups A and C are well-suited for identifying CNA probabilities, and only group A carries the signals for estimating selection parameters. Moreover, the low number of CNA events at large scale compared to cell counts in single-cell samples means that statistics in group B cannot be estimated accurately using phylogeny reconstruction algorithms at the chromosome level. As data from both bulk and single-cell DNA-sequencing techniques becomes increasingly available, our inference framework promises to facilitate the analysis of distinct cancer types, differentiation between selection and neutral drift, and prediction of cancer clonal dynamics.
@article{xiang2024inference,
  dimensions = {true},
  title = {Inference of chromosome selection parameters and missegregation rate in cancer from DNA-sequencing data},
  author = {Xiang, Zijin and Liu, Zhihan and Dinh, Khanh N.},
  journal = {Scientific Reports},
  volume = {14},
  number = {1},
  pages = {17699},
  year = {2024},
  publisher = {Nature Publishing Group UK London},
  doi = {10.1038/s41598},
}
Approximate Bayesian computation sequential Monte Carlo via random forests

Khanh N. Dinh, Cécile Liu, Zijin Xiang, Zhihan Liu, and Simon Tavaré

Statistics and Computing, 2025

Approximate Bayesian Computation (ABC) is a popular inference method when likelihoods are hard to come by. Practical bottlenecks of ABC applications include selecting statistics that summarize the data without losing too much information or introducing uncertainty, and choosing distance functions and tolerance thresholds that balance accuracy and computational efficiency. Recent studies have shown that ABC methods using random forest (RF) methodology perform well while circumventing many of ABC’s drawbacks. However, RF construction is computationally expensive for large numbers of trees and model simulations, and there can be high uncertainty in the posterior if the prior distribution is uninformative. Here we further adapt random forests to the ABC setting in two ways. The first exploits distributional random forests to provide a direct method for inferring the joint posterior distribution of parameters of interest, while the second describes a sequential Monte Carlo approach which updates the prior distribution iteratively to focus on the most likely regions in the parameter space. We show that the new methods can accurately infer posterior distributions for a wide range of deterministic and stochastic models in different scientific areas.
@article{dinh2025approximate,
  dimensions = {true},
  title = {Approximate Bayesian computation sequential Monte Carlo via random forests},
  author = {Dinh, Khanh N. and Liu, C{\'e}cile and Xiang, Zijin and Liu, Zhihan and Tavar{\'e}, Simon},
  journal = {Statistics and Computing},
  year = {2025},
  doi = {10.1007/s11222-025-10748-x},
}
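For readers new to ABC, the sketch below shows the basic rejection variant that the abstract above takes as its starting point: draw a parameter from the prior, simulate data, and keep the draw if a summary statistic lands within a tolerance of the observed one. This is not the random-forest or sequential Monte Carlo method from the paper. The binomial toy model, the Uniform(0, 1) prior, the tolerance, and all function names are assumptions chosen for illustration.

```python
import random


def simulate_successes(p, n_trials, rng):
    """Simulate the number of successes in n_trials Bernoulli(p) draws."""
    return sum(rng.random() < p for _ in range(n_trials))


def abc_rejection(observed, n_trials, n_draws=5000, tolerance=2, seed=0):
    """Plain rejection ABC for a success probability p (toy model).

    Prior: p ~ Uniform(0, 1) (assumed). Summary statistic: the success
    count. Distance: absolute difference between simulated and observed
    counts. Returns the accepted prior draws as an approximate posterior
    sample.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # draw a candidate from the assumed prior
        simulated = simulate_successes(p, n_trials, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(p)
    return accepted


if __name__ == "__main__":
    # Hypothetical data: 15 successes observed in 50 trials.
    posterior = abc_rejection(observed=15, n_trials=50)
    if posterior:
        print(len(posterior), sum(posterior) / len(posterior))
```

The sketch makes the bottlenecks named in the abstract concrete: the summary statistic, distance function, and tolerance all have to be chosen by hand, and most prior draws are discarded.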
CINner: modeling and simulation of chromosomal instability in cancer at single-cell resolution

Khanh N. Dinh, Ignacio Vázquez-García, Andrew Chan, Rhea Malhotra, Adam Weiner, Andrew W. McPherson, and Simon Tavaré

PLoS Computational Biology, 2025

Cancer development is characterized by chromosomal instability, manifesting in frequent occurrences of different genomic alteration mechanisms ranging in extent and impact. Mathematical modeling can help evaluate the role of each mutational process during tumor progression; however, existing frameworks can only capture certain aspects of chromosomal instability (CIN). We present CINner, a mathematical framework for modeling genomic diversity and selection during tumor evolution. The main advantage of CINner is its flexibility to incorporate many genomic events that directly impact cellular fitness, from driver gene mutations to copy number alterations (CNAs), including focal amplifications and deletions, missegregations and whole-genome duplication (WGD). We apply CINner to find chromosome-arm selection parameters that drive tumorigenesis in the absence of WGD in chromosomally unstable cancer types from the Pan-Cancer Analysis of Whole Genomes (PCAWG, n=718). We found that the selection parameters predict WGD prevalence among different chromosomally unstable tumors, hinting that the selective advantage of WGD cells hinges on their tolerance for aneuploidy and escape from nullisomy. Analysis of inference results using CINner across cancer types in The Cancer Genome Atlas (n=8207) further reveals that the inferred selection parameters reflect the bias between tumor suppressor genes and oncogenes on specific genomic regions. Direct application of CINner to model the WGD proportion and fraction of genome altered (FGA) in PCAWG uncovers the increase in CNA probabilities associated with WGD in each cancer type. CINner can also be utilized to study chromosomally stable cancer types, by applying a selection model based on driver gene mutations and focal amplifications or deletions (chronic lymphocytic leukemia in PCAWG, n=95).
Finally, we used CINner to analyze the impact of CNA probabilities, chromosome selection parameters, tumor growth dynamics and population size on cancer fitness and heterogeneity. We expect that CINner will provide a powerful modeling tool for the oncology community to quantify the impact of newly uncovered genomic alteration mechanisms on shaping tumor progression and adaptation.
@article{dinh2025cinner,
  dimensions = {true},
  title = {CINner: modeling and simulation of chromosomal instability in cancer at single-cell resolution},
  author = {Dinh, Khanh N. and V{\'a}zquez-Garc{\'\i}a, Ignacio and Chan, Andrew and Malhotra, Rhea and Weiner, Adam and McPherson, Andrew W. and Tavar{\'e}, Simon},
  journal = {PLoS Computational Biology},
  year = {2025},
  doi = {10.1371/journal.pcbi.1012902},
}