An HMM-CNN Method for Inferring Natural Selection Strengths in Evolutionary History


Hyong Hark Lee, Nhung Hoang, Sara Mathieson

Swarthmore Computer Science Department

Advances in genetic sequencing technology have given rise to an abundance of accessible genetic data. Recent work on inferring populations’ evolutionary histories from genetic data has turned to machine learning to take advantage of this abundance. We propose a pipeline from a Hidden Markov Model (HMM) to a Convolutional Neural Network (CNN) that retrieves global and local information about a population sample to predict where and how strongly natural selection has affected that population. HMMs have been an effective unsupervised method for capturing general trends across the entire genome. We hypothesize that CNNs, a successful class of deep learning models, can learn to detect local patterns within a genomic region with the help of global information supplied by the HMM. Our objective is to develop an integrated method that identifies regions under natural selection from raw genetic data and improves upon traditional methods, which use summary statistics to capture measurable information about a population’s evolutionary events. Summary statistics drastically reduce the dimension of the information, and the results are often affected by confounding variables. The model is trained on population genetic data generated by coalescent simulators. Results indicate that the integrated model, in which deep learning takes advantage of the global information learned by the HMM, performs more accurately and reliably than a CNN given only the original sequence information. We apply the model to infer natural selection strengths in regions of chromosome 2 in the population of individuals with Mexican ancestry in Los Angeles, CA, via the 1000 Genomes Project (