My research interests are statistical and computational methods for the analysis of genomic data, including methods for multi-locus association studies, methods for detecting and correcting for population stratification, detecting selection on disease susceptibility genes, locus-dependent methods for modelling identity by descent, and various topics in the analysis of second-generation sequencing data.
My main vision on the applied side is to set up large-scale genetic studies that can answer basic biological questions in medical and population genetics. These studies are large both in the number of individuals and in the depth of their phenotypes and multi-omic data: genome-wide genetic data, deep RNAseq data, proteomics, gut microbiome and metabolomics. This data-driven approach will enable us to understand the biological mechanisms that genetic variation acts on. Today we know many genetic variants that affect traits and diseases, but for most of them we don't understand how they do it. A molecular understanding of the pathology of diseases is a prerequisite for future rational treatment and prevention of many common diseases. A similar issue arises when variants are identified that are under adaptive selection. Without an understanding of the drivers of selection, the biological knowledge gained is limited. This is why we need larger and better datasets that can untangle the underlying mechanisms. To do this we also need novel methods that can accommodate such multidimensional data, which is something my lab is working hard on.
In the immediate future we will have generated massive amounts of data for interesting populations. These include the Greenlandic Inuit and Pakistani families, for whom we now have whole-genome information, deep RNAseq from whole blood, proteomic and metabolomic data from blood, and shotgun data for the gut microbiome. With appropriate analysis, this data will enable us to go far beyond gene-phenotype association so that we can actually understand how and why the genetic variants act on the traits.
Although sequencing data has become common in most fields of genetics, there is one sequencing type which has not received much attention despite being the most used form of sequencing. Tens of millions of individuals have now been sequenced using whole-genome ultra-low-depth sequencing due to its use in non-invasive prenatal testing (NIPT) for chromosomal anomalies in the fetus. Other studies have chosen low-depth sequencing in order to increase the number of samples. With the ever-increasing demand for larger sample sizes, cost-effectiveness seems to favor medium- or low-coverage sequencing. For a fixed budget, sequencing more samples at lower depth will generally give better population-scale estimates of genetic variation than sequencing fewer samples at higher depth. With this appealing trade-off, we recently conducted a genomic study on ultra-low-coverage sequencing data of 141K Chinese pregnant women as part of the Chinese Millionome Project. The women underwent NIPT, which is commonly used to test for fetal chromosomal abnormalities. The study provided insight into the genetic structure and history of the Chinese population and included genome-wide association studies (GWAS) with principal components as covariates. The study had an average depth below 0.1X, which allowed for a much larger sample size than in other sequencing projects. However, in order to analyze millions and tens of millions of samples, there are several issues which we try to deal with