Our current interests and potential directions are described below (but they will evolve over time!).
More than 90% of common disease loci act through non-coding mechanisms, in which risk variants modulate disease risk by altering gene expression levels rather than protein function. Even in rare diseases, coding variation has so far failed to explain ~50% of causal mechanisms in patients, leaving many patients with uncertain diagnoses and suboptimal therapeutics. Linking non-coding variants to their function in gene regulation and cellular phenotype remains a major open question in the field.
We are interested in addressing this question by developing statistical methods that integrate diverse molecular measurements from cells and identify in which cell types, at what point in life, and where in the body these variants influence gene regulation. We have already learned that these regulatory changes are highly context-specific, so we are interested in using molecular profiling (chromatin, RNA, and possibly post-transcriptional modifications and translation) that captures cells across time and space, through collaborations with experimentalists.
While recent sequencing technology has revealed much of the genetic variation among us, we still lack a comprehensive understanding of very complex genomic regions. Repetitive, duplicated, or hyperpolymorphic regions are examples, including the HLA and KIR regions. But there is a reason why they are so complicated: they are important for biological functions such as immunity and are under strong selective pressure. For example, HLA genes harbor the largest number of disease associations of any locus genome-wide, with the strongest effects on autoimmune diseases.
We are interested in developing statistical methods to accurately decode these complex regions using data from new sequencing technologies (such as those used in the T2T project), and in identifying yet-to-be-seen variation in this dark matter of the genome that might play a role in human traits and diseases.
With these powerful genomic technologies and large-scale datasets in our hands, we have yet to fully translate our genomic understanding into clinical practice. For example, currently used disease entities often include heterogeneous disease conditions, which could delay the delivery of the most effective treatment. These disease entities have often been constructed from empirical clinical knowledge rather than informed by molecular profiling of patients.
While genetics has conventionally focused on identifying risk alleles or molecules associated with a predefined disease category, we could instead use genetics to arrive at data-driven disease definitions. Genetics is an ideal tool in this context, since genetic variation always precedes the development of disease conditions, thereby suggesting causal relationships and utility in predicting outcomes. We could further identify data-driven disease subtypes by applying effective statistical models to high-dimensional genomic modalities. We anticipate that significant associations will point us towards the key genetic and molecular features for classifying patients into fine-grained disease subtypes, which may help patients reach the most effective treatment strategies without trial and error.