Current course topic: 2016 Tsinghua University Graduate Thesis Defenses (I) > Week 3: Department of Automation, Aziz > Defense Presentation > Defense Presentation
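Hello, teachers and students.
Thank you all for coming to Aziz Khan’s
doctoral dissertation defense.
First,
on behalf of
the Department of Automation doctoral dissertation defense committee,
let me introduce the professors and experts present.
The chair of the committee
is Prof. Li Shao,
from Tsinghua University.
Prof. Bu Dongbo is from the Institute of Computing Technology, Chinese Academy of Sciences,
Prof. Wang Xiujie is from the Institute of Genetics, Chinese Academy of Sciences,
Prof. Deng Minghua is from Peking University,
Prof. Zhang Xuegong is from Tsinghua University,
and Prof. Wang Xiaowo is from Tsinghua University.
Next,
let me give a brief introduction to Aziz.
The title of his defense today is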
Prediction, Analysis and Annotation of
Super-enhancers in the Human and Mouse Genomes.
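I now invite the committee chair, Prof. Li Shao, to preside over the meeting.
Now let us invite Aziz to give his dissertation defense presentation.
The time limit is 40 minutes.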
Good morning, everyone.
Thank you for coming today.
I am Aziz Khan,
a PhD student with Prof. Xuegong Zhang.
So, today I am going to share
with you four years of my life,
three stories in 120 pages in 30 minutes.
So, the title of my PhD dissertation is
Prediction,
Analysis and Annotation of Super-enhancers
in the Human and Mouse Genomes.
Super-enhancers are a recent breakthrough
in gene regulation.
They are clusters of active enhancers
that are densely occupied
by master transcription factors,
Mediator and also transcriptional regulators.
They control cell-identity and disease.
So, enhancers are cis-regulatory elements in
the non-coding part of DNA,
which have binding sites
for transcription factors.
These are then brought to
the proximal gene through a looping mechanism,
which is mediated by cohesin and also Mediator.
The primary role of an enhancer is to enhance
the expression of a gene.
It can be active,
when specific marks are associated with it
(H3K27ac and also H3K4me1),
and it can be inactive
if other histone marks are associated.
It has been 30 years since the first enhancer was discovered.
Since then there has been tremendous development
in methodology and technology
to understand and to find
these enhancer regions in the genome,
such as ChIP-seq,
DNase-seq or ChIP-seq for histone modifications.
International consortia like ENCODE,
the Roadmap Epigenomics Project or FANTOM
used these technologies
(FANTOM, for example, used CAGE technology)
to identify and annotate these enhancers in several cell types.
So, currently we have 1 million enhancers in the human genome,
and thousands of them are in a single cell.
The identification of cell-type-specific
enhancers is challenging
because we don’t know any cell-type-specific marks
or transcription factors for most cell types.
But, interestingly, in ES cells,
the regions co-bound by the three
master TFs (Sox2, Oct4 and Nanog)
have 100% enhancer activity.
Further, interestingly,
a small fraction of these co-bound regions occupies
a large fraction of these enhancers;
these were named ‘super-enhancers’.
They occur in clusters and
are occupied
by all these transcription factors,
which play an important role
in the biology of embryonic stem cells.
So, if we look at the distribution of the ChIP-seq signal,
Med1 has a very high signal
at super-enhancer regions,
but a very low signal
at typical enhancer regions.
The number of super-enhancers is very small
(231 in ES cells), and ~8000 are typical enhancers.
And if we look at ChIP-seq
for other factors (H3K27ac),
we also have a higher signal at super-enhancers.
And we have a higher signal
at the constituents of super-enhancers.
By constituent,
I mean an individual enhancer within that cluster.
Remarkably, the genes which are associated with
ES cells (Sox2, Oct4, Nanog)
are linked
with these super-enhancer regions,
of which there are a few hundred,
but they are not associated with the other
thousands of typical enhancers.
So, these super-enhancer regions can be used to
identify cell-identity genes and factors.
So, super-enhancers are cell-type-specific,
and they play a very important role in controlling
cell identity and disease.
Several disease-associated SNPs have been found
in super-enhancer regions,
and they are also linked to the disease tissue/sample.
You will find a super-enhancer in a disease sample,
but you will not find that super-enhancer in
a normal sample of the same cell type.
More interestingly, these super-enhancers
are sensitive to perturbation.
By perturbation I mean that
if you have a super-enhancer
which is associated
with the expression of the oncogene Myc,
and this super-enhancer is disrupted
using a transcriptional drug (JQ1),
the expression of
the Myc gene is drastically reduced.
So, it shows that these super-enhancers
are very fascinating features in our genome
and that they can be used to
treat life-threatening diseases like cancer.
Super-enhancers were first discovered
by Rick Young’s lab in late 2013.
They reported these findings in
a series of papers.
Since then the research community has adopted
this concept quickly,
and to further understand
the importance of these regions,
more papers came out to
study these super-enhancers
in other diseases and also in other cell types.
This shows the growing interest in this area.
There is a great need to
develop computational resources
and also to perform extensive analysis to understand
these super-enhancers in further detail
and to help the research community
answer several outstanding questions.
The aim of my PhD dissertation is
to perform extensive analysis
and also to create and develop resources
and methods to help the research community.
So,
today I am going to share with you three stories,
which are divided into three parts
– analysis, prediction and annotation.
In the analysis part,
we look into the differences
and similarities between super
and stretch enhancers.
A stretch enhancer is a parallel concept
to the super-enhancer;
these are not clusters of enhancers
but actually larger enhancers,
which are larger than 3 kb.
Next, I will introduce a computational method
(machine learning based),
which can accurately predict super-enhancers.
We validated this method using 10-fold cross-validation
and also by using independent datasets.
In the final part of my talk,
I will introduce a database of super-enhancers,
which is comprehensive in its nature.
So, in the first part,
we performed a comparative analysis of
super and stretch enhancers
across 10 human cell-types
by integrating transcriptomic and epigenomic data.
When we look at super and stretch enhancers,
super-enhancers are defined based on ChIP-seq signal.
Once you rank the enhancers by Med1 signal,
you get this hockey-stick curve;
the enhancers above the point where the slope is 1
are defined as super-enhancers,
and the rest, which number in the thousands, are typical enhancers.
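The ranking step described here can be sketched in Python. This is only an illustration on made-up numbers, not the published pipeline: the function names and the toy signal values are hypothetical, and the tangent point is found with a simple rule (both axes rescaled to [0, 1], then the point where signal minus rank is smallest, which is where a convex curve has slope 1).

```python
def hockey_stick_cutoff(signals):
    """Signal value at the point where the ranked, rescaled curve has slope 1."""
    ranked = sorted(signals)
    n = len(ranked)
    if n < 2 or ranked[-1] <= 0:
        raise ValueError("need at least two enhancers and a positive maximum signal")
    top = ranked[-1]
    # Rescale rank and signal to [0, 1]. For a convex, increasing curve the
    # tangent of slope 1 touches the curve where (y - x) is smallest.
    tangent = min(range(n), key=lambda i: ranked[i] / top - i / (n - 1))
    return ranked[tangent]


def split_enhancers(signal_by_id):
    """Label enhancers above the cutoff as super, the rest as typical."""
    cutoff = hockey_stick_cutoff(signal_by_id.values())
    supers = sorted(e for e, s in signal_by_id.items() if s > cutoff)
    typical = sorted(e for e, s in signal_by_id.items() if s <= cutoff)
    return supers, typical


# Hypothetical toy data: 95 weak enhancers and 5 with very high Med1-like signal.
toy = {f"e{i}": 1.0 + 0.01 * i for i in range(95)}
toy.update({f"s{i}": v for i, v in enumerate([50.0, 60.0, 70.0, 80.0, 100.0])})
supers, typical = split_enhancers(toy)
print(len(supers), len(typical))
```

On real data the curve is noisier than in this toy case, so the exact cutoff depends on how the signal is normalized and stitched before ranking.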
In the case of stretch enhancers,
here on the x-axis you have length
and on the y-axis you have cell-type specificity.
If you look at the cell-type specificity,
it increases with length.
The authors put a cut-off here:
an enhancer
which is larger than 3 kb is a stretch enhancer,
and it is more cell-type-specific.
These stretch enhancers were reported
by Francis Collins’s lab.
Since the discovery of these two parallel concepts,
there has been confusion in the research community about how to
differentiate them,
and some even gave them the collective name SEs
(super/stretch enhancers).
But no comparative analysis
had been performed to understand
how they are similar or
different in terms of sequence,
histone modification profile
and also their ability to express genes.
So, we performed an extensive analysis of
super-enhancers in 10 different cell types.
We downloaded ChIP-seq,
DNase-seq and RNA-seq data from ENCODE.
We identified super and stretch enhancers
and performed further downstream analysis,
such as overlap analysis,
gene expression and also gene ontology analysis.
So, when we look at the distribution of
these super and stretch enhancers,
we can see from this plot that
super-enhancers are fewer in number
than stretch enhancers.
On average, there are 11,000 stretch enhancers
in a single cell type, with an average size of 5 kb,
but super-enhancers are fewer,
700 on average in each cell type,
and their size is 4 times larger than that of stretch enhancers,
at 22 kb.
Next, we associated genes with these super-enhancers
and stretch enhancers.
In this boxplot,
blue is super-enhancers
and orange is stretch enhancers.
We found that the genes associated with
super-enhancers are more highly expressed
than those associated with stretch enhancers,
and the difference is statistically significant.
This is in embryonic stem cells,
and we found similar patterns in
the other 9 cell types,
where the difference is also statistically significant.
Further,
we looked into the histone modification profile.
We looked at H3K27ac.
We found a higher signal for H3K27ac,
which is a mark for active enhancers,
at super-enhancers across all these three cell types.
Further, we found that super
and stretch enhancers have almost equal H3K4me1,
which is a mark for inactive/poised enhancers.
More interestingly,
we found super-enhancers have higher H3K4me3,
which is a mark for active genes or promoters.
And then we looked at RNA Pol II.
Again, we found a higher signal at super-enhancers
than at stretch enhancers
in all these three cell types.
So, this suggests that
super-enhancers might work as promoters.
This can be validated using
the gene-editing technique CRISPR-Cas9.
It also suggests that the majority of stretch enhancers may be poised.
We looked into the sequence-specific differences
and found that
super-enhancers are significantly more conserved
than stretch enhancers.
This is again in ES cells;
the difference is statistically significant,
and we found similar patterns in almost
all of the nine cell types.
Further, we performed an overlap analysis
and found that the majority of super-enhancers
overlap with
a small fraction of stretch enhancers,
which is only 13%.
And the majority of stretch enhancers
don’t overlap with super-enhancers.
Based on this overlap analysis,
we divided these two groups into three.
The first one we call ‘super-stretch’:
these are the stretch enhancers
that overlap with super-enhancers (only 13%).
Next is ‘stretch enhancers’,
which don’t overlap with super-enhancers,
and the third is super-enhancers.
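The three-way grouping can be illustrated with a minimal Python sketch. The coordinates are invented, the function names are hypothetical, and "overlap" is assumed here to mean sharing at least one base on the same chromosome; the talk does not state the exact overlap criterion used.

```python
STRETCH_MIN_LENGTH = 3000  # "larger than 3 kb", as stated in the talk


def is_stretch(region):
    """A region is (chrom, start, end); stretch enhancers are longer than 3 kb."""
    chrom, start, end = region
    return end - start > STRETCH_MIN_LENGTH


def overlaps(a, b):
    """True if two regions share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def group_enhancers(super_enhancers, enhancers):
    """Split regions into the three groups: super, super-stretch and stretch."""
    stretch = [e for e in enhancers if is_stretch(e)]
    # Simple all-against-all scan; fine for a sketch, slow for whole genomes.
    super_stretch = [s for s in stretch
                     if any(overlaps(s, se) for se in super_enhancers)]
    stretch_only = [s for s in stretch if s not in super_stretch]
    return {"super": list(super_enhancers),
            "super-stretch": super_stretch,
            "stretch": stretch_only}


# Hypothetical toy regions.
toy_super = [("chr1", 1000, 25000)]
toy_enhancers = [("chr1", 2000, 6000),    # long, overlaps the super-enhancer
                 ("chr1", 40000, 45000),  # long, no overlap
                 ("chr2", 2000, 6000),    # long, different chromosome
                 ("chr1", 30000, 31000)]  # only 1 kb, not a stretch enhancer
groups = group_enhancers(toy_super, toy_enhancers)
print({name: len(regions) for name, regions in groups.items()})
```

For genome-scale data an interval tree or a sorted sweep would replace the nested scan.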
Again,
we associated genes with these three groups
and found that genes associated with super
and super-stretch are more highly expressed
than those associated with stretch.
We found similar patterns across 10 cell types.
Next, we looked into the cell-type specificity.
Here we can see the H3K27ac signal
across 5 different cell types
in these three groups.
Next, we assigned genes in H1 (ESC)
and looked at the expression of the same genes
in other cell types.
In the first two groups,
we found these H1-associated genes
are significantly expressed
compared with the other cell types.
But in the last group,
it seems that
most of these genes have housekeeping functions.
Further, we looked into the gene ontology terms
and
found cell-type-specific ontology terms like
‘stem cell maintenance’, ‘stem cell development’
and ‘stem cell differentiation’
in super
and super-stretch but not in stretch enhancers.
We also looked at key cell-identity genes,
that is, known genes
which are specific to certain cells.
In ES cells we found SOX2, OCT4 and NANOG
in super and super-stretch but not in stretch enhancers.
We observed the same in K562 for GATA1, JUN and TAL1,
in super and super-stretch
but not in stretch enhancers,
and also in islet cells.
So, to sum up this part:
we found that super-enhancer-associated genes
are significantly more highly expressed
than those of stretch enhancers.
A small fraction of stretch enhancers
overlap with super-enhancers;
we call these ‘super-stretch enhancers’,
and they are more cell-type-specific.
We also observed a significantly higher signal for
H3K27ac, H3K4me3 and Pol II.
Based on that,
we suggest that
super-enhancers might work as promoters.
The manuscript for this work is under preparation
for Epigenetics & Chromatin.
So in the next part,
I will introduce to you a computational model,
which can accurately predict super-enhancers.
We developed it
by integrating several types of datasets.
We also found some key features of
super-enhancers.
So, research has shown that
several chromatin regulators
and the transcriptional apparatus occupy super-enhancers,
but no feature analysis has been done
yet to find their relative
or combinatorial importance.
Further, super-enhancers are densely occupied
by master TFs
and Mediator, but these master TFs are not known
for most cell types,
and doing ChIP-seq for
Mediator is pretty difficult.
And no computational model
has been established yet.
So, we integrated several types of public data.
We downloaded more than
ChIP-seq datasets
for different histone modification,
chromatin regulators and transcription factors.
And we also used DNA motifs
and other sequence-specific features
(conservation score, GC content) to extract features.
We performed data sampling
and then
trained 6 state-of-the-art machine
learning models
(SVM, Random Forest, AdaBoost, Decision Tree, KNN).
We validated each model
using 10-fold cross-validation,
and we also validated these models using
independent datasets in four human cell types.
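To make the validation step concrete, here is a small Python sketch of stratified 10-fold cross-validation. It is not the imPROSE code: to keep it dependency-free, a nearest-centroid classifier stands in for the models named in the talk, and the two-feature toy data and all function names are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps the class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared distance)."""
    def dist(cls):
        return sum((a - b) ** 2 for a, b in zip(centroids[cls], x))
    return min(centroids, key=dist)


def cross_validate(X, y, k=10):
    """Mean held-out accuracy over k folds (assumes every fold is non-empty)."""
    accuracies = []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        centroids = {cls: centroid([X[i] for i in train_idx if y[i] == cls])
                     for cls in set(y[i] for i in train_idx)}
        correct = sum(predict(centroids, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical toy features: 20 "super" (label 1) and 20 "typical" (label 0).
X = [[10 + 0.1 * i, 8 + 0.1 * i] for i in range(20)] + \
    [[1 + 0.1 * i, 0.5] for i in range(20)]
y = [1] * 20 + [0] * 20
print(cross_validate(X, y, k=10))
```

Each sample is held out exactly once, so the reported accuracy is never measured on data the model was fitted to. In practice super-enhancers are far rarer than typical enhancers, which is presumably why the talk mentions data sampling before training.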
When we look at the different features/factors
we used to predict super-enhancers,
here we have these factors at the constituents of
super-enhancer regions,
and here we have typical enhancer regions.
We can see that
these factors have significantly higher
correlation at super-enhancer regions,
which shows that they play combinatorial roles
in the formation of super-enhancers.
So,
we compared 6 state-of-the-art machine
learning models
and found that the ensemble approaches,
like Random Forest
and AdaBoost, performed pretty well
compared with the rest,
although the others can be used to some extent.
We chose to use
Random Forest for further analysis
because we achieved higher precision
and recall compared with AdaBoost.
Again, we validated our model using
an independent dataset.
This is the ROC plot:
on the x-axis we have the false positive rate
and on the y-axis we have the true positive rate.
We got a pretty good AUC.
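As a reminder of what the AUC measures, the following Python sketch computes it directly from labels and prediction scores. It uses the rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half), which equals the area under the ROC curve. The labels and scores are made-up examples, not results from the talk.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = super-enhancer, 0 = typical enhancer.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of super-enhancers from typical enhancers.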
Next,
we ranked the different chromatin regulators
to find
the important features of super-enhancers.
Interestingly,
we found H3K27ac turned out to be the best one, then Brd4.
More interestingly, Cdk8 and Cdk9
turned out to be better than Med12 and p300,
which were known as super-enhancer features.
So,
again we checked the accuracy for
each of these factors
and found that Cdk8 and Cdk9
have almost similar predictive
power compared with H3K27ac and Brd4.
And once we combined the top three features,
we achieved pretty good accuracy.
Next, we ranked transcription factors in ES cells,
and surprisingly we found Smad3
turned out to be a better feature
than Esrrb and Klf4, which have been known
as features of super-enhancers,
as found by Rick Young’s lab.
Again,
we checked the predictive power of these features
and found that Smad3 has higher predictive power
than the rest.
When we combined these top three,
we achieved better accuracy.
So Smad3 is a better predictor than Esrrb and Klf4.
Next,
we looked into the genome-wide profiles of the factors
we found (Cdk8, Cdk9 and Smad3).
When we look at the ChIP-seq signal
across the super-enhancers
and typical enhancers,
which were defined based on Med1,
we can see a strong signal at super-enhancers
for all three,
and here is a genome browser screenshot of
these marks at the Sox2 gene locus.
Further, we found that these three factors
(Cdk8, Cdk9 and Smad3)
are highly correlated with Med1.
This correlation is at enhancer regions, actually.
We also found that
Smad3 is highly correlated with p300,
which is a mark for enhancers.
Next, we identified super-enhancers
using these three factors,
and most of these super-enhancers overlap
with super-enhancers identified by Med1.
More interestingly, when
we look at the super-enhancers
which were found only by this factor but not by these,
we found cell-type-specific gene ontology terms
for these as well.
So, most of these super-enhancers overlap.
When we look at the ChIP-seq signal,
we have a much higher signal
at super-enhancer regions
but not at typical enhancers.
We also found more motifs for
Klf4/Esrrb in super-enhancers
than in typical enhancers.
Most of the genes associated with
the super-enhancers found
by the four different factors overlap.
So, to check this in a more differentiated cell type,
we tried to identify super-enhancers using
Smad3 in pro-B cells.
Again, we looked at the ChIP-seq signal
at super-enhancers identified by Med1
and found similar patterns.
Here is the genome browser screenshot
at the Foxo1 gene,
which is pro-B specific.
And once we ranked these,
in this hockey-stick plot,
we found that
Smad3 turned out to be better than H3K27ac,
and we have a better cut-off for Med1.
We also have a higher ChIP-seq signal at
super-enhancers than at typical enhancers.
And
we found cell-type-specific gene ontology terms
as well
by defining super-enhancers using Smad3.
So, this shows that super-enhancers
can be defined using Cdk8, Cdk9 and Smad3.
Next, we tried to identify super-enhancers
using other factors
which have already been used by the research community:
Brd4, H3K27ac, Tex10
and also these ES-specific factors.
We found that
Smad3 is highly correlated with Med1 and p300.
Here we have a cluster for Sox2, Nanog and Oct4,
which are ES-specific.
Also, Brd4,
Cdk8 and Cdk9 are correlated with H3K27ac.
Next, we ranked these different factors
to see which can do better,
based on these ES identity genes.
This is the rank of the super-enhancer,
and we calculated the average rank.
We found that Smad3 and p300
turned out to be better than Med1.
Here we have a genome browser screenshot for
all the factors we used, at the Nanog gene locus.
So, to summarize this part:
we developed a model, which we call imPROSE,
which can accurately predict super-enhancers.
We validated our model
using independent datasets
in four human cell types.
Further, we found that Cdk8
and Cdk9 are new features
that can define super-enhancers.
We also developed our pipeline
as a Python package,
which is publicly available on GitHub.
We presented this at several conferences
(ISMB 2015, Cold Spring Harbor, Suzhou),
and the manuscript for this work is currently
under revision for Genome Biology.
So, in the final part,
I will introduce to you a database of
super-enhancers,
which we call dbSUPER.
The motivation behind this is that,
as I mentioned earlier, super-enhancers play
a critical role in cell identity and disease.
Several papers have generated
super-enhancer data,
but it is all dumped in supplementary files.
Currently a catalog of super-enhancers
in mouse cells is lacking.
There is also a need to develop a database
to streamline downstream analysis
and to help the research community.
So, we developed this database;
here is the workflow.
We used the published pipeline
and also carefully curated data generated by other labs.
We stored this in a MySQL database.
We provide a user-friendly website
for users, on our server,
which comes with several features.
It is linked with external resources
and with other web servers.
It has an interactive user interface.
It has fast searching and browsing.
Data can be downloaded in several formats.
It also has an overlap analysis tool.
I will go through these features quickly
in the next slides.
First, we created a map of super-enhancers
in the mouse genome.
We used ChIP-seq data
for H3K27ac from mouseENCODE.
This is the distribution of super
and typical enhancers:
we have 7% super-enhancers
and 93% typical enhancers.
Here is a genome browser screenshot for
different histone modifications,
and also Pol II, CTCF
and RNA-seq, at super and typical enhancers.
So,
our database provides a responsive user interface.
By responsive I mean
you can use our database even on your tablet
or on your smartphone.
dbSUPER provides very fast searching
and browsing facilities.
You can easily view the data in these nice tables.
And once you click a specific super-enhancer,
it will show you all these details.
The super-enhancer is linked
with external resources;
you can download the FASTA sequence
and also the WIG file.
It will tell you the details of
how this super-enhancer was identified
and which data were used to
identify it.
Further,
it provides easy download and import features.
You can download our data as BED, FASTA
and UCSC Genome Browser tracks.
And you can import it into other web servers,
like GREAT, to perform gene ontology analysis,
pathway analysis and other analyses.
It connects with Cistrome,
the Galaxy server
and the UCSC Genome Browser.
Further, if you have a personal Galaxy server
and you want to redirect all the data,
you just put your URL here,
and all the data will be redirected to
your personal Galaxy in one click.
Further, it provides this overlap analysis.
If you have a list of regions
you are interested in,
you upload a BED file
and set an overlap threshold.
It will show you this nice plot
and also the list of all super-enhancers
which overlap with your regions,
which can be downloaded as an Excel file.
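The kind of query the overlap tool answers can be sketched in a few lines of Python. This is not dbSUPER's implementation: the function names, the region names such as SE_1 and the coordinates are hypothetical, and the threshold is assumed here to be the fraction of each query region covered by a super-enhancer, which the talk does not specify.

```python
def parse_bed(text):
    """Parse BED text into (chrom, start, end, name) tuples."""
    regions = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        chrom, start, end, *rest = line.split()
        name = rest[0] if rest else f"{chrom}:{start}-{end}"
        regions.append((chrom, int(start), int(end), name))
    return regions


def overlap_hits(queries, super_enhancers, min_fraction=0.5):
    """List (query, super-enhancer, shared bases) for sufficiently covered queries."""
    hits = []
    for q_chrom, q_start, q_end, q_name in queries:
        for s_chrom, s_start, s_end, s_name in super_enhancers:
            if q_chrom != s_chrom:
                continue
            shared = min(q_end, s_end) - max(q_start, s_start)
            # Assumed threshold: fraction of the query covered by the super-enhancer.
            if shared > 0 and shared / (q_end - q_start) >= min_fraction:
                hits.append((q_name, s_name, shared))
    return hits


# Hypothetical user regions and super-enhancer catalog in BED format.
user_bed = "chr1\t100\t200\tq1\nchr1\t900\t1000\tq2\nchr2\t100\t200\tq3\n"
catalog_bed = "chr1\t150\t500\tSE_1\nchr1\t990\t2000\tSE_2\n"
hits = overlap_hits(parse_bed(user_bed), parse_bed(catalog_bed))
print(hits)  # [('q1', 'SE_1', 50)]
```

Lowering `min_fraction` makes the tool more permissive; in this toy case a threshold of 0.05 would also report q2 against SE_2.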
So,
we have a very active user community
across more than 100 countries.
We have the highest numbers of users,
4000, from the US, China and the UK.
To date we have had 55,000 page views, and
our database has been used:
we have 13 citations and have been
mentioned in several journal papers.
So, to sum up this part:
we created a map of super-enhancers
in the mouse genome.
We also developed the first database of
super-enhancers
in the mouse and human genomes,
which is comprehensive, integrated
(it is linked with other resources)
and provides an interactive user interface.
Currently, our database has 82,000 super-enhancers
in 102 human and 25 mouse cell and tissue types.
The database is freely available
on our web server.
And we have already published
this manuscript in NAR 2016.
So, to conclude all three parts:
we found significant differences between super
and stretch enhancers
in terms of sequence,
chromatin modification profile
and RNA Pol II occupancy,
as well as their ability to
express cell-type-specific genes.
Further,
by integrating several types of datasets,
we developed a computational model,
which we call imPROSE,
which can accurately predict super-enhancers.
We also found Smad3, Cdk8
and Cdk9 as novel signatures of super-enhancers,
which can be used to identify super-enhancers.
And we developed the first comprehensive database of
super-enhancers.
With all this,
we extended the current understanding of
super-enhancers,
and the resources we developed
(imPROSE and dbSUPER)
can help the wider research community to
answer several outstanding questions.
So, this is the research output.
We have three papers.
We presented all this work at several conferences.
Here are the URLs for these resources,
and here is another resource,
which I didn’t mention here.
So, I would like to thank my supervisor, Prof. Zhang Xuegong,
for his help and support during
all these four years.
Without his help
I wouldn’t be standing here today.
I would also like to thank the thesis committee
and the reviewer of
my thesis for their useful suggestions.
And I would like to thank Rick Young from MIT,
Bing Ren from UCSD and Wei Xie from Tsinghua
for their useful suggestions
on some parts of my projects.
I would like to
thank the International Student Office,
Tsinghua University,
and also the China Scholarship Council
for awarding a fully funded scholarship to support
my PhD studies here at Tsinghua.
I would like to thank the Bioinformatics Division,
the faculty, staff and all the students,
and especially the students in Prof. Zhang’s lab,
for stimulating discussions.
I would like to thank all my friends and family,
and last but not least
I would like to thank my wife,
who is sitting here,
for her support and
understanding over these four years.
So with this I will stop here.
Thank you,
and I will take your questions.