Individual Defense Presentation - 2016 Tsinghua University Graduate Thesis Defenses (Part 1)

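Hello, teachers and students.

We are very glad that you could attend Aziz Khan's

doctoral dissertation defense.

Now,

first, on behalf of

the doctoral dissertation defense committee of the Department of Automation, let me introduce

the teachers and experts present.

The chair of this

committee is Prof. Li Shao,

from Tsinghua University.

Prof. Bu Dongbo is from the Institute of Computing Technology, Chinese Academy of Sciences.

We also have Prof. Wang Xiujie from the Institute of Genetics and Developmental Biology, Chinese Academy of Sciences,

Prof. Deng Minghua from Peking University,

Prof. Zhang Xuegong from Tsinghua University,

and Prof. Wang Xiaowo from Tsinghua University.

Next,

I will give a brief introduction to Aziz's background.

The title of his defense today is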

Prediction, Analysis and Annotation of

Super-enhancers in the Human and Mouse Genomes.

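Now I invite the committee chair, Prof. Li Shao, to preside over the session.

Next, Aziz, please give your dissertation defense presentation.

You have 40 minutes.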

Good morning everyone

Thank you for coming today.

I am Aziz Khan,

a PhD student with Prof. Xuegong Zhang.

So, today I am going to share

with you four years of my life,

three stories in 120 pages in 30 minutes.

So, the title of my PhD dissertation is

Prediction,

Analysis and Annotation of Super-enhancers

in the Human and Mouse Genomes.

Super-enhancers are a recent breakthrough

in gene regulation.

These are clusters of active enhancers

that are densely occupied

by master transcription factors,

Mediator and other transcriptional regulators.

They control cell identity and disease.

So, enhancers are cis-regulatory elements in

the non-coding part of the DNA,

which have binding sites

for transcription factors.

These are then brought to

the proximal gene through a looping mechanism,

which is mediated by cohesin and Mediator.

The primary role of an enhancer is to enhance

the expression of a gene.

It can be active,

when specific marks are associated with it

(H3K27ac and also H3K4me1),

and it can be inactive

if other histone marks are associated.

It has been 30 years since the first enhancer was discovered.

Since then there has been tremendous development

in methodology and technology

to understand and to find

these enhancer regions in the genome,

such as ChIP-seq,

DNase-seq or ChIP-seq for histone modifications.

International consortia like ENCODE,

the Roadmap Epigenomics Project and FANTOM

have used these technologies

(FANTOM, for example, used CAGE)

to identify and annotate enhancers in several cell types.

So, currently we have 1 million enhancers in the human genome,

and thousands of them are found in a single cell type.

The identification of cell-type-specific

enhancers is challenging

because we don’t know any cell-type-specific marks

or transcription factors for most of the cell-types.

But interestingly, in ES cells,

for the master TFs (Sox2, Oct4 and Nanog),

the regions co-bound by all

three factors have 100% enhancer activity.

Further, interestingly, among these co-bound regions,

a small fraction of the regions accounts for

a large fraction of the occupancy at these enhancers;

these were named ‘super-enhancers’.

They occur in clusters and

they are occupied

by all these transcription factors,

which play an important role

in the biology of embryonic stem cells.

So, if we look at the distribution of the ChIP-seq signal,

Med1 has a very high signal

at super-enhancer regions,

but a very low signal

at typical enhancer regions.

The number of super-enhancers is very small

(231 in ES cells), while ~8000 are typical enhancers.

And if we look at ChIP-seq

for other factors (H3K27ac),

we also have a higher signal at super-enhancers.

And we have a higher signal

at the constituents of super-enhancers.

By constituent,

I mean an individual enhancer within that cluster.

Remarkably, the genes which are associated with

ES cell identity (Sox2, Oct4, Nanog)

are linked

with these super-enhancer regions,

of which there are a few hundred,

but they are not associated with the

thousands of typical enhancers.

So, these super-enhancer regions can be used to

identify cell-identity genes and factors.

So, super-enhancers are cell-type-specific

and they play a very important role in controlling

cell identity and disease.

Several disease-associated SNPs have been found

in super-enhancer regions,

and they are also specific to the disease tissue/sample.

You will find a super-enhancer in a disease sample

but you will not find that super-enhancer in

a normal sample of the same cell type.

More interestingly, these super-enhancers

are sensitive to perturbation.

By perturbation I mean:

if you have a super-enhancer

which is associated

with the expression of the oncogene Myc,

and this super-enhancer is disrupted

using a transcriptional drug (JQ1),

the expression of

the Myc gene is drastically reduced.

So, it shows that these super-enhancers

are very fascinating features in our genome

and that they can be used to

treat life-threatening diseases like cancer.

Super-enhancers were first discovered

by Rick Young’s lab in late 2013.

They reported these findings in

a series of papers.

Since then the research community has adopted

this concept quickly,

and to further understand

the importance of these regions,

more papers came after that to

study these super-enhancers

in other diseases and also in other cell types.

Given the growing interest in this area,

there is a greater need to

develop computational resources

and also to perform extensive analysis to understand

these super-enhancers in further detail

and to help the research community

to answer several outstanding questions.

The aim of my PhD dissertation is

to perform extensive analysis

and also to develop resources

and methods to help the research community.

So,

today I am going to share with you three stories,

which are divided into three parts

– analysis, prediction and annotation.

In the analysis part,

we are looking into the differences

and similarities between super

and stretch enhancers.

The stretch enhancer is a parallel concept

to the super-enhancer.

These are not clusters of enhancers;

they are actually large enhancers,

longer than 3 kb.

Next, I will introduce a computational method

(machine learning based),

which can accurately predict super-enhancers.

We validated this (method) using 10-fold cross-validation

and also by using independent datasets.

In the final part of my talk,

I will introduce a database of super-enhancers,

which is comprehensive in its nature.

So, in the first part,

we performed a comparative analysis of

super and stretch enhancers

across 10 human cell-types

by integrating transcriptomic and epigenomic data.

When we look at super and stretch enhancers,

super-enhancers are defined based on ChIP-seq signal.

Once you rank the enhancers by Med1 signal,

you get this hockey-stick curve.

The enhancers above the point where the slope is 1

are defined as super-enhancers,

and the rest, which number in the thousands, are typical enhancers.
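The hockey-stick cutoff described above can be sketched in a few lines. This is a minimal illustration and not the pipeline used in the thesis: the function name, the scaling of both axes to [0, 1], and the choice to treat the tangent point itself as a typical enhancer are all assumptions made for this example.

```python
def split_super_typical(signals):
    """Split enhancers into (super, typical) with a slope-1 cutoff.

    signals: dict mapping enhancer id -> non-negative ChIP-seq signal
    (for example Med1). All names here are hypothetical.
    """
    ranked = sorted(signals.items(), key=lambda kv: kv[1])  # weakest first
    n = len(ranked)
    top = ranked[-1][1] if ranked else 0
    if n < 2 or top <= 0:
        return [], [name for name, _ in ranked]
    # Scale rank and signal to [0, 1] so that "slope 1" is well defined.
    xs = [i / (n - 1) for i in range(n)]
    ys = [value / top for _, value in ranked]
    # On a convex, increasing curve the tangent of slope 1 touches
    # the curve where y - x is smallest.
    cut = min(range(n), key=lambda i: ys[i] - xs[i])
    typical = [name for name, _ in ranked[:cut + 1]]
    supers = [name for name, _ in ranked[cut + 1:]]
    return supers, typical
```

With a signal profile in which a few enhancers carry most of the signal, only those few end up above the cutoff, which matches the small number of super-enhancers mentioned in the talk.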

In the case of stretch enhancers,

here on the x-axis you have length

and on the y-axis you have cell-type specificity.

If you look at the cell-type specificity,

it increases as the length increases.

The authors put a cut-off over here:

an enhancer

which is longer than 3 kb is a stretch enhancer

and it is more cell-type-specific.

These stretch enhancers were reported

by Francis Collins’s lab.

Since the discovery of these two parallel concepts,

there has been confusion in the research community about how to

differentiate them,

and some even gave them the collective name SEs

(super/stretch enhancers).

But no comparative analysis

had been performed to understand

how they are similar or

different in terms of sequence,

histone modification profile

and their ability to express genes.

So, we performed an extensive analysis of

super-enhancers in 10 different cell types.

We downloaded ChIP-seq,

DNase-seq and RNA-seq data from ENCODE.

We identified super and stretch enhancers

and performed further downstream analysis,

like overlap analysis,

gene expression and gene ontology analysis.

So, when we look at the distribution of

these super and stretch enhancers,

we can see from this plot that

super-enhancers are fewer in number

than stretch enhancers.

On average, there are 11,000 stretch enhancers

in a single cell type, with an average size of 5 kb,

but super-enhancers are fewer,

700 on average in each cell type,

and their size, 22 kb,

is about 4 times that of stretch enhancers.

Next, we associated genes with these super-enhancers

and stretch enhancers.

In this boxplot,

blue is super-enhancers

and orange is stretch enhancers.

So, we found that the genes associated with

super-enhancers are more highly expressed

than those associated with stretch enhancers,

and the difference is statistically significant.

This is in embryonic stem cells,

and we found similar patterns in

the other 9 cell types,

where the difference is also statistically significant.

Further,

we looked at the histone modification profile.

We looked at H3K27ac.

We found a higher signal for H3K27ac,

which is a mark for active enhancers,

at super-enhancers across all three of these cell types.

Further, we found that super

and stretch enhancers have almost equal H3K4me1,

which is a mark for inactive/poised enhancers.

Further, more interestingly,

we found super-enhancers have higher H3K4me3,

which is a mark for active genes or promoters.

And then we looked at RNA Pol II.

Again, we found a higher signal at super-enhancers

than at stretch enhancers

in all three of these cell types.

So, this suggests that

super-enhancers might work as promoters.

This can be validated using

the gene editing technique CRISPR-Cas9.

It also suggests that the majority of stretch enhancers may be poised.

We looked at the sequence-specific differences

and we found

super-enhancers are significantly more conserved

than stretch enhancers.

This is again in ES cells;

the difference is statistically significant

and we see similar patterns in almost

all of the other nine cell types.

Further, we performed an overlap analysis

and we found that the majority of super-enhancers

overlap with

a small fraction of stretch enhancers,

which is only 13%.

And the majority of stretch enhancers

don’t overlap with super-enhancers.

Based on this overlap analysis,

we divided these two groups into three.

The first one we call ‘super-stretch’:

these are the stretch enhancers

that overlap with super-enhancers (only 13%).

Next are ‘stretch enhancers’,

which don’t overlap with super-enhancers,

and the third is super-enhancers.
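The three-way grouping described above can be sketched as a simple interval comparison. This is an illustrative sketch under stated assumptions: regions are (chrom, start, end) tuples with half-open coordinates, any shared base counts as an overlap, an enhancer of exactly 3 kb counts as a stretch enhancer, and the third group keeps all super-enhancers. None of these details are specified in the talk, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def group_enhancers(supers, enhancers, min_stretch_len=3000):
    """Return the 'super', 'super-stretch' and 'stretch' groups.

    supers: super-enhancer regions (called separately from ChIP-seq signal).
    enhancers: candidate enhancer regions; those at least min_stretch_len
    long are treated as stretch enhancers (assumed threshold handling).
    """
    stretches = [e for e in enhancers if e[2] - e[1] >= min_stretch_len]
    super_stretch, stretch_only = [], []
    for region in stretches:
        if any(overlaps(region, se) for se in supers):
            super_stretch.append(region)
        else:
            stretch_only.append(region)
    return {
        "super": list(supers),
        "super-stretch": super_stretch,
        "stretch": stretch_only,
    }
```

The nested loop is quadratic in the number of regions, which is fine for a sketch; a real analysis over thousands of regions per cell type would more likely use a sorted sweep or an interval-tree tool.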

Again,

we associated genes with these three groups,

and we found that genes associated with super

and super-stretch enhancers are more highly expressed

than those associated with stretch enhancers.

We found similar patterns across the 10 cell types.

Next, we looked at cell-type specificity.

Here we can see the H3K27ac signal

across 5 different cell types

in these three groups.

Next, we assigned genes to the H1 (ESC) enhancers

and we looked at the expression of the same genes

in other cell types.

In the first two groups

we found these H1-associated genes

are significantly more highly expressed in H1

than in the other cell types.

But in the last group,

it seems that

most of these genes have housekeeping functions.

Further, we looked at the gene ontology terms

and

we found cell-type-specific ontology terms like

‘stem cell maintenance’, ‘stem cell development’

and ‘stem cell differentiation’

in super

and super-stretch but not in stretch enhancers.

We also looked at key cell-identity genes,

that is, known genes

which are specific to a cell type.

In ES cells we found SOX2, OCT4 and NANOG

in super and super-stretch but not in stretch enhancers.

We observed the same in K562 with GATA1, JUN and TAL1,

in super and super-stretch

but not in stretch enhancers,

and also in islet cells.

So, to sum up this part:

we found super-enhancer-associated genes

are significantly more highly expressed

than stretch enhancer-associated genes.

And a small fraction of stretch enhancers

overlap with super-enhancers,

which we call ‘super-stretch enhancers’

and which are more cell-type-specific.

And

we also observed significantly higher signal for

H3K27ac, H3K4me3 and Pol II.

Based on that,

we suggest that

super-enhancers might work as promoters.

The manuscript for this work is under preparation

for Epigenetics & Chromatin.

So in the next part,

I will introduce to you a computational model,

which can accurately predict super-enhancers.

We developed it

by integrating several types of datasets.

We also found some key features of

super-enhancers.

Research has shown that

several chromatin regulators

and the transcriptional apparatus occupy super-enhancers,

but no feature analysis has been done

yet to find their relative

or combinatorial importance.

Further, super-enhancers are densely occupied

by master TFs

and Mediator, but these master TFs are not known

for most cell types

and doing ChIP-seq for

Mediator is pretty difficult.

And no computational model

has been established yet.

So, we integrated several types of public data.

We downloaded more than

ChIP-seq datasets

for different histone modifications,

chromatin regulators and transcription factors.

And we also used DNA motifs

and other sequence-specific features

(conservation score, GC content) to extract features.

We performed data sampling

and then

we trained 6 state-of-the-art machine

learning models

(SVM, Random Forest, AdaBoost, Decision Tree, KNN).

We validated each model

using 10-fold cross-validation,

and we also validated these models using

an independent dataset in four human cell types.
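The 10-fold cross-validation mentioned above splits the samples into ten parts and holds out each part once for testing. The following is a minimal sketch of that splitting step only, using the standard library; the function name, the shuffling seed and the round-robin fold assignment are assumptions for illustration and not the code used in the thesis.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every sample lands in exactly one test fold. The fixed seed only
    makes the shuffle reproducible in this sketch.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

A model would be trained on each `train` list and scored on the matching `test` list, and the ten scores averaged. Because super-enhancers are much rarer than typical enhancers, a stratified split that keeps the class ratio in each fold may be preferable in practice.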

Now we look at the different features/factors

we used to predict super-enhancers.

Here, we have these factors at the constituents of

super-enhancer regions.

Here, we have typical enhancer regions.

We can see that

these factors have significantly higher

correlation at super-enhancer regions,

and this shows that they play combinatorial roles

in the formation of super-enhancers.

So,

we compared 6 state-of-the-art machine

learning models

and we found that the ensemble approaches,

like Random Forest

and AdaBoost, performed pretty well

compared to the rest,

though the others can be used to some extent.

We chose to use

Random Forest for further analysis

because we achieved higher precision

and recall than with AdaBoost.
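Precision and recall, the two measures used above to choose between models, can be computed directly from true and predicted labels. This is a small sketch with hypothetical names, where label 1 stands for "super-enhancer" and 0 for "typical enhancer".

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly called
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false calls
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision answers "of the regions called super-enhancers, how many really are", and recall answers "of the real super-enhancers, how many were found". Both matter here because the positive class is small.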

Again, we validated our model using

an independent dataset.

This is the ROC plot:

on x-axis we have false positive rate

and on y-axis we have true positive rate.

We got pretty good AUC.
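The AUC of the ROC plot described above has a convenient interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing all positive/negative pairs; the function name is hypothetical, and this pairwise form is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true: binary labels (1 = positive); scores: predicted scores.
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of super-enhancers from typical enhancers.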

Next,

we ranked the different chromatin regulators

to find

the important features of super-enhancers.

Interestingly,

we found H3K27ac turned out to be the best one, then Brd4.

More interestingly, Cdk8 and Cdk9

turned out to be better than Med12 and p300,

which were known as super-enhancer features.

So,

again we checked the accuracy for

each of these factors

and we found that Cdk8 and Cdk9

have almost the same predictive

power as H3K27ac and Brd4.

And once we combined the top three features,

we achieved pretty good accuracy.

Next, we ranked transcription factors in ES cells

and surprisingly we found Smad3

turned out to be a better feature

than Esrrb and Klf4, which have been known

as features of super-enhancers,

found by Rick Young’s lab.

Again

we checked the predictive power of these features,

and we found Smad3 has higher predictive power

than the rest.

When we combined the top three,

we achieved better accuracy.

So Smad3 is a better predictor than Esrrb and Klf4.

Next,

we looked at the genome-wide profiles of the factors

we found (Cdk8, Cdk9 and Smad3).

When we look at the ChIP-seq signal

across the super-enhancers

and the typical enhancers,

which were defined based on Med1,

we can see a strong signal at super-enhancers

for all three,

and here is a genome browser screenshot for

these marks at the Sox2 gene locus.

Further, we found that these three factors

(Cdk8, Cdk9 and Smad3)

are highly correlated with Med1.

This correlation is at enhancer regions, actually.

We also found that

Smad3 is highly correlated with p300,

which is a mark for enhancers.
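The correlation between two factors at enhancer regions, as described above, can be measured by comparing their signal values region by region. The talk does not say which correlation measure was used; the sketch below assumes the Pearson coefficient, and the function and variable names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of signal values.

    x[i] and y[i] are the signals of two factors at the same enhancer i.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_x * var_y)
```

A value close to 1 for, say, Smad3 and Med1 signal across enhancers would mean that regions with strong Med1 occupancy also tend to have strong Smad3 occupancy.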

Next, we identified super-enhancers

using these three factors,

and most of these super-enhancers overlap

with the super-enhancers identified by Med1.

And more interestingly, when

we look at the super-enhancers

which were found only by these factors but not by the others,

we found cell-type-specific gene ontology terms

for these as well.

So, most of these super-enhancers overlap.

When we look at the ChIP-seq signal,

we have a much higher signal

at super-enhancer regions

but not at typical enhancers.

And we also found more motifs for

Klf4/Esrrb in super-enhancers

than in typical enhancers.

Most of the genes associated with

the super-enhancers found

by the four different factors also overlap.

So, to check this in a more differentiated cell type,

we tried to identify super-enhancers using

Smad3 in pro-B cells.

Again, we looked at the ChIP-seq signal

at super-enhancers identified by Med1,

and we found similar patterns.

Here is the genome browser screenshot

at the Foxo1 gene,

which is pro-B-specific.

And when we ranked these

in this hockey-stick plot,

we found that

Smad3 turned out to be better than H3K27ac,

and we have a better cut-off for Med1.

We also have higher ChIP-seq signal at

super-enhancers than at typical enhancers.

And

we found cell-type-specific gene ontology terms

as well

by defining super-enhancers using Smad3.

So, this shows that super-enhancers

can be defined using Cdk8, Cdk9 and Smad3.

Next, we tried to identify super-enhancers

using other factors

which have already been used by the research community:

Brd4, H3K27ac, Tex10

and also these ES-specific factors.

And we found that

Smad3 is highly correlated with Med1 and p300.

Here we have a cluster for Sox2, Nanog and Oct4,

which are ES-specific.

And also Brd4,

Cdk8 and Cdk9 are correlated with H3K27ac.

So, next we ranked these different factors

to see which does better,

based on these ES identity genes.

This is the rank of the associated super-enhancer,

and we calculated the average rank.

And we found that Smad3 and p300

turned out to be better than Med1.
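The average-rank comparison described above can be sketched as follows. For each factor, the super-enhancers are ranked by that factor's signal, each identity gene takes the rank of its associated super-enhancer, and the ranks are averaged, with a lower average meaning the factor places identity genes nearer the top. The function name, the input layout and the handling of genes with no super-enhancer are assumptions for this example, not details given in the talk.

```python
def average_identity_rank(ranked_genes, identity_genes):
    """Average 1-based rank of identity genes in a factor's ranking.

    ranked_genes: genes ordered by the rank of their associated
    super-enhancer for one factor (strongest first).
    identity_genes: known cell-identity genes, e.g. Sox2, Oct4, Nanog.
    Genes absent from the ranking are skipped (an assumption); returns
    None if no identity gene is present.
    """
    position = {}
    for rank, gene in enumerate(ranked_genes, start=1):
        position.setdefault(gene, rank)  # keep the best rank per gene
    ranks = [position[g] for g in identity_genes if g in position]
    return sum(ranks) / len(ranks) if ranks else None
```

Skipping missing genes can flatter a factor that fails to recover an identity gene at all, so a stricter variant might assign such genes a penalty rank instead.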

Here we have a genome browser screenshot for

all the factors we used at the Nanog gene locus.

So, to summarize this part:

we developed a model, which we call imPROSE,

which can accurately predict super-enhancers.

We validated our model

using independent datasets

in four human cell types.

Further, we found that Cdk8

and Cdk9 are new features

that can define super-enhancers.

We also developed our pipeline

as a Python package,

which is publicly available on GitHub.

We presented this at several conferences,

ISMB 2015, Cold Spring Harbor, Suzhou,

and the manuscript for this work is currently

under revision for Genome Biology.

So, in the final part,

I will introduce to you a database of

super-enhancers,

which we call dbSUPER.

The motivation behind this is that,

as I mentioned earlier, super-enhancers play

a critical role in cell identity and disease.

Several papers have generated

super-enhancer data,

but it is all dumped in supplementary files.

And currently a catalog of super-enhancers

in mouse cells is lacking.

There is also a need for a database

to streamline downstream analysis

and to help the research community.

So, we developed this database;

here is the workflow.

We used the published pipeline

and also carefully curated data generated by other labs.

We stored this in a MySQL database.

We provide a user-friendly website

for the users on our server,

which comes with several features.

It is linked with external resources

and also with other web servers.

It has an interactive user interface.

It has fast searching/browsing.

Data can be downloaded in several formats.

And it also has an overlap analysis tool.

So I will go through these features quickly

in the next slides.

So, first we created a map of super-enhancers

in the mouse genome.

We used ChIP-seq data

for H3K27ac from Mouse ENCODE.

This is the distribution of super

and typical enhancers:

we have 7% super-enhancers

and 93% typical enhancers.

Here is a genome browser screenshot for

different histone modifications,

and also Pol II, CTCF

and RNA-seq at super and typical enhancers.

So,

our database provides a responsive user interface,

by responsive I mean,

you can use our database even on your tablet

or on your smartphone.

dbSUPER provides very fast searching

and browsing facilities.

You can easily view the data in these nice tables.

And once you click a specific super-enhancer,

it will show you all these details.

The super-enhancer is linked

with external resources;

you can download the FASTA sequence

and also the WIG file.

It will tell you the details:

how this super-enhancer was identified

and which data was used to

identify it.

Further,

it provides easy download and import features.

You can download our data as BED, FASTA

and UCSC Genome Browser tracks.

And you can send it to other web servers,

like GREAT, to perform gene ontology analysis,

pathway analysis and other analyses.

It connects with Cistrome,

the Galaxy server

and the UCSC Genome Browser.

Further, if you have a personal Galaxy server

and you want to redirect all the data there,

you just put your URL here and

all the data will be redirected to

your personal Galaxy in one click.

Further, it provides this overlap analysis.

If you have a list of regions

you are interested in,

you upload a BED file

and set an overlap threshold.

It will show you this nice plot

and also the list of all super-enhancers

which overlap with your regions,

which can be downloaded as an Excel file.
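The overlap analysis described above, uploading a BED file and setting an overlap threshold, can be sketched as below. This is not dbSUPER's implementation: how the site defines its threshold is not stated in the talk, so this example assumes it means the fraction of each query region covered by a super-enhancer. The function names and the `min_fraction` parameter are hypothetical.

```python
def parse_bed(text):
    """Parse the first three BED columns into (chrom, start, end) tuples."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and browser/track header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def overlapping_super_enhancers(query, supers, min_fraction=0.1):
    """Return (query_region, super_enhancer) pairs passing the threshold.

    A pair passes if the shared bases cover at least min_fraction of the
    query region (assumed meaning of the threshold).
    """
    hits = []
    for q in query:
        q_len = q[2] - q[1]
        if q_len <= 0:
            continue  # ignore empty or malformed regions
        for se in supers:
            if q[0] != se[0]:
                continue
            shared = min(q[2], se[2]) - max(q[1], se[1])
            if shared > 0 and shared / q_len >= min_fraction:
                hits.append((q, se))
    return hits
```

BED coordinates are zero-based and half-open, so two regions that merely touch end-to-start share no bases and are not reported.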

So,

we have a very active user community

across more than 100 countries.

We have the highest number of users,

4000, from the US, China and the UK.

To date we have had 55,000 page views, and

our database has been used:

we have 13 citations and it has been

mentioned in several journal papers.

So, to sum up this part:

we created a map of super-enhancers

in the mouse genome.

We also developed the first database of

super-enhancers

in the mouse and human genomes,

which is comprehensive, integrated

(it is linked with other resources)

and provides an interactive user interface.

Currently, our database has 82,000 super-enhancers

in 102 human and 25 mouse cell and tissue types.

So the database is freely available

on our web-server.

And we already published

this manuscript in NAR 2016.

So, to conclude all these three parts.

We found significant differences between super

and stretch enhancers

in terms of sequence,

chromatin modification profile

and also RNA Pol II occupancy,

and in their ability to

express cell-type-specific genes.

And further,

by integrating several types of datasets,

we developed a computational model,

which we call imPROSE,

which can accurately predict super-enhancers.

We also found Smad3, Cdk8

and Cdk9 to be novel signatures of super-enhancers,

which can be used to identify super-enhancers.

And we developed the first comprehensive database of

super-enhancers.

With all this,

we extended the current understanding of

super-enhancer research

and also our developed resources

(imPROSE and dbSUPER)

can help the wider research community to

answer several outstanding questions.

So, this is the research output.

We have three papers.

We presented all this work at several conferences.

Here are the URLs for these resources,

and here is another resource,

which I didn’t mention here.

So, I would like to thank my supervisor, Prof. Zhang Xuegong,

for his help and support during

all these four years.

Without his help

I wouldn’t be standing here today.

I would like to thank also the thesis committee

and the reviewer of

my thesis for their useful suggestions.

And I would like to thank Rick Young from MIT,

Bing Ren from UCSD and Wei Xie from Tsinghua

for their useful suggestions

on some parts of my projects.

I would like to

thank the International Student Office,

Tsinghua University,

and also the China Scholarship Council

for awarding a fully funded scholarship to support

my PhD studies here at Tsinghua.

I would like to thank the Bioinformatics Division,

the faculty, staff and all the students,

and especially the students in Prof. Zhang’s lab

for stimulating discussions.

I would like to thank all my friends, family

and last but not least

I would like to thank my wife,

she is sitting here,

for her support and

understanding for these four years.

So with this I will stop here,

and thank you

and I will take your questions.
