9.4.1 Regression analysis: basic ideas (Learn Statistics with Ease)

Well, in this lecture let's continue with the probabilistic model.

To understand the probabilistic model in more depth, we need to start from analysis of variance (ANOVA) and move from its basic ideas into regression analysis.

Here is an example on the slide.

It comes from a survey we once conducted on campus, asking everyone how satisfied they were with the campus canteen.

In this survey, X denotes the students' taste preference, which falls into four categories.

The first is "no meat, no pleasure".

The second is "salt is a must".

The third is "not afraid of spicy food, only afraid it is not spicy enough".

These three are all fairly strong tastes, while the fourth is a light one: a fondness for cabbage boiled in plain water.

The vertical axis is satisfaction.

These are the data, shown here as a scatter plot.

How do we handle data like these in ANOVA?

We decompose the distance between each observation and the grand mean into two parts: the distance between the observation and its group mean, and the distance between the group mean and the grand mean.
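Written out, the decomposition just described looks like this. The notation is an assumption made for illustration and is not taken from the slide: $y_{ij}$ is observation $i$ in group $j$, $\bar{y}_j$ is the mean of group $j$, $\bar{y}$ is the grand mean, $n_j$ is the size of group $j$, and $k$ is the number of groups.

```latex
y_{ij} - \bar{y} = (y_{ij} - \bar{y}_j) + (\bar{y}_j - \bar{y})

\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2
  = \sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2
  + \sum_{j=1}^{k} n_j\,(\bar{y}_j - \bar{y})^2
```

The cross term drops out because the deviations from a group mean sum to zero within each group, so the total sum of squares splits cleanly into a within-group part and a between-group part.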

Let's take the first group, "no meat, no pleasure", and break it down.

One part is the distance between the group mean and the grand mean.

Squaring these differences and summing them gives what we call SSB, the between-group sum of squares.

This is for the first group, "no meat, no pleasure".

Next is the second group, "salt is a must".

This again contributes to the between-group sum of squares.

To get the within-group sum of squares, we look at the distance between each observation in a group and that group's mean.

This is a data point in the second group.

Again, it is the distance between the observation and the group mean.

So in ANOVA, we first compute the sum of squared differences between the observations and their group means, and then the sum of squared differences between the group means and the grand mean.

These give us SSW and SSB, respectively.

We then divide each by its degrees of freedom to obtain the mean squares: the between-group mean square (MSB) and the within-group mean square (MSW).
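The steps above can be sketched in a few lines of Python. The satisfaction scores below are hypothetical, made up only to illustrate the arithmetic; they are not the survey data from the slide. Only the group sizes (4, 5, 6 and 2) follow the lecture.

```python
# Hypothetical satisfaction scores for four taste groups (NOT the survey data).
groups = {
    "no meat, no pleasure": [2, 3, 3, 4],
    "salt is a must": [3, 4, 5, 5, 3],
    "spicy": [5, 6, 6, 7, 5, 7],
    "light": [7, 9],
}

all_scores = [y for scores in groups.values() for y in scores]
n_total = len(all_scores)
k = len(groups)
grand_mean = sum(all_scores) / n_total

# Between-group sum of squares: group mean vs. grand mean, weighted by group size.
ssb = sum(
    len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
    for scores in groups.values()
)

# Within-group sum of squares: each observation vs. its own group mean.
ssw = sum(
    (y - sum(scores) / len(scores)) ** 2
    for scores in groups.values()
    for y in scores
)

# Total sum of squares, used to check the decomposition SST = SSB + SSW.
sst = sum((y - grand_mean) ** 2 for y in all_scores)

msb = ssb / (k - 1)        # between-group mean square, k - 1 degrees of freedom
msw = ssw / (n_total - k)  # within-group mean square, N - k degrees of freedom

print(f"SSB={ssb:.3f}  SSW={ssw:.3f}  SST={sst:.3f}")
print(f"MSB={msb:.3f}  MSW={msw:.3f}")
```

The degrees of freedom used here (k - 1 between groups, N - k within groups) are the standard ones for one-way ANOVA.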

We talked about this when we introduced ANOVA.

Under the null hypothesis, both the between-group mean square and the within-group mean square are unbiased estimators of the population variance; they simply estimate it in different ways.

What does the between-group mean square rely on? The central limit theorem, that is, the sampling distribution of the group means.

And what does the within-group mean square, MSW, rely on? A weighted average of variances.

In other words, each group gives us a sample variance.

If we want to use several sample variances to estimate the population variance, we take their weighted average.

Because this uses more information, the estimate is somewhat better than one based on a single sample variance.
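As a small check of the "weighted average of sample variances" idea, the sketch below pools the group variances with weights n_j - 1 (the usual choice, stated here as an assumption since the lecture does not name the weights) and compares the result with the within-group mean square computed directly. The scores are hypothetical.

```python
from statistics import variance

# Hypothetical scores for four groups (NOT the survey data).
groups = [
    [2, 3, 3, 4],
    [3, 4, 5, 5, 3],
    [5, 6, 6, 7, 5, 7],
    [7, 9],
]

# One sample variance per group.
sample_vars = [variance(g) for g in groups]

# Weight each variance by its degrees of freedom, n_j - 1.
weights = [len(g) - 1 for g in groups]
pooled_variance = sum(w * s2 for w, s2 in zip(weights, sample_vars)) / sum(weights)

# Within-group mean square computed directly as SSW / (N - k).
ssw = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
msw = ssw / (sum(len(g) for g in groups) - len(groups))

print(f"pooled variance = {pooled_variance:.4f}, MSW = {msw:.4f}")
```

With these weights the two numbers coincide, which is why the within-group mean square can be read as a pooled estimate of the population variance.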

Both approaches give an estimate of the population variance.

If the sample means are far apart, that is, if the null hypothesis is violated, the between-group mean square will be large.

If the sample means are close together, that is, if the samples all come from a common population, then the two estimates of the population variance, one from the between-group mean square and one from the within-group mean square, will be close to each other.

Their ratio is the F value we end up with, and in that case it will be close to 1.

If F is close to 1, we say there is no significant difference among the group means.

If F is much greater than 1, we say there is a significant difference among the group means.

Of course, we do not yet know exactly which two groups differ.

In that case we reject H0.
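To see how the F ratio responds to the spread of the group means, here is a minimal sketch. The function name `anova_f` and both data sets are hypothetical, invented for illustration; the two data sets have the same within-group scatter and differ only in how far apart the group means are.

```python
def anova_f(groups):
    """Return F = MSB / MSW for a list of groups (one-way ANOVA)."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))


# Hypothetical data: group means far apart (3, 4, 6, 8).
far_apart = [[2, 3, 3, 4], [3, 4, 5, 5, 3], [5, 6, 6, 7, 5, 7], [7, 9]]

# Hypothetical data: same within-group scatter, means nearly equal (5, 5, 5, 6).
close_together = [[4, 5, 5, 6], [4, 5, 6, 6, 4], [4, 5, 5, 6, 4, 6], [5, 7]]

f_far = anova_f(far_apart)
f_close = anova_f(close_together)
print(f"F (means far apart)      = {f_far:.2f}")
print(f"F (means close together) = {f_close:.2f}")
```

The first F comes out far above 1 and the second below 1. In practice, "much greater than 1" is judged by comparing F with the F distribution on (k - 1, N - k) degrees of freedom rather than by eye.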

That is what we discussed earlier in ANOVA.

Now that we have moved on to regression analysis, has the problem changed in any essential way?

Let's look at the data again.

The data in this plot have not changed; only the independent variable has.

The independent variable used to be taste preference; now it is how much salt the chef put in the dishes.

The amount of salt is a continuous variable, and a continuous variable can be analyzed with regression.

Now, if we want to draw a linear regression line, we can just about make one out running from the bottom left to the top right.

You can also see that when X is 1, Y has four values; when X is 2, five values; when X is 3, six values; and when X is 4, two values.

This is what we said earlier about the probabilistic model: for a given value of X, the value of Y is not fixed but follows a probability distribution.

In this example, when X is 1, the values of Y can be treated as a sample, just as in ANOVA.

In other words, when X is 1 we draw one sample of Y; when X is 2 we draw another; and likewise for X equal to 3 and 4.

We can compute the variance of each of these samples, so here too we can obtain a variance estimate from each sample.
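Treating the Y values at each X as a separate sample can be sketched as follows. The Y values are hypothetical (not the survey data); only the counts per X value (4, 5, 6 and 2) come from the lecture.

```python
from collections import defaultdict
from statistics import mean, variance

# X takes the values 1..4 with 4, 5, 6 and 2 observations, as in the lecture.
xs = [1] * 4 + [2] * 5 + [3] * 6 + [4] * 2
# Hypothetical satisfaction scores (NOT the survey data).
ys = [2, 3, 3, 4, 3, 4, 5, 5, 3, 5, 6, 6, 7, 5, 7, 7, 9]

# Collect the Y values observed at each X: one "sample" per X value.
by_x = defaultdict(list)
for x, y in zip(xs, ys):
    by_x[x].append(y)

# Size, mean and sample variance of Y for each value of X.
summary = {x: (len(v), mean(v), variance(v)) for x, v in sorted(by_x.items())}

for x, (n, m, s2) in summary.items():
    print(f"X={x}: n={n}, mean of Y={m:.2f}, sample variance of Y={s2:.2f}")
```

Each row is an empirical stand-in for the conditional distribution of Y at that X: its mean plays the role of a group mean, and its variance is one of the sample variances that get pooled.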

So is there a way to compute an MS value analogous to the between-group mean square in ANOVA?

Notice that in regression analysis we draw a regression line.

This line represents the expectation of Y for a given X, which we can also think of as the mean of Y for a given X.

Which mean exactly? The mean of the conditional distribution of Y given X.

This predicted value plays much the same role as the group mean in ANOVA.

In other words, if we treat the predicted values in regression as group means, we can build an ANOVA table for regression data just as we did in ANOVA.

Squaring and summing the differences between the predicted values and the grand mean gives a sum of squares analogous to the between-group SSB; we call it the regression sum of squares, SSR.

Squaring and summing the differences between each observation and its predicted value gives a sum of squares analogous to the within-group SSW; we call it the residual sum of squares, SSE.

In essence they are the same thing; we just use different symbols.
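The parallel with ANOVA can be checked numerically. The sketch below fits an ordinary least-squares line and computes SSR and SSE from the predicted values; the data are hypothetical (not the survey data), and least squares is assumed as the fitting method since the lecture does not specify one at this point.

```python
# X takes the values 1..4 with 4, 5, 6 and 2 observations, as in the lecture.
xs = [1] * 4 + [2] * 5 + [3] * 6 + [4] * 2
# Hypothetical satisfaction scores (NOT the survey data).
ys = [2, 3, 3, 4, 3, 4, 5, 5, 3, 5, 6, 6, 7, 5, 7, 7, 9]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n  # grand mean of Y

# Ordinary least-squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Predicted value of Y for each observation (the analogue of a group mean).
y_hat = [intercept + slope * x for x in xs]

# SSR: predicted values vs. grand mean (analogue of SSB).
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
# SSE: observations vs. their predicted values (analogue of SSW).
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
# SST: observations vs. grand mean.
sst = sum((y - y_bar) ** 2 for y in ys)

print(f"slope={slope:.3f}  intercept={intercept:.3f}")
print(f"SSR={ssr:.3f}  SSE={sse:.3f}  SST={sst:.3f}")
```

For a least-squares line with an intercept, SST equals SSR plus SSE, mirroring SST = SSB + SSW in ANOVA.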

This means we have multiple samples, and hence multiple sample variances.

We can use the weighted average of these sample variances to estimate the population variance.

With multiple samples we also have a distribution of sample means, where each sample mean is the predicted value of Y for a given X.

Do these sample means differ from one another?

If they differ significantly, what does that mean? It means the slope of the regression line is not zero.

If the predicted values do not differ significantly, then however X changes, the predicted value of Y stays the same.

In that case the regression line is of no real use: X cannot predict Y.

Following this line of thought, we can obtain an ANOVA table like the one in ANOVA, except that it now appears in regression.

Once the predicted values replace the group means, SSW becomes SSE and SSB becomes SSR.

The degrees of freedom change as well.

In the end we again obtain an F value.

There are also two mean squares here, MSR and MSE: the regression mean square and the residual mean square.

Their meaning is the same as before: under the null hypothesis, both are unbiased estimators of the population variance.

If F is very large, the predicted values are far apart, and you can imagine that the regression line is clearly not horizontal.

If MSR and MSE are close in value, we can take the slope of the regression line to be roughly zero, because the predicted values are very close together.
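A minimal sketch of the regression F ratio, under stated assumptions: simple linear regression with one predictor, fitted by least squares, so that the regression has 1 degree of freedom and the residual has n - 2. The function name `regression_f` and both data sets are hypothetical.

```python
def regression_f(xs, ys):
    """Return F = MSR / MSE for a simple least-squares regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * x for x in xs]
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # regression sum of squares
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # residual sum of squares
    msr = ssr / 1        # one predictor: 1 degree of freedom
    mse = sse / (n - 2)  # n - 2 residual degrees of freedom
    return msr / mse


xs = [1] * 4 + [2] * 5 + [3] * 6 + [4] * 2

# Hypothetical data with a clear upward trend.
ys_sloped = [2, 3, 3, 4, 3, 4, 5, 5, 3, 5, 6, 6, 7, 5, 7, 7, 9]
# Hypothetical data with almost no trend.
ys_flat = [4, 5, 5, 6, 4, 5, 6, 6, 4, 4, 5, 5, 6, 4, 6, 5, 7]

f_sloped = regression_f(xs, ys_sloped)
f_flat = regression_f(xs, ys_flat)
print(f"F (clear trend)  = {f_sloped:.2f}")
print(f"F (nearly flat)  = {f_flat:.2f}")
```

The trending data give an F far above 1, while the nearly flat data give an F below 1, matching the description of a line that is or is not clearly different from horizontal.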

So that is the path from ANOVA to regression.

The idea behind both is the same: we estimate the population variance in two ways, one based on the distribution of sample means and one based on the weighted average of sample variances, and check whether the two estimates differ noticeably.

If they do, we reject H0.

If they do not, we do not reject H0.
