Current topic: Learn Statistics with Ease > Chapter 9 Correlation and Regression Analysis > 9.4 Correlative relations of determination > 9.4.1 Regression analysis: basic ideas
Well, in this lecture let's continue with the probabilistic model.
To understand the probabilistic model in more depth, we need to start from analysis of variance (ANOVA),
and move from its basic ideas into regression analysis.
Here is an example on the slide.
It comes from a survey we once conducted on campus,
which asked everyone how satisfied they were with the campus canteen.
In this survey, X denotes the students' taste preferences.
There are four of them:
the first is "no meat, no joy";
the second is "salt is a must";
the third is "not afraid of spicy food, only afraid it is not spicy enough".
Students with these first three preferences all like strong flavors,
while the fourth group prefers light food:
their favorite is cabbage boiled in plain water.
The vertical axis is satisfaction.
This is what the data look like, and we show them as a scatter plot.
So how do we handle data like these in ANOVA?
We break down the distance between each observed value and the grand mean
into the distance between the observed value and its group mean,
plus the distance between the group mean and the grand mean.
Take the first group, "no meat, no joy", as an example.
One part is the distance between the group mean and the grand mean.
Summing the squares of these differences gives what we call SSB, the between-group sum of squares.
This is for the first group, "no meat, no joy".
Now look at another group:
this is the second group, "salt is a must".
Again, this is the between-group part.
For the within-group sum of squares,
we look at the distance between each observed value and its group mean, inside the group.
Here is one data point in the second group;
this too is a distance between an observed value and its group mean.
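The decomposition described above can be written out as a worked equation. The notation is an assumption, not taken from the slide: y_ij is the j-th observation in group i, n_i is the size of group i, \bar{y}_i is the mean of group i, and \bar{y} is the grand mean of all observations.

```latex
\begin{align}
y_{ij} - \bar{y} &= (y_{ij} - \bar{y}_i) + (\bar{y}_i - \bar{y}) \\
\underbrace{\sum_i \sum_j (y_{ij} - \bar{y})^2}_{\text{SST}}
  &= \underbrace{\sum_i \sum_j (y_{ij} - \bar{y}_i)^2}_{\text{SSW}}
   + \underbrace{\sum_i n_i (\bar{y}_i - \bar{y})^2}_{\text{SSB}}
\end{align}
```

The cross term vanishes when the first line is squared and summed, because the deviations from a group mean add up to zero within each group.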
So in ANOVA we first compute the sum of squared differences between the observed values and their group means,
and then the sum of squared differences between the group means and the grand mean.
These give us SSW and SSB, respectively.
After getting SSB and SSW, we divide each by its degrees of freedom to obtain the mean squares:
one is the between-group mean square (MSB), the other the within-group mean square (MSW).
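The steps just described (sums of squares, then division by degrees of freedom) can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's own computation: the satisfaction scores are hypothetical, and only the group (class) sizes 4, 5, 6 and 2 follow the counts mentioned later in the lecture. The function name `anova_decomposition` is likewise made up for this sketch.

```python
# Hypothetical satisfaction scores; only the group sizes (4, 5, 6, 2)
# follow the lecture. The actual survey values are not given.
groups = {
    "no meat, no joy": [2, 3, 3, 4],
    "salt is a must": [3, 4, 4, 5, 4],
    "the spicier the better": [4, 5, 5, 6, 6, 4],
    "plain boiled cabbage": [6, 8],
}


def mean(values):
    return sum(values) / len(values)


def anova_decomposition(groups):
    """Return sums of squares and mean squares for a one-way layout."""
    all_values = [y for ys in groups.values() for y in ys]
    grand_mean = mean(all_values)

    # Between groups: each group mean versus the grand mean,
    # counted once per observation in the group.
    ssb = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
    # Within groups: each observation versus its own group mean.
    ssw = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)
    # Total: each observation versus the grand mean.
    sst = sum((y - grand_mean) ** 2 for y in all_values)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return {
        "SSB": ssb,
        "SSW": ssw,
        "SST": sst,
        "MSB": ssb / df_between,
        "MSW": ssw / df_within,
    }


result = anova_decomposition(groups)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

SSB and SSW add up to SST, which is a quick way to check the arithmetic.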
We talked about this when we introduced ANOVA.
Both the between-group and the within-group mean square are unbiased estimators of the population variance (for the between-group one, provided the null hypothesis holds);
they just estimate it in different ways.
What does the between-group mean square rely on? The central limit theorem.
And the within-group mean square, MSW? It is a weighted average of variances.
That is, each group gives us a sample variance.
If we want to use several sample variances to estimate the population variance,
we take their weighted average and get one estimate of the population variance.
Because it uses more information,
this estimate is somewhat better than one based on a single sample variance.
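The weighted average mentioned here can be sketched as follows. The assumption in this sketch is that each sample variance is weighted by its degrees of freedom (sample size minus one), which is the usual pooled-variance weighting and reproduces MSW; the scores are hypothetical and `pooled_variance` is a made-up helper name.

```python
from statistics import variance

# Hypothetical scores for four groups (sizes 4, 5, 6, 2).
samples = [
    [2, 3, 3, 4],
    [3, 4, 4, 5, 4],
    [4, 5, 5, 6, 6, 4],
    [6, 8],
]


def pooled_variance(samples):
    """Weighted mean of sample variances, weights = n_i - 1."""
    weights = [len(s) - 1 for s in samples]
    variances = [variance(s) for s in samples]  # unbiased sample variances
    return sum(w * v for w, v in zip(weights, variances)) / sum(weights)


msw = pooled_variance(samples)
print("individual sample variances:", [round(variance(s), 3) for s in samples])
print("pooled estimate (MSW):", round(msw, 3))
```

The individual variances differ quite a bit, especially for the group with only two observations, while the pooled value draws on all of the observations at once.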
Now, both can serve as estimates of the population variance.
If the sample means are far apart from one another, that is, when the null hypothesis is violated, the between-group mean square will be large.
If the sample means are close to one another, that is, the samples all come from a common population,
then the two estimates of the population variance, obtained in different ways from the between-group and within-group mean squares, will be close.
Their ratio is the F value we finally obtain, and it will be close to 1.
If F is close to 1, we say there is no significant difference between the group means.
If it is much greater than 1, we say there is a significant difference between the group means,
and in that case we reject H0.
Of course, we still do not know exactly which two groups differ.
That is what we discussed earlier in ANOVA.
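A small simulation can make the behavior of F concrete. This is only a sketch under stated assumptions: the observations are drawn from normal distributions with a common standard deviation of 1, the group sizes are 4, 5, 6 and 2, and the population means (all 5.0 under the null hypothesis, 3, 4, 5 and 7 otherwise) are invented for illustration. The helper names are hypothetical.

```python
import random


def f_statistic(groups):
    """F = MSB / MSW for a list of groups (one-way layout)."""
    all_values = [y for g in groups for y in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    msb = ssb / (len(groups) - 1)
    msw = ssw / (len(all_values) - len(groups))
    return msb / msw


def simulate_mean_f(group_means, sizes=(4, 5, 6, 2), sd=1.0, reps=2000, seed=0):
    """Average F over many simulated data sets with the given population means."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        groups = [
            [rng.gauss(mu, sd) for _ in range(n)]
            for mu, n in zip(group_means, sizes)
        ]
        total += f_statistic(groups)
    return total / reps


# All groups share one population mean: the null hypothesis holds.
f_null = simulate_mean_f([5.0, 5.0, 5.0, 5.0])
# Population means differ: the null hypothesis is violated.
f_alt = simulate_mean_f([3.0, 4.0, 5.0, 7.0])

print("average F when H0 holds:", round(f_null, 2))
print("average F when H0 is violated:", round(f_alt, 2))
```

When the groups come from a common population the average F stays in the neighborhood of 1, and it grows well beyond that once the population means are pulled apart.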
Now that we come to regression analysis, has this problem changed in any essential way?
Let's look at the data again.
The data in this plot have not changed; only the independent variable has.
The independent variable used to be taste preference;
now it is how much salt the chef puts in the dishes.
The amount of salt is a continuous variable,
and a continuous variable can be analyzed with regression.
We can see that if we want to draw a linear regression line,
we can just about draw one running from the bottom left to the top right.
Here you can see that when X is 1, Y has four values;
when X is 2, Y has five values;
when X is 3, there are six values;
and when X is 4, there are two.
This is what we said earlier about the probabilistic model:
for a given value of X, the value of Y is not fixed but follows a probability distribution.
In this example, the values of Y at X = 1 can be treated as a sample, just as in ANOVA.
That is, at X = 1 we draw a sample of Y;
at X = 2 we draw another sample of Y;
and likewise at X = 3 and X = 4.
For each sample we can compute its variance,
so here too we can estimate the variance from the sample variances.
So is there a way to compute a mean square here, much as we computed the between-group mean square in ANOVA?
Notice that in regression analysis we actually draw a regression line,
and this line gives the expectation of Y for a given X,
which we can also think of as the mean of Y for a given X.
What kind of mean is it? It is the mean of the conditional distribution of Y given X.
This predicted value is a bit like the group mean we discussed in ANOVA.
In other words, if we treat the predicted values in regression as group means,
we can build an ANOVA table for regression data just as we did in ANOVA.
Taking each predicted value minus the overall mean, squaring and summing, gives a sum of squares like the between-group SSB;
we call it the regression sum of squares, SSR.
Taking each observed value minus its predicted value, squaring and summing, gives a sum of squares like the within-group SSW;
we call it the residual sum of squares, SSE.
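The two sums of squares can be sketched directly from their definitions. The assumptions here are that the regression line is fitted by ordinary least squares and that the data are hypothetical: X takes the values 1 to 4 with 4, 5, 6 and 2 observations as in the lecture, but the Y scores are invented. `fit_line` is a made-up helper name.

```python
# Hypothetical data: X = amount of salt (1 to 4), Y = satisfaction score.
# Only the counts per X value (4, 5, 6, 2) follow the lecture.
x = [1] * 4 + [2] * 5 + [3] * 6 + [4] * 2
y = [2, 3, 3, 4, 3, 4, 4, 5, 4, 4, 5, 5, 6, 6, 4, 6, 8]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_line(x, y)
y_hat = [intercept + slope * xi for xi in x]  # predicted values
y_bar = sum(y) / len(y)                       # overall mean of Y

# Predicted value minus overall mean: plays the role of SSB.
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
# Observed value minus predicted value: plays the role of SSW.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
# Observed value minus overall mean: the total sum of squares.
sst = sum((yi - y_bar) ** 2 for yi in y)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
```

As in ANOVA, the two parts add up to the total: SSR + SSE equals SST for a least-squares line with an intercept.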
In essence they are the same; we just use different symbols.
This means we again have multiple samples and their variances,
and we can use the weighted average of the sample variances to estimate the population variance.
With multiple samples we also get a distribution of sample means,
where each sample mean is the predicted value of Y for the given X.
So do the sample means differ from one another?
If they differ significantly, what does that mean?
It means the slope of the regression line is not zero.
If the predicted values do not differ significantly,
then however X changes, the predicted value of Y stays the same.
In that case the regression line is of little use:
X cannot predict Y.
Following the same line of thought, we get an ANOVA table like the one in analysis of variance, only this time it appears in regression.
Once the predicted values replace the group means, SSW becomes SSE and SSB becomes SSR,
and the degrees of freedom change as well.
In the end we again get an F value.
There are again two mean squares, MSR and MSE:
the regression mean square and the residual mean square.
They mean the same thing as before: both are unbiased estimators of the population variance (MSR only when the true slope is zero).
Now we can see that if F is very large,
the predicted values are far apart from one another,
and then you can imagine that the regression line is clearly not horizontal.
If MSR and MSE take similar values,
we can conclude that the slope of the regression line is probably close to zero,
because the predicted values are close to one another.
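The regression version of the ANOVA table can be sketched as below. The lecture does not spell out the degrees of freedom; this sketch assumes simple linear regression with one predictor, where the regression sum of squares has 1 degree of freedom and the residual sum of squares has n - 2. The data are hypothetical and `regression_anova` is a made-up name.

```python
# Hypothetical data: X = amount of salt (1 to 4), Y = satisfaction score.
x = [1] * 4 + [2] * 5 + [3] * 6 + [4] * 2
y = [2, 3, 3, 4, 3, 4, 4, 5, 4, 4, 5, 5, 6, 6, 4, 6, 8]


def regression_anova(x, y):
    """ANOVA table for a least-squares line y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (
        sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        / sum((xi - x_bar) ** 2 for xi in x)
    )
    intercept = y_bar - slope * x_bar
    y_hat = [intercept + slope * xi for xi in x]

    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

    df_regression = 1   # one slope parameter (assumption: single predictor)
    df_error = n - 2    # n observations minus intercept and slope
    msr = ssr / df_regression
    mse = sse / df_error
    return {
        "SSR": ssr,
        "SSE": sse,
        "df_regression": df_regression,
        "df_error": df_error,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
    }


table = regression_anova(x, y)
for name, value in table.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

For these invented scores, which rise with X, MSR comes out much larger than MSE, so F is far above 1 and the fitted line is clearly not horizontal.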
Well, that is how we get from ANOVA to regression.
The idea behind both is the same:
we estimate the population variance,
and then check whether the estimate based on the distribution of sample means
differs markedly from the estimate based on the weighted average of sample variances.
If it does, we reject H0.
If not, we accept H0, or more precisely, we fail to reject it.