Current course topic: Learn Statistics with Ease > Chapter 9 Correlation and Regression Analysis > 9.5 The application of regression equation > 9.5.1 Regression analysis: model evaluation 回归分析:模型评价
好 那么这节我们来说
Well, in this section let’s talk about
回归方程的解释力
the explanatory power of the regression equation
就像我们刚才
Just as we said
在上一节里面我们说的
in the previous section
我们可以把这条回归线画出来
we can draw the regression line
而且我们知道
Moreover, we know
我们画出来的这条回归线
the regression line we draw
是x和y它们之间关系的
is the line of best fit
最好的一条拟合线
for the relation between x and y
可是这个x到底能够解释y多少
However, exactly how much of y can x explain?
其实我们还是不知道的
We still don’t know
因为对于任何一个x和y
This is because for any x and y
我们都可以画出一条这样的线
we can always draw one such line
那么不同的x
but the explanatory power over y is different
其实对于y的解释力是不同的
across different x
那么有的x可以解释y的多一点
Some x can explain y a bit more
有的x可以解释的可能就要少一点
while some x may explain y a bit less
就像我们在前面
Just as we previously
我们讨论过一个例子
discussed in an example
我们说智商
we said IQ
然后还有你的高考当天的
and the traffic and weather conditions
交通的状况 天气的状况
on the very day of your college entrance examination
那这些都可以预测你的高考成绩
might predict your score in the examination
那么究竟是哪个自变量
So exactly which independent variable
能够解释的好呢
explains it better?
我们就要比较这些x的解释力的大小
We need to compare the explanatory power of these x variables
那我们怎么才能够计算到
So how can we figure out
这个x的解释力的大小呢
the magnitude of the explanatory power of some x
我们之前在介绍回归方程的时候
While introducing the regression equation before
我们说观测值和均值的距离
we said we could break down
可以对它进行分解
the distance between the observed value and the mean
可以分解成两块
into two parts
一块是观测值和预测值的距离
One is the distance between the observed value and the predicted value
另外一块是预测值和均值的距离
the other being the distance between the predicted value and the mean
那么在最左边的这块
For the leftmost part
他们的这个平方和
the sum of squares
也就是观测值和均值的差的平方和
namely the sum of squared differences between the observed values and the mean
我们把它叫做SST
is called SST
叫做和方 总的和方
the total sum of squares
那么在后面这个式子
In the expression that follows
当中中间这块
the component in the middle
也就是说观测值和预测值
namely the sum of squares of differences
差的平方和
between the observed values and the predicted values
我们把它叫做残差的平方和
is called the residual sum of squares
也就是SSE
namely SSE
然后预测值和均值差的平方和
Next, the sum of squared differences between the predicted values and the mean
我们把它叫做SSR
is called SSR
那么就叫回归平方和
namely the regression sum of squares
其实这个式子来说
Actually
留心一点的同学可能会发现
careful students may find
如果说我们把这个式子
if we square this formula
尤其是等式右端的式子
particularly the expression
做了平方以后
on the right side of the equality
那如果把它的展开以后
and after expanding it
其实还会有一个乘积项
there will actually be a product term
也就是观测值减去预测值
namely the observed value minus the predicted value
再乘以预测值减去均值
times the predicted value minus the mean
那这个已经证明过
It has been proved
说这个乘积项为零
that this product term sums to zero
所以去分解起来就会更方便
so the decomposition becomes simpler
那我们就可以看到
Then we can see
那么就可以得到
and thus derive
总的平方和
the total sum of squares
等于残差平方和
equals the residual sum of squares
加上回归平方和这样一个式子
plus the regression sum of squares
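The decomposition described here can be written out as follows. This is a sketch only: the symbols $y_i$ (observed value), $\hat{y}_i$ (predicted value) and $\bar{y}$ (mean of y) are notation assumed for illustration, since the slide formulas are not part of the transcript.

```latex
\begin{align*}
y_i - \bar{y} &= (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \\
\sum_{i=1}^{n} (y_i - \bar{y})^2
  &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
   + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
   + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
SST &= SSE + SSR + 0
\end{align*}
```

The cross-product term vanishes when the line is fitted by least squares with an intercept, which is why only SSE and SSR remain on the right-hand side.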
那么既然SSE+SSR
Since SSE + SSR
合起来等于SST
equals SST
那么SSE越小
the smaller the value of SSE
SSR的值就会越大
the greater the value of SSR
那么SSE越大
the greater the value of SSE
相反SSR的值就会越小
conversely, the smaller the value of SSR
我们是希望说
What we expect is that
如果这个x对y的解释力比较好的话
if this x has better explanatory power over y
SSR的值会比较大
the value of SSR would be greater
那么我们怎么找到一个这样的指标
So how can we find an index
来标记说SSR究竟有多大呢
that shows just how large SSR is
那么我们可以直接用SSR
We can obtain an index by
除以SST得到了一个指标
dividing SSR directly by SST
叫R的平方
which is called the square of R
R的平方在不同的教材当中的叫法
the square of R may be called differently
可能会有差别
in different textbooks
比如说有的书叫判定系数
For example, in some books, it is called the coefficient of judgment
也有的书叫确定系数
while in other books it is called the coefficient of determination
那么他们都指的是同样的一个内容
But they refer to the same thing
指的是R的平方
namely the square of R
那么R的平方的值
The value of the square of R
是在零和一之间的
ranges between 0 and 1
R方越大
The greater the value of the square of R
表示x对y的解释能力也就会越强
the greater the explanatory power of x over y
我们可以把这个R方的式子
We can also expand
把它展开来看一看
the formula for the square of R and have a look
那么R方等于SSR除以SST
the square of R equals SSR divided by SST
那么就是我们现在看到
as shown by the fraction
这个分式的样子
we are seeing now
如果我们把这个式子
If we expand this formula
一层一层的展开
layer by layer
那么最后我们可以展开成什么呢
what will it look like finally
就是我们在最后一行
The last row
它等于括号的平方
equals the square of the bracketed term
括号里边是什么呢
What is in the brackets
括号里边是个分式
It is a fraction
分式的分子是(公式如上)
whose numerator is (the formula as above)
也就是x和y各自的离差乘积和
namely the sum of products of the respective deviations of x and y
而分母上
and whose denominator
是x和y的总的和方再开根号
is the square root of the product of the total sums of squares of x and y
而这个式子就是在它的平方里边
And this expression inside the brackets,
括号里边的这个式子
the one being squared,
它的公式
is actually
其实就是r的公式
the formula for r
也就是我们说的相关系数的公式
namely the formula for the coefficient of correlation we have mentioned
所以r的平方就等于
So the square of R equals
相关系数的平方
the square of the coefficient of correlation
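A minimal sketch of this identity, using made-up toy data that is not from the lecture. The variable names (`s_xx`, `s_xy`, `b0`, `b1` and so on) are chosen here for illustration. It fits the simple regression line by least squares, computes SST, SSE and SSR, and compares SSR/SST with the squared correlation coefficient.

```python
# Toy data, invented purely for illustration (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Sums of squared deviations and of deviation cross-products.
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_yy = sum((yi - y_mean) ** 2 for yi in y)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

# Least-squares slope and intercept of the simple regression line.
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]

sst = s_yy                                              # observed vs mean
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # observed vs predicted
ssr = sum((fi - y_mean) ** 2 for fi in y_hat)           # predicted vs mean

r_squared = ssr / sst                 # coefficient of determination
r = s_xy / (s_xx * s_yy) ** 0.5       # correlation coefficient of x and y

print(f"SST={sst:.3f}  SSE={sse:.3f}  SSR={ssr:.3f}")
print(f"R^2={r_squared:.3f}  r^2={r ** 2:.3f}")
```

For this toy data both quantities come out to 0.6, and SSE + SSR reproduces SST up to floating-point rounding.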
那么这个地方
Here
我们要稍微的做一点点澄清
we need to make a small clarification
就是如果说
If we
我们把那个大R的平方的
drop the square
平方去掉的话
from the square of R
那么这个R
then R
其实有一个更正式的名字
actually has a more formal name:
叫做复相关系数
coefficient of multiple correlation
复相关系数其实指的是说
The coefficient of multiple correlation is
这个值是y的观测值
the correlation between the observed value
和预测值之间的相关
and the predicted value of y
如果有多个自变量的话
In the case of multiple independent variables
这个预测值就变成了
the predicted value becomes
多个x的一种线性组合
a linear combination of multiple x
那么当只有一个自变量的时候
In the case of only one independent variable
那么这个y的观测值
the coefficient of correlation between the observed value
和预测值的相关系数
and the predicted value of y
也就是这个大R的值
namely the value of R
也就等于y和x的相关系数
equals the coefficient of correlation between y and x
也就等于这个小r
and thus equals r
也就是说
In other words
那么只有当
only in the case where
只有一个自变量的时候
there is only one independent variable
大R才等于小r
R equals r
如果有多个自变量的话
whereas in the case where there are multiple independent variables
大R就不等于小r了
R no longer equals r
因为这个时候
because
会存在着多个小r
there will be several values of r
因为有几个自变量
After all, the number of coefficients of correlation between independent variables and y
就存在着几个自变量跟y的相关系数
is equal to the number of these independent variables
所以这个地方大家注意到就可以了
Just take note of this point
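The distinction between R and r can be sketched with two independent variables. The data below are invented for illustration, and the helper names (`mean`, `cross`, `corr`) are hypothetical. The two slopes are obtained by solving the 2x2 normal equations on centered data, and R is then computed as the correlation between the observed and predicted values of y.

```python
def mean(v):
    return sum(v) / len(v)

def cross(u, v):
    """Sum of products of the deviations of u and v from their means."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def corr(u, v):
    """Pearson correlation coefficient of u and v."""
    return cross(u, v) / (cross(u, u) * cross(v, v)) ** 0.5

# Made-up data with two independent variables (illustration only).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3, 4, 6, 7, 10, 9]

# Least-squares slopes from the centered normal equations (Cramer's rule).
s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
s1y, s2y = cross(x1, y), cross(x2, y)
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
b0 = mean(y) - b1 * mean(x1) - b2 * mean(x2)

# The predicted value is a linear combination of x1 and x2.
y_hat = [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

big_r = corr(y, y_hat)              # coefficient of multiple correlation R
r1, r2 = corr(y, x1), corr(y, x2)   # one simple r per independent variable

sst = cross(y, y)
ssr = sum((f - mean(y)) ** 2 for f in y_hat)

print(f"R={big_r:.4f}  r1={r1:.4f}  r2={r2:.4f}")
print(f"R^2={big_r ** 2:.4f}  SSR/SST={ssr / sst:.4f}")
```

Here there is one R but two different r values, and the square of R still matches SSR/SST.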
那么R的平方我们还可以继续展开
We can continue to expand the square of R
我们看到R的平方可以等于b的平方
to find that the square of R can equal the square of b
乘以S_x的平方除以S_y的平方
times the square of S_x divided by the square of S_y
也就是说R的平方根回归系数
In other words, the square of R bears some relation to
也有一定的关系
the coefficient of regression
或者说复相关系数
or coefficient of multiple correlation
等于回归系数乘以S_x比上S_y
It equals the coefficient of regression times S_x over S_y
S_x是x的标准差
where S_x is the standard deviation of x
S_y是y的标准差
and S_y is the standard deviation of y
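The link between the square of R and the regression coefficient can be sketched for the one-variable case as follows. The notation ($\hat{y}_i = a + b x_i$, sample size $n$, and standard deviations using $n - 1$) is assumed here for illustration; the step relies on the least-squares line passing through $(\bar{x}, \bar{y})$, so that $\hat{y}_i - \bar{y} = b(x_i - \bar{x})$.

```latex
\begin{align*}
R^2 = \frac{SSR}{SST}
    &= \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
     = \frac{b^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
    &= b^2 \, \frac{(n-1)\, S_x^2}{(n-1)\, S_y^2}
     = b^2 \, \frac{S_x^2}{S_y^2}
\end{align*}
```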
好 那么知道了R方的计算
Well, now that we know how to calculate the square of R
那么我们也计算到了b_0和b_1的值
and have worked out the values of b_0 and b_1
那么我们就可以计算到SSR SSE
we can figure out SSR and SSE
当然首先我们也可以计算到
Of course, we can first figure out
SST的值
the value of SST
这样的话我们就可以回到我们
This way we can return to
回归的方差分析表的部分
the analysis of variance (ANOVA) table for regression
当我们可以看到
We can see
在回归的方差分析表当中
that the ANOVA table for regression
那么跟方差的分析表是很像的
is very similar to the ordinary ANOVA table
同样的是有和方 自由度
in that both list the sum of squares, degrees of freedom,
均方和F这么几个值
mean square, and F
那么SSR
As for SSR
对 我们再重复一遍
Well, let’s reiterate
SSR是预测值和均值的差的平方和
SSR is the sum of squared differences between the predicted values and the mean
SSE是观测值和预测值的
whereas SSE is the sum of squares of differences
差的平方和
between the observed values and the predicted values
那么SSE越大就说明
The greater the value of SSE
x对y的解释力越差
the lower the explanatory power of x over y
那么SSE的值越小
A smaller value of SSE
那么就说明说
means
x对y的解释力就会越大
greater explanatory power of x over y
因为SSR的值就会越大
since the value of SSR is greater
那么SST是观测值
SST is the sum of squares of differences between
和均值的差的平方和
the observed values and the mean
这个值在数据一定的时候
When data are fixed
SST是一个恒定的一个值
the value of SST is a constant
那么第二列是它的自由度
In the second column are degrees of freedom
那么SSR的自由度
The degrees of freedom of SSR
也就是回归的自由度是1
namely the regression degrees of freedom, is 1
为什么是1呢
Why is it 1
你看 我们有两个参数
You see, we have two parameters:
一个是a 一个是b
One is a and the other is b
那么a和b这两个参数
Given both parameters a and b
知道了b就知道了a
once b is known a is also known
所以能够自由的变来变去的值
How many parameters are there
参数的个数是几个呢
whose value can vary freely
只有一个
Only one
那么残差是n-2
For the residual, it is n - 2
那为什么是n-2呢
Why is it n - 2
因为我们有a和b这么两个值
Because we have two values, a and b
基本上有了a和b这么两个值以后
and basically, once these two values, a and b, are fixed
那么残差可以自由变来变化的个数
the number of residuals that can vary freely
就变成了是n减去它的参数个数个
becomes n minus the number of parameters
那么最后是SST
Finally, there is SST
因为我们在计算SST的时候
When we calculate SST
需要用到总体的均值
we need to use the overall mean
那么总体均值确定以后
Once the overall mean has been determined
那么可以自由变来变去的
the number of observed values
观测值的个数就变成了n-1了
that can vary freely becomes n - 1
那么均方 也就是MS
Next is the mean square,
那么均方的计算
namely MS. Its calculation is the same
还是跟我们之前
as what we previously
在方差分析里面介绍的是一样的
introduced in analysis of variance
那么均方就等于和方除以自由度
The mean square equals the sum of squares divided by the degrees of freedom
所以MSR等于SSR除以一
so MSR equals SSR divided by 1
那么MSE等于SSE除以n-2
and MSE equals SSE divided by n - 2
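The sum of squares, degrees of freedom and mean square columns described here can be sketched as a small table. The data are made up for illustration (not from the lecture), and the layout of the printed table is an assumption rather than the one shown on the slide.

```python
# Toy data, invented purely for illustration (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares fit of the simple regression line.
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]

# Sums of squares.
ssr = sum((fi - y_mean) ** 2 for fi in y_hat)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
sst = sum((yi - y_mean) ** 2 for yi in y)

# Degrees of freedom for simple regression: 1, n - 2 and n - 1.
df_reg, df_err, df_tot = 1, n - 2, n - 1

# Mean square = sum of squares / degrees of freedom.
msr = ssr / df_reg
mse = sse / df_err

rows = [
    ("Regression", ssr, df_reg, f"{msr:.3f}"),
    ("Residual", sse, df_err, f"{mse:.3f}"),
    ("Total", sst, df_tot, ""),
]
print(f"{'Source':<12}{'SS':>8}{'df':>4}{'MS':>8}")
for source, ss, df, ms in rows:
    print(f"{source:<12}{ss:>8.3f}{df:>4}{ms:>8}")
```

The regression and residual degrees of freedom add up to the total degrees of freedom, just as SSR and SSE add up to SST.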
那么这个是均方
That is the mean square
其实这里的均方MSR跟MSE
Actually, the mean squares MSR and MSE
都是对于总体方差的无偏估计值
are both unbiased estimators of population variance
那么如果说这条回归线
If this regression line
是有用的一条回归线
is a useful one,
那么这里的x可以预测y
so that x can predict y here
那么y的值会随着x的值
and the value of y
变化而变化的话
varies with the value of x,
而是这些预测值就会各不相同
then the predicted values will differ from one another
预测值之间的差距就会比较大
and the differences between them will be fairly large
那么这个时候MSR的值
Thus at this moment the value of MSR
就会比较大
will be fairly large
那么MSR的值越大
The greater the value of MSR,
就越说明这条回归线
the stronger the indication that
它的b_1的值应该不为零
the value of b_1 of the regression line is not zero
那么我们用一个指标来标记它们
So we use an index to capture this:
用F F就等于MSR比上MSE
F, where F = MSR/MSE
如果H0为真的时候
If H0 is true
F的值应该是在1的左右
the value of F should be around 1
F的值越大
The greater the value of F
就越说明x对y的撬动的力量
the greater the power of leverage
会比较大
of x on y
那么这条回归线就越不会
and the less likely the regression line is to
是一条水平的线
be a horizontal line
那么怎么去检验这个F的值大小呢
So how do we test whether the value of F is large enough
我们之前在介绍方差分析的时候
Actually, we have discussed this problem before
其实已经讨论过这个问题
while introducing analysis of variance
那么我们要去查F的表
We need to look up the F table
F的表它有两个自由度
which is indexed by two degrees of freedom
那么在阿尔法等于0.05的时候
When α=0.05
我们就可以查到
we can look up,
对应的两个自由度
for the two corresponding degrees of freedom,
那么它的临界值是多少
what the critical value is
如果F值大于这个临界值
If the F value is greater than the critical value
那我们就去拒绝H0
then we reject H0
说x可以预测y
and say that x can predict y
说这条回归线是有用的
and that the regression line is useful
那么如果说小于这个F的临界值
If it is smaller than the critical value of F
我们就接受H0
then we accept H0
说那这条回归线其实可能
and say that the regression line may actually
对于y没有什么很明显的预测作用
have no significant predictive effect on y
那我们建立这个回归方程
The regression equation we set up
可能说明不了什么问题
may not tell us anything
我们没有发现x和y之间的关系
as we have not found any relation between x and y
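The decision rule described above can be sketched as follows. The data are invented for illustration, and the critical value 10.13 for α = 0.05 with degrees of freedom (1, 3) is copied from a standard F table rather than computed, mirroring the table lookup in the lecture. The name `F_CRITICAL` is hypothetical.

```python
# Toy data, invented purely for illustration (not from the lecture).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least-squares fit of the simple regression line.
s_xx = sum((xi - x_mean) ** 2 for xi in x)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean
y_hat = [b0 + b1 * xi for xi in x]

ssr = sum((fi - y_mean) ** 2 for fi in y_hat)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

# Mean squares and the F statistic with (1, n - 2) degrees of freedom.
msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

# Assumption: value read from a standard F table for alpha = 0.05,
# numerator df = 1, denominator df = n - 2 = 3.
F_CRITICAL = 10.13

reject_h0 = f_stat > F_CRITICAL
print(f"F = {f_stat:.2f}, critical value = {F_CRITICAL}")
if reject_h0:
    print("Reject H0: x appears to predict y.")
else:
    print("Do not reject H0: no clear evidence that x predicts y.")
```

With only five observations the F value of this toy example (4.5) stays below the critical value, so H0 is not rejected even though the fitted slope is positive.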
好 那我们这节就介绍到这里
Well, that is all for this section