9.3.1 Regression analysis: equation establishment 回归分析：方程建立 (Learn Statistics with Ease)

好各位同学
All right, fellow students

我们这节来介绍一些
In this section, we will present some

关于回归分析的一些基本的概念
basic concepts on regression analysis

那么回归分析一般来说
Generally speaking, regression analysis

至少会涉及到两个变量之间的关系
involves the relation between at least two variables

那我们先看一个例子
Let’s examine an example first

在这个例子里面大家可以看到
In this example, you can see that

我们有一个坐标轴
we have a pair of coordinate axes

在这个坐标轴上
On these axes

Y轴表示的是中国各省市的平均寿命
the Y-axis denotes the average lifespan in all municipalities and provinces of China

X轴表示的是每万人床位数
whereas the X-axis denotes the number of sickbeds per 10,000 people

那么这个数据就是说
For these data

我们至少要有两个变量
we need at least two variables

一个变量我们要搜集到
For one variable, we collect the average lifespan

中国31个省市的平均寿命
in the 31 municipalities and provinces of China

而第二列变量是相对应的
the other being the corresponding

每个省市的每万人床位数
number of sickbeds per 10,000 people per municipality/province

那么这二者之间
So what kind of relation

是一个什么样的关系呢
exists between the two

如果我们把这两个变量的值
If we put the values of both variables

放到同样的一个坐标轴上
onto the same pair of coordinate axes

那就像我们现在看到的这样
as we are seeing now

它们就可以画成一条
they can be drawn as a

画成散点图的样子
scatter plot

有的同学可能会想
Some students may wonder

那么我们为什么要考虑
why we should consider

这两个变量之间的关系呢
the relation between these two variables

那大家可以想一想
Think about it:

每万人床位数是个什么概念
what does the number of sickbeds per 10,000 people represent

每万人床位数也就意味着
The number of sickbeds per 10,000 people means

政府在公共卫生上的开支
the government's expenditure on public health

建立一家好一点的医院
After all, building a good hospital

需要的投入是很高的
requires very high investment

比如说像一个三甲的医院
For example, a Grade III, Class A hospital

那么它需要有各种各样医疗的设备
needs a wide variety of medical equipment

需要有病房大楼
It needs an inpatient building

需要有门诊的大楼
and an outpatient building

那么需要有一些基建的开支
and hence some spending on infrastructure

而且这样的医院
Moreover, such a hospital

还需要很多的医生和护士
also needs many doctors and nurses

所以我们可以想一想
So we can imagine

政府去建设这样的医院的话
it would actually take a huge investment

其实是需要很大的投入的
for the government to build such a hospital

那么建设这样的医院
So can building such hospitals

那对于我们的人均寿命
really extend our average lifespan

是不是真的能够延长我们的寿命呢

所以在这个例子当中主要想讨论
In this example, the main thing we want to discuss

那么这两者之间的关系
is the relation between the two

也就是说政府在公共医疗上的投入
namely the government's investment in public health care

和对人们寿命的影响
and its effect on people's lifespan

那么只是说在公共医疗上的投入
It is just that, for investment in public health care

我们选了一个非常具体的一个变量
we chose a very concrete variable

选择了每万人的床位数
namely the number of sickbeds per 10,000 people

如果我们观测这样一个散点图的话
If we look at such a scatter plot

有一条线
there is a line

可以从左下角
that can be faintly traced from the bottom left corner

隐隐约约的可以画到右上角去
to the top right corner

而如果我们画出来这条线的话
Once this line is drawn

大家可能就会想起来
you may recall

你们在高中的时候学过的一个知识
something you learned in senior high school

叫最小二乘法
called the least squares method

那么最小二乘法就可以帮助我们
The least squares method can help us

去建立起来这样的一条回归线
fit such a regression line

那对于这样一条回归线来说
Such a regression line

它有两个非常重要的参数
has two very important parameters

一个参数是它的截距
One parameter is its intercept

截距意味着什么呢
What does the intercept mean

意味着是说当X为零的时候
It is the Y value

那么这条回归线在Y轴上的
at the point where the regression line crosses the Y-axis

交叉的那个点的Y值
that is, when X is zero

那么另外一个非常重要的参数
The other very important parameter

是它的斜率
is its slope

斜率是说当X每变化一个单位
The slope indicates by how many units Y changes

Y变化多少单位
for each one-unit change in X

那么在回归分析当中
In regression analysis

经常的我们其实不是特别的
we often do not pay much

去重视截距的含义
attention to the meaning of the intercept

因为截距有的时候它的含义
since its meaning is sometimes

并不是特别清楚
not very clear

而我们实际上
What we actually

特别关注的是两个变量之间的关系
care about most is the relation between the two variables

也就是说我们会特别关心斜率
That is, we pay special attention to the slope

如果说斜率比较大的话
A large slope

那就好像是一个杠杆一样
works like a lever:

我们可以看到X对Y的撬动的力量
the leverage of X on Y

就会比较大
is large

如果说这斜率比较小
A small slope

那就意味着X对Y撬动的力量
means the leverage of X on Y

就比较小
is small

那如果说
If we express this

我们用一个符号来表示的话
in symbols

我们一般会写成
we generally write it as

（公式如上）
Ŷ = β₀ + β₁X (the formula shown on the slide)

β₀就是说的叫截距
where β₀ is the intercept

β₁是斜率
and β₁ is the slope

而Y hat
Y hat

上面加了一个尖
written with a hat (caret) on top

那Y hat指的是
refers to

Y的预测值
the predicted value of Y

那为什么是Y hat
Why do we use Y hat

而不是Y呢
instead of Y

对我们待会会再讨论这个问题
Well, we will discuss this question in a while

那么对于回归分析来讲
For regression analysis

我们其实就是要建立起来
what we actually need to do is establish

这样的一个回归方程
a regression equation of this kind

就是（公式如上）
namely Ŷ = β₀ + β₁X (the formula shown on the slide)

我们需要能够估计到
We need to be able to estimate

β₀和β₁的值
the values of β₀ and β₁
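
As a rough illustration of how the intercept and slope can be estimated, the sketch below applies the standard least squares formulas (slope = Sxy / Sxx, intercept = mean of Y minus slope times mean of X). The function name `fit_line` and the four data points are hypothetical and are not taken from the lecture; the points were chosen to lie exactly on y = 2 + 3x so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least squares estimates (b0, b1) for the line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sxy: sum of cross-deviations; Sxx: sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope: change in y per one-unit change in x
    b0 = y_bar - b1 * x_bar   # intercept: value of y_hat when x is zero
    return b0, b1


# Hypothetical data lying exactly on y = 2 + 3x (not the lecture's data)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

With real data the points would not lie exactly on the line, and the same formulas would return the line that minimizes the sum of squared vertical distances.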

那么通过这样一个回归方程
With such a regression equation

那么我们就可以实现一些目的
we can achieve some goals

第一个目的我们可以实现
The first goal we can achieve

把X和Y的关系
is to express the relation between X and Y

用一种量化的方式来表达
in a quantitative way

很多时候我们知道说
Often we know that

当X增加了时候 Y也会增加
as X increases, Y also increases

但是我们并不知道说
but we do not know

当X增加的时候 Y增加多少
how much Y increases as X increases

或者是说反过来
or, conversely

当X增加的时候 Y减少多少
how much Y decreases as X increases

那么如果我们想要
If we want to

得到一个非常确定的值
obtain a precise value

我们就需要用回归方程
we need to use a regression equation

用回归分析来实现
that is, regression analysis, to obtain it

另外我们可以检验有关
In addition, we can test theories about

X和Y之间关系的理论
the relation between X and Y

就像在上一节当中
Just like in the previous section

我们讨论了资源理论
we discussed the resource theory

和好基因理论
and the good gene theory

那么究竟哪个理论
Which theory

会更符合实际的情况呢
fits reality better

那我们需要用回归方程的方法
We need to use regression equations

来实现对于这个相关理论的
to carry out tests

一些检验
of these theories

那么第三我们可以测量
Third, we can measure

X和Y之间关系的强度
the strength of the relation between X and Y

X和Y之间的关系
Some relations between X and Y

有的关系可能是比较弱的
may be weak

有的关系可能是比较强的
and some may be strong

比如说我们每个同学
For instance, each of you

可能都参加过高考
has probably taken the college entrance examination

那你参加高考的时候
When you took the examination

影响你高考成绩的因素会有很多
there might be a myriad of factors influencing your score

比如说智商
such as IQ

比如说你高考当天的身体的状况
your physical condition on the day of the examination

比如说你高考当天的气温
the air temperature that day

比如说高考当天的交通堵塞的情况
and the traffic congestion that day

那么这些都有可能会影响到
All these could make a difference to

你的高考的成绩
your score in the college entrance examination

如果我们把这些变量都考虑进去
If we take all these variables into consideration

我们可能会发现说
we may find out

在这个里边
which of these variables

能够对你的高考成绩
has the strongest influence

影响力最强的那个变量是什么
on your score in the college entrance examination

那这样的话我们就可以实现说
In this way we can determine

这些变量他们和高考成绩之间的
for these variables and the college entrance examination score

关系的强度
how strong each relation is

究竟是哪个更强哪个更弱
which is stronger and which is weaker

那么这个也是回归分析
This is another function

可以实现的一个功能
that regression analysis can perform

最后我们可以实现预测
Finally, we can make predictions

就是在已知X值的条件下
that is, predict Y

对Y来实现预测
given a known value of X

而这个预测
And this

预测值就是这个Y hat
predicted value is Y hat

比如说我们之前在介绍
For instance, when we covered

时间序列分析的时候
time series analysis earlier

老师会介绍一种方法
the instructor introduced a method

叫做趋势方程法
called the trend equation method

趋势方程它的自变量是时间
The independent variable in a trend equation is time

而因变量是一些经济运行的指标
whereas the dependent variable is some economic indicator

如果我们通过趋势方程法
If, using the trend equation method

建立了某一个经济运行的指标
we have established, for some economic indicator

随着时间的变化而变化的
a regression equation describing

这样的一个回归方程的话
how it changes over time

那么我们就可以预测到下一年
then we can predict, for next year

或者到下两年
or the year after

那么这个经济运行的指标
what value this economic indicator

大概会达到一个什么样的值
will probably reach

这是回归分析的几个目的
These are several goals of regression analysis

那么在我们进一步的介绍之前
Before going further

我们想介绍两个基本的概念
we want to introduce two basic concepts:

一个概念叫确定模型
One is called the deterministic model

确定模型也是用函数的形式
A deterministic model is also expressed

来表示的
in the form of a function

那么在这种确定模型里面
In a deterministic model

每一个X值都对应着一个单一的Y值
every value of X corresponds to a single value of Y

比如说大家看到这个例子
Look at this example

某个实验室打算采购一批计算机
A laboratory plans to purchase a batch of computers

一台是6500块钱
The unit price is 6500 yuan

X是计算机的台数
If X represents the number of computers

那Y就是总花费
then Y is the total cost

那么在计算机的台数
It follows that the relation between the number of computers

和总花费之间的关系
and the total cost

就是Y等于6500X
is given by Y=6500X

那这样的话我们就可以看到
Thus we can see that

X有一个值
for each value of X

Y就一定有一个确定的值
Y must have a definite value

所以二者之间是一个
So there is a one-to-one correspondence

一一的对应关系
between the two

那么在这个里边
Here

更常见的例子是
more common examples are

比如说不同的度量单位之间的转换
conversions between different units of measurement

比如说对温度来说华氏和摄氏
say Fahrenheit and Celsius for temperature

那么对于一些质量的单位来说
or, for units of mass

比如说公斤和镑
kilograms and pounds

那么我们现在在PPT上看到的一个
What we see on the slide is the

是在华氏和摄氏之间的关系
relation between Fahrenheit and Celsius

那么华氏等于什么呢
What does Fahrenheit equal

等于9/5的摄氏加上32
It equals 9/5 of Celsius plus 32

那摄氏和华氏之间的关系
So the Celsius-Fahrenheit relation

也是一一对应的
is also a one-to-one correspondence

这是确定模型
This is a deterministic model
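
The two examples of a model in which each X gives exactly one Y can be sketched as plain functions. The function and constant names below are illustrative, not from the lecture; the unit price of 6,500 yuan is the lecture's figure, and the temperature function uses the standard conversion F = 9/5 × C + 32.

```python
UNIT_PRICE_YUAN = 6500  # price of one computer in the lecture's example


def total_cost(n_computers):
    """Total cost Y = 6500 * X: one X gives exactly one Y."""
    return UNIT_PRICE_YUAN * n_computers


def celsius_to_fahrenheit(celsius):
    """Standard unit conversion: F = 9/5 * C + 32."""
    return celsius * 9 / 5 + 32


print(total_cost(10))              # cost of ten computers
print(celsius_to_fahrenheit(100))  # boiling point of water in Fahrenheit
```

Calling either function twice with the same input always returns the same output, so there is nothing to estimate from data.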

那么确定模型
A deterministic model

其实是不需要去估计的
does not need to be estimated

因为确定模型基本上来说
since a deterministic model is basically

就是我们在确定它的时候
exact

是比较准确的
once we have specified it

那么回归分析
What regression analysis

其实它要解决的
actually deals with

还不是确定模型这样的问题
is not the deterministic model

那么回归分析要解决是
but

概率模型的问题
the probabilistic model

概率模型指的是说
A probabilistic model means that

当X取某个值的时候
when X takes a given value

Y的值是不确定的
the value of Y is not fixed

而是服从某一个概率分布
but follows a certain probability distribution

那么这个时候X和Y之间关系
In that case, the relation between X and Y

就叫概率模型
is called a probabilistic model

那么Y的值不确定
That the value of Y is not fixed

并不意味着是说
does not mean that

Y的值是随意的
the value of Y is arbitrary

而是说Y的值是服从某个概率分布
but that it follows a certain probability distribution

而这个概率分布就有它的期望
which has an expectation

还有它的方差
and a variance

比如说
For example

像我们刚才介绍的这个例子里边
in the example we have just introduced

当X是每万人病床数
where X denotes the number of sickbeds per 10,000 people

Y是平均寿命
and Y denotes the average lifespan

我们在散点图里面其实也可以看到
we can actually see in the scatter plot that

那么当X取一个值的时候
when X takes a given value

Y是有多个值和它对应
multiple values of Y correspond to it

那么如果我们
If we fit

建立起来一条回归线的话
a regression line

就像我们刚才做出的
like the one

这条回归线一样
we created just now

那我们可以得到β₀和β₁的值
then we can obtain the values of β₀ and β₁

Y hat等于68.065加上0.1641X
Y hat equals 68.065 plus 0.1641X

那这个时候其实你会看到
At this point you will notice that

这条线只是隐隐约约的存在着
this line is only faintly there

或者说即使我们把这条线
Or rather, even if we draw this line

用实线的形式画出来
as a solid line

那么也并没有任意的一个点
not a single point

或者说大部分的点
or at least most of the points

都不是正好落在这条回归线上的
will not fall exactly on this regression line

很多点都是
Many points

在这条回归线的上下波动
scatter above and below the regression line

那就会跟确定模型有很大的的不同
That is very different from a deterministic model

我们刚才说了
We have just said

说在这条回归线上
for this regression line

我们建立起来的回归方程
the regression equation we established

是Y hat等于68.065加上0.1641X
is Ŷ = 68.065 + 0.1641X

那如果对于任意的一个Y
So for an individual observed Y

这个方程要怎么写呢
how should this equation be written

任意的Y我们就要后面
For an individual Y, we need to add

再加上一个残差项
a residual term at the end

也就是说Yᵢ
namely Yᵢ

等于68.065加上0.1641X
equals 68.065 plus 0.1641X

再加是一个残差
plus a residual

这个残差是一个随机变量
where the residual is a random variable
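
To make the difference between the predicted value and an individual observation concrete, here is a small sketch. It reads the lecture's coefficients as 68.065 and 0.1641, which is an assumption about where the decimal points fall in the transcript; the province with 40 beds per 10,000 people and an observed lifespan of 75.0 years is hypothetical.

```python
B0 = 68.065  # intercept (assumed reading of the transcript's figure)
B1 = 0.1641  # slope (assumed reading of the transcript's figure)


def predict_lifespan(beds_per_10k):
    """Predicted value: Y_hat = B0 + B1 * X."""
    return B0 + B1 * beds_per_10k


def residual(observed_lifespan, beds_per_10k):
    """Residual e_i = Y_i - Y_hat_i, so that Y_i = B0 + B1 * X_i + e_i."""
    return observed_lifespan - predict_lifespan(beds_per_10k)


# Hypothetical province: 40 beds per 10,000 people, observed lifespan 75.0
y_hat = predict_lifespan(40)
e = residual(75.0, 40)
print(round(y_hat, 3), round(e, 3))
```

A positive residual means the observation lies above the regression line, and a negative one means it lies below.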

那也就是说我们通过这个回归方程
That is to say, we can obtain a predicted value

可以得到一个预测值
from this regression equation

而真实的观测值
while the actual observed value

是在这个预测值的上下波动的
fluctuates above or below this predicted value

而究竟波动多少
How much it fluctuates

我们是用eᵢ来测量的
is measured by eᵢ

eᵢ是一个随机变量
eᵢ is a random variable

那为什么对于回归分析
Why, for regression analysis

或者说对于概率模型来讲
or for a probabilistic model

一定会存在着一个eᵢ呢
must there always be an eᵢ

因为我们要知道
Because we need to recognize that

对于整个的现实世界来讲
in the real world as a whole

任意一个的原因
any single cause

都会产生多个结果
produces multiple outcomes

而任意的一个结果
and any single outcome

也都是由多个原因共同作用
is produced

所产生的
by multiple causes acting together

所以当我们在回归方程当中
So when, in the regression equation

我们放入了一个自变量
we include one independent variable

其实也就意味着
it actually means

我们放弃了很多很多其他的自变量
we have left out a great many other independent variables

那这些其他自变量
Have these other independent variables

是不是就放弃了他们对Y的影响呢
stopped affecting Y

对那当然是没有的
Of course not

因为只是说
It is only that

我们从认识世界的角度来看
from the standpoint of understanding the world

我们会尽量的希望
we try as far as possible

能够获得一个既简洁
to find a way of understanding it that is both concise

又有利的关于世界的途径
and useful

但是其实世界真实的运行关系
while the way the world really works

是非常复杂的
is very complex

那么有些变量对于因变量
The effect of some variables on the dependent variable

或者是对于Y的影响
that is, on Y

或者是非常的琐碎或者是非常的小
is either very minor or very small

或者是说在我们目前的测量
or else, in our current measurement

或者是对变量的考察当中
or in our examination of the variables

没有把它考虑进去
they were not taken into account

但是它仍然在发挥着作用
But they are still at work

那么这样的作用
and their influence

就构成了就是eᵢ的来源
is a source of eᵢ

这是eᵢ最大的一个来源
This is the largest source of eᵢ

那么还会有一些别的来源
Still there are some other sources

比如说如果我们去测量一些
For example, if we measure

动物的行为
animal behavior

那么动物的行为
animal behavior

有的时候会有一些随机性
sometimes involves some randomness

那并不一定完全的
and does not necessarily

去服从某一个规律的分布
follow a regular pattern completely

另外在测量的过程当中
In addition, the measurement process

也会有测量误差
involves measurement error

因为我们要用仪器去实现这个测量
Because we take the measurement with instruments

测量就是会有
the measured value will

或者跟那个变量真实的值
differ somewhat

会有一定的差异
from the true value of the variable

这也是eᵢ的来源之一
This is also one of the sources of eᵢ

那我们再来举一个例子
Let’s take another example

我们就会更深入的
in order to understand the probabilistic model

去理解这个概率模型
in more depth

比如说我们现在看到这个例子
The example we are now looking at

是一个人他的每周的收入
is the relation between an individual’s weekly income

和看电影开支之间的关系
and his or her spending on movies

现在有很多同学就非常喜欢的
Nowadays many students enjoy

去电影院看电影
going to the cinema to see a movie

那么去看电影的话
But when you go to see a movie

现在电影票有的也不是特别便宜
some tickets are not particularly cheap

如果你收入高的话
If you have a high income

可能就是支付
you may have no problem

这个电影的开支是没有问题的
paying for movies

那么如果是收入低的话
But if you have a low income

去支付电影的开支
paying for a movie

有的时候也会觉得有点心痛
can sometimes hurt a little

那么收入对于电影开支的影响
So what exactly is the effect

究竟是什么呢
of income on movie spending

那我们搜集了一些数据
We have collected some data

在这个数据当中
From these data

我们也画了一个散点图
we have drawn a scatter plot

而且我们也做出了一条回归线
and also fitted a regression line

我们可以看到这样一条回归线
We can see this regression line

那么在这条回归线当中
in which

Y的预测值等于
the predicted value of Y equals

13.92加上0.076X
13.92 plus 0.076X

也就是说它的截距是13.92
meaning its intercept is 13.92

而斜率是0.076
whereas its slope is 0.076

我们刚才其实说过一点
We mentioned a moment ago

说截距很多时候
that the intercept often

是没有很明确的意义的
has no clear meaning

那这个地方你就可以看到
Here you can see that

那么它的截距等于13.92
the intercept equals 13.92

意味着什么呢
What does this mean

意味着当X为零的时候
It means that when X is zero

它还会在看电影上花13.92元
the person would still spend 13.92 yuan on movies

这个就很难理解了
which is hard to interpret

所以它的意义其实不明确的
So its meaning is really unclear

那我们其实真正关心的是什么呢
So what do we really care about

真正关心的是0.076
What we really care about is 0.076

0.076意味着什么呢
What does 0.076 mean

意味着说当你的收入
It means that for each additional yuan

每增加一块钱
of income

你的看电影的开支就会增加七分钱
your spending on movies increases by about 7 fen (0.076 yuan)

那如果说按照一张电影票
If we take a movie ticket

30块钱来算的话
to cost 30 yuan

大家考虑一下
think about

那你的收入要增加多少
how much your income has to increase

你才会多去看一场电影呢
before you would see one more movie
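
The question above is a short piece of arithmetic on the slope, sketched here. The intercept 13.92, the slope 0.076 and the 30-yuan ticket price come from the lecture; the function name and the sample income of 900 are only for illustration.

```python
INTERCEPT = 13.92  # predicted movie spending when income is zero
SLOPE = 0.076      # extra movie spending per extra yuan of income
TICKET_PRICE = 30  # assumed price of one ticket, in yuan


def predicted_spending(income):
    """Predicted movie spending: Y_hat = 13.92 + 0.076 * X."""
    return INTERCEPT + SLOPE * income


# Income increase needed for predicted spending to rise by one ticket
extra_income_needed = TICKET_PRICE / SLOPE

print(round(predicted_spending(900), 2))
print(round(extra_income_needed, 1))
```

Under this fitted line, income has to rise by a little under 400 yuan for predicted movie spending to rise by the price of one ticket.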

好我们再过来看一下这个表
Now let's look at this table again

在这个表当中你可以看到
In this table you can see that

我们左边的一列是收入的数据
the left column is the income data

那么有的收入的数据
Some incomes

是九百有的是八百
are 900, some are 800

有的是六百
some are 600

也有的六百五
and some are 650

那么对应着具体的某一个收入
For any specific income

Y都有多个值和它对应
there are multiple corresponding values of Y

那么这种就是我们说的概率模型
This is what we mean by a probabilistic model

当X取值固定的时候
When the value of X is fixed

Y的取值并不固定
the value of Y is not fixed

Y的取值是服从某一个概率分布
but follows a certain probability distribution

而这个概率分布是有期望的
which has an expectation

这个期望就是我们在这个表的
This expectation is what we see in the last column

最后一列看到的
of this table

说当给定X的时候
namely, for a given X

Y的期望是多少
what the expectation of Y is

好那么我们在回归线上的
Now, for the regression line

在回归分析当中其实是这个
the idea in regression analysis is this:

在给定X的时候
for a given X

Y的期望是正好落在一条直线上
the expectation of Y falls exactly on a straight line

而观测值和这个期望之间的距离
and the distance between an observed value and this expectation

就是我们说的残差
is what we call the residual
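
The idea of an expectation of Y for each given X can be sketched by grouping observations by X and averaging. The data below are hypothetical and are not the values in the lecture's table; only the income levels 600, 650 and 800 echo the ones mentioned. Note that this sketch measures each deviation from the group mean, whereas the regression model additionally assumes those means lie on a straight line.

```python
from collections import defaultdict


def conditional_means(pairs):
    """Average of Y for each distinct value of X."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}


# Hypothetical (weekly income, movie spending) observations
data = [(600, 55), (600, 61), (650, 60), (650, 66), (800, 70), (800, 80)]

means = conditional_means(data)
# Deviation of each observation from the mean of Y at its own X
deviations = [(x, y - means[x]) for x, y in data]

print(means)
print(deviations)
```

Within each income level the deviations sum to zero, which is what it means for the group mean to play the role of the expectation.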
