
Linear regression


Main article: Least squares

Least-squares estimation

The original purpose of regression analysis is to estimate the model's parameters so as to achieve the best fit to the data. Among the various criteria for deciding what constitutes a best fit, least squares is by far the most widely used. The estimate can be written as:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y$$
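The estimate above can be computed directly. A minimal sketch with illustrative data (in practice, `np.linalg.lstsq` is preferred over forming the explicit inverse for numerical stability):

```python
import numpy as np

# Design matrix X: first column of ones provides the intercept term.
# The data values here are hypothetical, chosen only for illustration.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.1, 8.0])

# beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # [intercept, slope]
```

The same result can be obtained more robustly with `np.linalg.lstsq(X, y, rcond=None)`.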

Regression inference

For each $i = 1, \ldots, n$, let $\sigma^{2}$ denote the variance of the error term $\varepsilon$. An unbiased estimate of it is:

$$\hat{\sigma}^{2} = \frac{S}{n-p},$$

where $S := \sum_{i=1}^{n} \hat{\varepsilon}_{i}^{2}$ is the sum of squared errors (the residual sum of squares). The relationship between the estimate and the true value is:

$$\hat{\sigma}^{2} \cdot \frac{n-p}{\sigma^{2}} \sim \chi_{n-p}^{2}$$

where $\chi_{n-p}^{2}$ denotes a chi-squared distribution with $n-p$ degrees of freedom.

The solution of the normal equations can be written as:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.$$

This shows that the estimator is a linear combination of the dependent variable. Furthermore, if the observed errors are normally distributed, the parameter estimates follow a joint normal distribution. Under the present assumptions, the estimated parameter vector has the exact distribution:

$$\hat{\beta} \sim N\!\left(\beta,\ \sigma^{2}(X^{T}X)^{-1}\right)$$

where $N(\cdot)$ denotes the multivariate normal distribution.

The standard errors of the parameter estimates are:

$$\hat{\sigma}_{j} = \sqrt{\frac{S}{n-p}\left[(\mathbf{X}^{T}\mathbf{X})^{-1}\right]_{jj}}.$$

A $100(1-\alpha)\%$ confidence interval for the parameter $\beta_{j}$ can be computed as:

$$\hat{\beta}_{j} \pm t_{\frac{\alpha}{2},\,n-p}\,\hat{\sigma}_{j}.$$
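The standard errors and confidence intervals above can be sketched in a few lines. This assumes SciPy is available for the Student's $t$ quantile; the data is illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical data: intercept column plus one regressor.
X = np.array([[1.0, x] for x in [1, 2, 3, 4, 5, 6]])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.0])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
S = resid @ resid                        # residual sum of squares
sigma2_hat = S / (n - p)                 # unbiased variance estimate
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # sigma_hat_j for each j

alpha = 0.05                             # 95% confidence level
t_crit = stats.t.ppf(1 - alpha / 2, df=n - p)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se
```

Each interval `[lower[j], upper[j]]` covers the true $\beta_{j}$ with probability $1-\alpha$ under the normality assumption.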

The residuals can be expressed as:

$$\hat{\mathbf{r}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.$$

Univariate linear regression

Univariate linear regression, also known as simple linear regression (SLR), is the simplest yet widely applicable regression model. Its regression equation is:

$$Y = \alpha + \beta X + \varepsilon$$

To estimate the best-fitting (minimum-error) $\alpha$ and $\beta$ from a set of samples $(y_{i}, x_{i})$ (where $i = 1, 2, \ldots, n$), the method of least squares is usually employed; its objective is to minimize the residual sum of squares:

$$\sum_{i=1}^{n} \varepsilon_{i}^{2} = \sum_{i=1}^{n} \left(y_{i} - \alpha - \beta x_{i}\right)^{2}$$

The extremum is found by differentiation: take the first-order partial derivatives of the expression above with respect to $\alpha$ and $\beta$, and set each equal to zero:

$$\left\{\begin{array}{l} n\,\alpha + \sum\limits_{i=1}^{n} x_{i}\,\beta = \sum\limits_{i=1}^{n} y_{i} \\[2ex] \sum\limits_{i=1}^{n} x_{i}\,\alpha + \sum\limits_{i=1}^{n} x_{i}^{2}\,\beta = \sum\limits_{i=1}^{n} x_{i} y_{i} \end{array}\right.$$

This system of two linear equations can be solved by Cramer's rule, yielding the solutions $\hat{\alpha},\ \hat{\beta}$:

$$\hat{\beta} = \frac{n\sum\limits_{i=1}^{n} x_{i}y_{i} - \sum\limits_{i=1}^{n} x_{i} \sum\limits_{i=1}^{n} y_{i}}{n\sum\limits_{i=1}^{n} x_{i}^{2} - \left(\sum\limits_{i=1}^{n} x_{i}\right)^{2}} = \frac{\sum\limits_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum\limits_{i=1}^{n}(x_{i}-\bar{x})^{2}} = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)}$$

$$\hat{\alpha} = \frac{\sum\limits_{i=1}^{n} x_{i}^{2}\sum\limits_{i=1}^{n} y_{i} - \sum\limits_{i=1}^{n} x_{i}\sum\limits_{i=1}^{n} x_{i}y_{i}}{n\sum\limits_{i=1}^{n} x_{i}^{2} - \left(\sum\limits_{i=1}^{n} x_{i}\right)^{2}} = \bar{y} - \bar{x}\hat{\beta}$$

The residual sum of squares can then be written in closed form as:

$$S = \sum\limits_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2} = \sum\limits_{i=1}^{n} y_{i}^{2} - \frac{n\left(\sum\limits_{i=1}^{n} x_{i}y_{i}\right)^{2} + \left(\sum\limits_{i=1}^{n} y_{i}\right)^{2}\sum\limits_{i=1}^{n} x_{i}^{2} - 2\sum\limits_{i=1}^{n} x_{i}\sum\limits_{i=1}^{n} y_{i}\sum\limits_{i=1}^{n} x_{i}y_{i}}{n\sum\limits_{i=1}^{n} x_{i}^{2} - \left(\sum\limits_{i=1}^{n} x_{i}\right)^{2}}$$

With $p = 2$ estimated parameters, the unbiased variance estimate becomes:

$$\hat{\sigma}^{2} = \frac{S}{n-2}.$$
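The closed-form estimates for simple linear regression can be sketched directly from the formulas above, using illustrative data:

```python
import numpy as np

# Hypothetical sample (y_i, x_i), i = 1, ..., n.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
n = len(x)

xbar, ybar = x.mean(), y.mean()

# beta_hat = cov(X, Y) / var(X), in the sum-of-deviations form
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# alpha_hat = ybar - xbar * beta_hat
alpha_hat = ybar - xbar * beta_hat

# Residual sum of squares and the variance estimate S / (n - 2)
S = np.sum((y - (alpha_hat + beta_hat * x)) ** 2)
sigma2_hat = S / (n - 2)
```

As a sanity check, `np.polyfit(x, y, 1)` returns the same slope and intercept.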

The covariance matrix of $(\hat{\alpha}, \hat{\beta})$ is:

$$\sigma^{2}(X^{T}X)^{-1} = \frac{\sigma^{2}}{n\sum_{i=1}^{n} x_{i}^{2} - \left(\sum_{i=1}^{n} x_{i}\right)^{2}} \begin{pmatrix} \sum x_{i}^{2} & -\sum x_{i} \\ -\sum x_{i} & n \end{pmatrix}$$

The confidence interval for the mean response is:

$$y_{d} = \left(\hat{\alpha} + \hat{\beta} x_{d}\right) \pm t_{\frac{\alpha}{2},\,n-2}\,\hat{\sigma}\sqrt{\frac{1}{n} + \frac{(x_{d}-\bar{x})^{2}}{\sum (x_{i}-\bar{x})^{2}}}$$

The prediction interval for a new response is:

$$y_{d} = \left(\hat{\alpha} + \hat{\beta} x_{d}\right) \pm t_{\frac{\alpha}{2},\,n-2}\,\hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(x_{d}-\bar{x})^{2}}{\sum (x_{i}-\bar{x})^{2}}}$$
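Both interval half-widths differ only by the extra `1` under the square root, which accounts for the variance of a single new observation. A sketch with hypothetical data, assuming SciPy for the $t$ quantile:

```python
import numpy as np
from scipy import stats

# Illustrative sample and a new point x_d at which to form intervals.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.3, 2.8, 4.2, 4.9, 6.1])
n = len(x)
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)

beta_hat = np.sum((x - xbar) * (y - y.mean())) / Sxx
alpha_hat = y.mean() - xbar * beta_hat
sigma_hat = np.sqrt(np.sum((y - alpha_hat - beta_hat * x) ** 2) / (n - 2))
t_crit = stats.t.ppf(0.975, df=n - 2)   # 95% two-sided

x_d = 3.5
center = alpha_hat + beta_hat * x_d
# Half-width for the mean response at x_d
half_mean = t_crit * sigma_hat * np.sqrt(1 / n + (x_d - xbar) ** 2 / Sxx)
# Half-width for predicting a single new response at x_d (wider)
half_pred = t_crit * sigma_hat * np.sqrt(1 + 1 / n + (x_d - xbar) ** 2 / Sxx)
```

The prediction interval is always wider than the mean-response interval, since it must also cover the noise of the new observation itself.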

Analysis of variance

In analysis of variance (ANOVA), the total sum of squares is partitioned into two or more components.

The total sum of squares, SST (sum of squares for total), is:

$$\text{SST} = \sum_{i=1}^{n} \left(y_{i} - \bar{y}\right)^{2}$$

where $\bar{y} = \frac{1}{n}\sum_{i} y_{i}$.

Equivalently:

$$\text{SST} = \sum_{i=1}^{n} y_{i}^{2} - \frac{1}{n}\left(\sum_{i} y_{i}\right)^{2}$$

The regression sum of squares, SSReg (also written as the model sum of squares, SSM), is:

$$\text{SSReg} = \sum \left(\hat{y}_{i} - \bar{y}\right)^{2} = \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{y} - \frac{1}{n}\left(\mathbf{y}^{T}\mathbf{u}\mathbf{u}^{T}\mathbf{y}\right),$$

where $\mathbf{u}$ is an $n \times 1$ vector of ones.

The residual sum of squares, SSE (sum of squares for error), is:

$$\text{SSE} = \sum_{i}\left(y_{i} - \hat{y}_{i}\right)^{2} = \mathbf{y}^{T}\mathbf{y} - \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{y}.$$

The total sum of squares SST can also be written as the sum of SSReg and SSE:

$$\text{SST} = \sum_{i}\left(y_{i}-\bar{y}\right)^{2} = \mathbf{y}^{T}\mathbf{y} - \frac{1}{n}\left(\mathbf{y}^{T}\mathbf{u}\mathbf{u}^{T}\mathbf{y}\right) = \text{SSReg} + \text{SSE}.$$

The coefficient of determination, $R^{2}$, is:

$$R^{2} = \frac{\text{SSReg}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.$$
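The decomposition SST = SSReg + SSE (which holds exactly for least-squares fits with an intercept) and the resulting $R^{2}$ can be checked numerically. A sketch with illustrative data:

```python
import numpy as np

# Hypothetical sample; fit by the closed-form SLR formulas.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 2.9, 4.4, 5.2, 7.0])

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - xbar * beta_hat
y_hat = alpha_hat + beta_hat * x   # fitted values

SST = np.sum((y - ybar) ** 2)      # total sum of squares
SSReg = np.sum((y_hat - ybar) ** 2)  # regression (model) sum of squares
SSE = np.sum((y - y_hat) ** 2)     # residual sum of squares
R2 = SSReg / SST                   # equals 1 - SSE / SST
```

Since the fit includes an intercept, the cross term in the decomposition vanishes and the two expressions for $R^{2}$ agree.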