[Probability and Statistics] Notes on Simple Linear Regression for the 2023 Postgraduate Entrance Exam
1. Random Variables
$$Y_i=\beta_0+\beta_1X_i+\varepsilon_i$$
$$df=n-2$$
1-1. $\varepsilon_i$
Distribution
$$\varepsilon_i\sim N(0,\sigma^2)\ \text{i.i.d}$$
Properties
$$\frac{\varepsilon_i}{\sigma}\sim {\rm Std}N$$
$$\frac{\sum \varepsilon_i^2}{\sigma^2}\sim \chi^2(n)$$
Point Estimation
$$s^2_\varepsilon:=s^2={\rm MSE}_{\rm OBS}=\frac{{\rm SSE}_{\rm OBS}}{n-2}=\frac{\sum e_i^2}{n-2}$$
1-2. ${\rm SSE}/\sigma^2$
Definition
$${\rm SSE}:=\sum e_i^2=\sum (Y_i-b_0-b_1X_i)^2$$
Distribution
$$\frac{{\rm SSE}}{\sigma^2}=\frac{\sum e_i^2}{\sigma^2}\sim \chi^2(n-2)$$
1-3. ${\rm MSE}/\sigma^2$
Definition
$${\rm MSE}:=\frac{{\rm SSE}}{n-2}$$
Distribution
$$\frac{{\rm MSE}}{\sigma^2}=\frac{{\rm SSE}/(n-2)}{\sigma^2}\sim \frac{\chi^2(n-2)}{n-2}$$
1-4. $Y_i$
Definition
$$Y_i:=\beta_0+\beta_1X_i+\varepsilon_i,\quad \beta_0,\beta_1,X_i\in \mathbb R$$
$\hat Y_i\in \mathbb R$
Definition
$$\hat Y_i:=E(Y_i)=\beta_0+\beta_1X_i$$
Properties
$$Y_i=\hat Y_i+\varepsilon_i$$
$$E(\bar Y)=\frac{1}{n}\sum \hat Y_i=\frac{1}{n}\sum (\beta_0+\beta_1X_i)=\beta_0+\beta_1\bar X$$
Distribution
$$Y_i\sim \hat Y_i+N(0,\sigma^2)=N(\hat Y_i,\sigma^2)\ \text{independent}$$
1-5. $b_1$
Definition
$$b_1:=\frac{\ell_{xy}}{\ell_{xx}}=\frac{\sum (X_i-\bar X)(Y_i-\bar Y)}{\sum (X_i-\bar X)^2}=\sum \frac{X_i-\bar X}{\sum (X_i-\bar X)^2}Y_i$$
$k_i\in \mathbb R$
Definition
$$k_i:=\frac{\tilde X_i}{\ell_{xx}}=\frac{X_i-\bar X}{\sum (X_i-\bar X)^2}$$
Properties
$$b_1=\sum k_iY_i$$
$$\sum k_i=\frac{\sum(X_i-\bar X)}{\sum (X_i-\bar X)^2}=0$$
$$\sum k_iX_i=\frac{\sum(X_i-\bar X)X_i}{\sum (X_i-\bar X)^2}=\frac{\sum(X_i-\bar X)^2}{\sum (X_i-\bar X)^2}+\frac{\bar X\sum(X_i-\bar X)}{\sum (X_i-\bar X)^2}=1$$
$$\sum k_i^2=\frac{\sum (X_i-\bar X)^2}{\left(\sum (X_i-\bar X)^2\right)^2}=\frac{1}{\sum (X_i-\bar X)^2}=\frac{1}{\ell_{xx}}$$
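The three properties of $k_i$ can be checked numerically. The following is a minimal sketch in plain Python; the data and the helper name `slope_weights` are made up for illustration and are not part of the notes.

```python
def slope_weights(xs):
    """Weights k_i = (X_i - X_bar) / l_xx from the definition above."""
    x_bar = sum(xs) / len(xs)
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / l_xx for x in xs]


# Illustrative data (assumed for this sketch only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

ks = slope_weights(xs)
b1 = sum(k * y for k, y in zip(ks, ys))       # b_1 = sum k_i Y_i
sum_k = sum(ks)                               # property: sum k_i = 0
sum_kx = sum(k * x for k, x in zip(ks, xs))   # property: sum k_i X_i = 1
sum_k2 = sum(k * k for k in ks)               # property: sum k_i^2 = 1 / l_xx
```

Here $\ell_{xx}=10$, so `sum_k2` should come out as $0.1$ up to rounding.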
Distribution
$$b_1=\sum k_iY_i\sim \sum k_iN(\hat Y_i,\sigma^2)=N(\sum k_i\hat Y_i,\sum k_i^2\sigma^2)$$
$$\sum k_i\hat Y_i=\sum k_i(\beta_0+\beta_1X_i)=\beta_0\sum k_i+\beta_1\sum k_iX_i=\beta_1$$
$$\sum k_i^2\sigma^2=\sigma^2\sum k_i^2=\frac{\sigma^2}{\ell_{xx}}$$
$$b_1\sim N\left(\beta_1,\frac{\sigma^2}{\ell_{xx}}\right)$$
Point Estimation
$$\hat\beta_1=b_1,\qquad E(b_1)=\beta_1$$
$$s^2_{b_1}=\frac{{\rm MSE}}{\ell_{xx}}=\frac{{\rm MSE}}{\sum (X_i-\bar X)^2}$$
$(Z-\mu)/s\sim t(df)$
Distribution
$$\frac{Z-\mu}{\sigma}\sim {\rm Std}N,\qquad\frac{s}{\sigma}=\sqrt{\frac{{\rm MSE}}{\sigma^2}}\sim\sqrt{\frac{\chi^2(df)}{df}}$$
$$\frac{Z-\mu}{s}=\frac{Z-\mu}{\sigma}\Big/\frac{s}{\sigma}\sim \frac{{\rm Std}N}{\sqrt{\chi^2(df)/df}}=t(df)$$
1-6. $b_0$
Definition
$$b_0:=\bar Y-b_1\bar X$$
Distribution
$\bar Y\sim N(\beta_0+\beta_1\bar X,\sigma^2/n)$
Definition
$$\bar Y:=\frac{1}{n}\sum Y_i\sim \frac{1}{n}\sum N(\hat Y_i,\sigma^2)=N\left(\frac{\sum \hat Y_i}{n},\frac{\sigma^2}{n}\right)=N\left(\beta_0+\beta_1\bar X,\frac{\sigma^2}{n}\right)$$
Properties
$$Cov(\bar Y,b_1)=Cov\left(\frac{1}{n}\sum Y_i,\sum k_iY_i\right)=\frac{\sum k_i}{n}\sigma^2=0$$
$$\begin{aligned}b_0&=\bar Y-b_1\bar X\\&\sim N\left(\beta_0+\beta_1\bar X,\frac{\sigma^2}{n}\right)-\bar XN\left(\beta_1,\frac{\sigma^2}{\ell_{xx}}\right)\\&=N\left(\beta_0,\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{\ell_{xx}}\right)\right)\end{aligned}$$
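As a rough sketch of how $b_0$, $b_1$ and their estimated standard errors follow from the formulas so far, the function below plugs ${\rm MSE}$ in for $\sigma^2$. The name `fit_simple_ols` and the dictionary keys are hypothetical, and $s_{b_0}$ is assumed to be obtained from the variance of $b_0$ the same way $s_{b_1}$ is obtained from the variance of $b_1$.

```python
import math


def fit_simple_ols(xs, ys):
    """Least-squares fit of Y = b0 + b1 * X with estimated standard errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = l_xy / l_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 2)                       # df = n - 2
    s_b1 = math.sqrt(mse / l_xx)              # sigma^2 replaced by MSE
    s_b0 = math.sqrt(mse * (1.0 / n + x_bar ** 2 / l_xx))
    return {"b0": b0, "b1": b1, "mse": mse, "s_b0": s_b0, "s_b1": s_b1}


# Illustrative data (assumed for this sketch only).
fit = fit_simple_ols([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
```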
1-7. $Y_h$
Definition
$$Y_h:=b_0+b_1X_h$$
Distribution
$$\begin{aligned}Y_h&=\bar Y+b_1\tilde X_h\\&\sim N\left(\beta_0+\beta_1\bar X,\frac{\sigma^2}{n}\right)+\tilde X_hN\left(\beta_1,\frac{\sigma^2}{\ell_{xx}}\right)\\&=N\left(\beta_0+\beta_1X_h,\sigma^2\left(\frac{1}{n}+\frac{\tilde X_h^2}{\ell_{xx}}\right)\right)\end{aligned}$$
Confidence Interval
$$y_h\in[\hat Y_h\pm t_{1-\frac{\alpha}{2}}(n-2)s_h],\qquad s_h^2={\rm MSE}\left(\frac{1}{n}+\frac{\tilde X_h^2}{\ell_{xx}}\right)$$
Working-Hotelling Confidence Band
$$y_h\in[\hat Y_h\pm Ws_h]$$
$$W^2=2F_{1-\alpha}(2,n-2)$$
Remark
Mean response at $X_h$, i.e. the average of infinitely many new observations
1-8. $Y_{\rm pred}$
Definition
$$Y_{\rm pred}=Y_h+\varepsilon_{\rm pred},\qquad \varepsilon_{\rm pred}\sim N(0,\sigma^2)$$
Distribution
$$Y_{\rm pred}\sim N\left(\beta_0+\beta_1X_h,\sigma^2\left(1+\frac{1}{n}+\frac{\tilde X_h^2}{\ell_{xx}}\right)\right)$$
Remark
Response of a single new observation at $X_h$
1-9. $Y_{\rm predmean}$
Definition
$$Y_{\rm predmean}=Y_h+\frac{1}{m}\sum_{j=1}^m\varepsilon_{\rm pred_j},\qquad \varepsilon_{\rm pred_j}\sim N(0,\sigma^2)\ \text{i.i.d}$$
Distribution
$$Y_{\rm predmean}\sim N\left(\beta_0+\beta_1X_h,\sigma^2\left(\frac{1}{m}+\frac{1}{n}+\frac{\tilde X_h^2}{\ell_{xx}}\right)\right)$$
Remark
Average of $m$ new observations at $X_h$
Relations Between $Y_h,\ Y_{\rm pred},\ Y_{\rm predmean}$
$$Y_h=Y_{\rm predmean}(m=\infty)$$
$$Y_{\rm pred}=Y_{\rm predmean}(m=1)$$
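The relation between the three variances can be written as one function of $m$. This is only a sketch; `response_variance_factor` is a hypothetical name, and it returns the factor multiplying $\sigma^2$, not the variance itself.

```python
import math


def response_variance_factor(x_h, xs, m=math.inf):
    """Factor c such that Var = sigma^2 * c for the average of m new responses.

    m = inf gives the Y_h factor, m = 1 gives the Y_pred factor.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    new_obs_term = 0.0 if math.isinf(m) else 1.0 / m
    return new_obs_term + 1.0 / n + (x_h - x_bar) ** 2 / l_xx


# Illustrative design points (assumed for this sketch only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
factor_mean = response_variance_factor(4.0, xs)         # Y_h
factor_pred = response_variance_factor(4.0, xs, m=1)    # Y_pred
factor_avg4 = response_variance_factor(4.0, xs, m=4)    # Y_predmean, m = 4
```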
1-10. $e_i$
Definition
$$e_i=Y_i-b_0-b_1X_i$$
Distribution
$$e_i\sim N\left(0,\sigma^2\left(1-\frac{1}{n}-\frac{\tilde X_i^2}{\ell_{xx}}\right)\right)$$
$${\rm Cov}(e_i,e_j)=\sigma^2\left(-\frac{1}{n}-\frac{\tilde X_i\tilde X_j}{\ell_{xx}}\right),\quad i\neq j$$
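A quick way to sanity-check these formulas: the variance factors of the residuals should add up to $n-2$, and each row of covariance factors should add up to $0$ (since $\sum e_i=0$). The sketch below uses a hypothetical helper `residual_cov_factor` with 0-based indices.

```python
def residual_cov_factor(i, j, xs):
    """Return Cov(e_i, e_j) / sigma^2; for i == j this is Var(e_i) / sigma^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    h_ij = 1.0 / n + (xs[i] - x_bar) * (xs[j] - x_bar) / l_xx
    return (1.0 if i == j else 0.0) - h_ij


# Illustrative design points (assumed for this sketch only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
n = len(xs)
total_variance_factor = sum(residual_cov_factor(i, i, xs) for i in range(n))
first_row_sum = sum(residual_cov_factor(0, j, xs) for j in range(n))
```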
1-11. Summary
Variable | Distribution | Mean | Variance | Covariance |
---|---|---|---|---|
$\varepsilon_i$ | $N$ | $0$ | $\sigma^2$ | $0$ |
${\rm SSE}/\sigma^2$ | $\chi^2(n-2)$ | |||
${\rm MSE}/\sigma^2$ | $\frac{\chi^2(n-2)}{n-2}$ | |||
$Y_i$ | $N$ | $\hat Y_i$ | $\sigma^2$ | $0$ |
$b_1$ | $N$ | $\beta_1$ | $\frac{\sigma^2}{\ell_{xx}}$ | |
$b_0$ | $N$ | $\beta_0$ | $\sigma^2\left(\frac{1}{n}+\frac{\bar X^2}{\ell_{xx}}\right)$ | |
$Y_{\rm predmean}$ | $N$ | $\beta_0+\beta_1X_h$ | $\sigma^2\left(\frac{1}{m}+\frac{1}{n}+\frac{\tilde X_h^2}{\ell_{xx}}\right)$ | |
$e_i$ | $N$ | $0$ | $\sigma^2\left(1-\frac{1}{n}-\frac{\tilde X_i^2}{\ell_{xx}}\right)$ | $\sigma^2\left(-\frac{1}{n}-\frac{\tilde X_i\tilde X_j}{\ell_{xx}}\right)$ |
1-12. ANOVA Table
ANOVA | ${\rm SS}$ | $df$ | ${\rm MS}$ | $E({\rm MS})$ |
---|---|---|---|---|
Regression | ${\rm SSR}=\sum(\hat Y_i-\bar Y)^2$ | $1$ | ${\rm MSR}={\rm SSR}$ | $\sigma^2+\beta_1^2\ell_{xx}$ |
Error | ${\rm SSE}=\sum(Y_i-\hat Y_i)^2$ | $n-2$ | ${\rm MSE}=\frac{{\rm SSE}}{n-2}$ | $\sigma^2$ |
Total | ${\rm SSTO}=\sum(Y_i-\bar Y)^2$ | $n-1$ |
- Coefficient of Determination $R^2:={\rm SSR}/{\rm SSTO}$
- Coefficient of Correlation $r:=\mathop{\mathrm {sgn}}b_1\cdot \sqrt{R^2}=\ell_{xy}/\sqrt{\ell_{xx}\ell_{yy}}$
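The ANOVA decomposition and the two expressions for $r$ can be verified on a small sample. A minimal sketch follows, with fitted values taken as $b_0+b_1X_i$; the function name `anova_simple` and the data are assumptions for illustration.

```python
import math


def anova_simple(xs, ys):
    """Sums of squares, R^2 and signed r for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = l_xy / l_xx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ssto = sum((y - y_bar) ** 2 for y in ys)
    r2 = ssr / ssto
    r = math.copysign(math.sqrt(r2), b1)      # sign of r follows sign of b1
    return {"ssr": ssr, "sse": sse, "ssto": ssto, "r2": r2, "r": r}


# Illustrative data (assumed for this sketch only).
table = anova_simple([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 5.0, 4.0, 5.0])
```

For this sample $\ell_{xy}=6$, $\ell_{xx}=10$, $\ell_{yy}=6$, so `table["r"]` should agree with $6/\sqrt{60}$.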
2. Estimations & Tests
2-1. $\beta_1$
Interval Estimation
$$t_{\frac{\alpha}{2}}(n-2)\leq \frac{b_1-\beta_1}{s_{b_1}}\leq t_{1-\frac{\alpha}{2}}(n-2)$$
$$\beta_1\in\left[b_1\pm t_{1-\frac{\alpha}{2}}(n-2)s_{b_1}\right]$$
Tests
$$H_0:\beta_1=0\quad\text{vs}\quad H_1:\beta_1\neq 0$$
- $t$-Test
$$t:=\frac{b_1}{s_{b_1}},\qquad W=\Big\{|t|>t_{1-\frac{\alpha}{2}}(n-2)\Big\}$$
- $F$-Test
$$F:=\frac{{\rm MSR}}{{\rm MSE}}\sim F(1,n-2),\qquad W=\Big\{F>F_{1-\alpha}(1,n-2)\Big\}$$
- General Linear Test
$$R:\text{Reduced Model},\qquad F:\text{Full Model}$$
$$F:=\frac{{\rm SSE}_R-{\rm SSE}_F}{df_R-df_F}\Big/\frac{{\rm SSE}_F}{df_F},\qquad W=\Big\{F>F_{1-\alpha}(df_R-df_F,df_F)\Big\}$$
$$R:Y_i=\beta_0+\varepsilon_i,\qquad F:Y_i=\beta_1X_i+\beta_0+\varepsilon_i$$
$${\rm SSE}_R={\rm SSTO},\qquad {\rm SSE}_F={\rm SSE}$$
$$df_R=n-1,\qquad df_F=n-2$$
$$F={\rm MSR}/{\rm MSE}$$
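The general linear test statistic is just arithmetic on the two error sums of squares. In the sketch below, `general_linear_f` is a hypothetical helper, and the numbers (${\rm SSTO}=6$, ${\rm SSE}=2.4$, $n=5$, $b_1=0.6$, $\ell_{xx}=10$) come from a made-up sample. It also illustrates that $F=t^2$ for the test of $\beta_1=0$.

```python
def general_linear_f(sse_reduced, df_reduced, sse_full, df_full):
    """F = ((SSE_R - SSE_F) / (df_R - df_F)) / (SSE_F / df_F)."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


# Reduced model: intercept only (SSE_R = SSTO, df_R = n - 1).
# Full model: simple linear regression (SSE_F = SSE, df_F = n - 2).
f_stat = general_linear_f(sse_reduced=6.0, df_reduced=4, sse_full=2.4, df_full=3)

# t statistic for the same made-up sample: t = b1 / sqrt(MSE / l_xx).
t_stat = 0.6 / (0.8 / 10.0) ** 0.5
```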
2-2. $\beta_0$
Interval Estimation
$$t_{\frac{\alpha}{2}}(n-2)\leq \frac{b_0-\beta_0}{s_{b_0}}\leq t_{1-\frac{\alpha}{2}}(n-2)$$
$$\beta_0\in\left[b_0\pm t_{1-\frac{\alpha}{2}}(n-2)s_{b_0}\right]$$
3. Error Normality & Variance Constancy
3-1. Studentized and Semi-studentized Residuals
$$s_{e_i}=\sigma_{e_i}\Big|_{\sigma=\sqrt{\rm MSE}},\qquad s_{e_i}^{(\rm stu)}=\frac{e_i}{s_{e_i}},\qquad s_{e_i}^{(\rm semistu)}=\frac{e_i}{\sqrt{\rm MSE}}$$
3-2. Normal Q-Q Plot of Residuals
$$E(\varepsilon_i\mid \text{$e_i$ is the $k$-th smallest among $e$})\approx u_{\frac{k-3/8}{n+1/4}}\sqrt{{\rm MSE}}$$
- Plot $y=E(\varepsilon_i\mid k),\ x=e_i$
- The residuals are approximately normal if the points scatter near the line $y=x$
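The expected values for the Q-Q plot can be computed with the standard library. A minimal sketch follows, assuming no tied residuals; `expected_normal_residuals` is a hypothetical name, and $u_p$ is taken to be the standard normal quantile at $p$.

```python
import math
from statistics import NormalDist


def expected_normal_residuals(residuals, mse):
    """Expected value under normality for each residual, by its rank k."""
    n = len(residuals)
    order = sorted(range(n), key=lambda i: residuals[i])
    expected = [0.0] * n
    for rank, i in enumerate(order, start=1):
        p = (rank - 0.375) / (n + 0.25)       # (k - 3/8) / (n + 1/4)
        expected[i] = math.sqrt(mse) * NormalDist().inv_cdf(p)
    return expected


# Illustrative residuals and MSE (assumed for this sketch only).
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
expected = expected_normal_residuals(residuals, mse=0.8)
# Plot pairs (x = residuals[i], y = expected[i]) and compare with y = x.
```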
3-3. Brown-Forsythe Test for Variance Constancy
$$H_0:{\rm Var}(\varepsilon_i)={\rm const}\quad\text{vs}\quad H_1:{\rm Var}(\varepsilon_i)\neq {\rm const}$$
$$\mathcal X_{\rm OBS}=\mathcal X^{(1)}\cup \mathcal X^{(2)},\qquad \mathcal X^{(1)}=\{x\in\mathcal X_{\rm OBS}\mid x<\bar x\},\qquad \mathcal X^{(2)}=\mathcal X_{\rm OBS}\setminus\mathcal X^{(1)}$$
$$d^{(1)}_i=|e^{(1)}_i-e^{(1)}_{\rm mid}|,\qquad \tilde d^{(1)}_i=d^{(1)}_i-\bar d^{(1)},\qquad s^2=\frac{\sum (\tilde d^{(1)}_i)^2+\sum (\tilde d^{(2)}_i)^2}{n-2}$$
$$t_{\rm BF}=\frac{\bar d^{(1)}-\bar d^{(2)}}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}},\qquad W=\Big\{|t_{\rm BF}|>t_{1-\frac{\alpha}{2}}(n-2)\Big\}$$
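A sketch of the Brown-Forsythe statistic, assuming the residuals have already been split into the two groups, that $e_{\rm mid}$ is the group median, and that $s^2$ pools the squared centered deviations of $d$ within each group. The function name `brown_forsythe_t` and the inputs are made up.

```python
import math
from statistics import mean, median


def brown_forsythe_t(e1, e2):
    """t_BF from the residuals of group 1 (e1) and group 2 (e2)."""
    med1, med2 = median(e1), median(e2)
    d1 = [abs(e - med1) for e in e1]          # absolute deviations from median
    d2 = [abs(e - med2) for e in e2]
    n1, n2 = len(d1), len(d2)
    d1_bar, d2_bar = mean(d1), mean(d2)
    pooled = sum((d - d1_bar) ** 2 for d in d1) + sum((d - d2_bar) ** 2 for d in d2)
    s = math.sqrt(pooled / (n1 + n2 - 2))     # df = n - 2
    return (d1_bar - d2_bar) / (s * math.sqrt(1.0 / n1 + 1.0 / n2))


# Illustrative residual groups (assumed for this sketch only).
t_bf = brown_forsythe_t([-1.0, 0.0, 1.0], [-3.0, 0.0, 3.0])
```

For these inputs $\bar d^{(1)}=2/3$, $\bar d^{(2)}=2$ and $s^2=5/3$, which gives $t_{\rm BF}=-4/\sqrt{10}$.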
3-4. Breusch-Pagan Test for Variance Constancy of Large Sample
$$\ln \sigma_i^2=\gamma_0+\gamma_1X_i$$
$$H_0:{\rm Var}(\varepsilon_i)={\rm const}\iff \gamma_1=0\quad\text{vs}\quad H_1:{\rm Var}(\varepsilon_i)\neq {\rm const}$$
$$\chi^2_{\rm BP}=\frac{{\rm SSR}_\gamma}{2}\Big/\left(\frac{{\rm SSE}_\beta}{n}\right)^2\ \dot\sim\ \chi^2(1),\qquad W=\Big\{\chi^2_{\rm BP}>\chi^2_{1-\alpha}(1)\Big\}$$
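A sketch of the statistic, under the assumption (common in textbook presentations, but not stated in these notes) that ${\rm SSR}_\gamma$ is the regression sum of squares from regressing $e_i^2$ on $X_i$, and ${\rm SSE}_\beta$ is the error sum of squares of the original fit. The name `breusch_pagan_stat` is hypothetical.

```python
def breusch_pagan_stat(xs, residuals):
    """chi^2_BP = (SSR_gamma / 2) / (SSE_beta / n)^2."""
    n = len(xs)
    sq = [e * e for e in residuals]           # squared residuals e_i^2
    x_bar = sum(xs) / n
    sq_bar = sum(sq) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xq = sum((x - x_bar) * (q - sq_bar) for x, q in zip(xs, sq))
    ssr_gamma = l_xq ** 2 / l_xx              # regression SS of e_i^2 on X_i
    sse_beta = sum(sq)                        # SSE of the original regression
    return (ssr_gamma / 2.0) / (sse_beta / n) ** 2


# Illustrative inputs (assumed for this sketch only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
chi2_bp = breusch_pagan_stat(xs, residuals)
```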
4. Samples with Repeated $X$
4-1. Partitioning $\mathcal X_{\rm OBS}$
Partition $\mathcal X_{\rm OBS}$ into $c$ groups of observations sharing the same $X$ value
$$\mathcal X_{\rm OBS}=\bigcup \mathcal X^{(j)},\quad j=1,\cdots,c,\qquad {\rm SSE}={\rm SSPE}+{\rm SSLF}$$
$${\rm SSPE}=\sum_j\sum_i(Y_i^{(j)}-\bar Y^{(j)})^2,\qquad {\rm SSLF}=\sum_j\sum_i(\bar Y^{(j)}-\hat Y_i^{(j)})^2$$
ANOVA | ${\rm SS}$ | $df$ | ${\rm MS}$ |
---|---|---|---|
Regression | ${\rm SSR}=\sum\sum(\hat Y_i^{(j)}-\bar Y)^2$ | $1$ | ${\rm MSR}={\rm SSR}$ |
Error | ${\rm SSE}=\sum\sum(Y_i^{(j)}-\hat Y_i^{(j)})^2$ | $n-2$ | ${\rm MSE}=\frac{{\rm SSE}}{n-2}$ |
Lack of Fit | ${\rm SSLF}=\sum\sum(\bar Y^{(j)}-\hat Y_i^{(j)})^2$ | $c-2$ | ${\rm MSLF}=\frac{{\rm SSLF}}{c-2}$ |
Pure Error | ${\rm SSPE}=\sum\sum(Y_i^{(j)}-\bar Y^{(j)})^2$ | $n-c$ | ${\rm MSPE}=\frac{{\rm SSPE}}{n-c}$ |
Total | ${\rm SSTO}=\sum\sum(Y_i^{(j)}-\bar Y)^2$ | $n-1$ |
$$E({\rm MSPE})=\sigma^2,\quad E({\rm MSLF})=\sigma^2+\frac{\sum n_j\left(\mu_j-(\beta_0+\beta_1X_j)\right)^2}{c-2}$$
4-2. Lack-of-Fit $F$-Test with Repeated $X$
$$H_0:EY=\beta_0+\beta_1X\quad\text{vs}\quad H_1:EY\neq \beta_0+\beta_1X$$
$$F:Y_i^{(j)}=\mu^{(j)}+\varepsilon_i^{(j)},\qquad R:Y_i=\beta_1X_i+\beta_0+\varepsilon_i$$
$${\rm SSE}_F={\rm SSPE},\qquad {\rm SSE}_R={\rm SSE}$$
$$df_F=n-c,\qquad df_R=n-2$$
$$F={\rm MSLF}/{\rm MSPE}$$
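The decomposition ${\rm SSE}={\rm SSPE}+{\rm SSLF}$ and the resulting $F$ can be sketched as below. Grouping is done by exact equality of the $X$ values, which is an assumption that suits designed experiments but not noisy measurements. The function name `lack_of_fit_f` and the data are made up, and the sketch needs $c>2$ and $n>c$.

```python
from collections import defaultdict


def lack_of_fit_f(xs, ys):
    """SSPE, SSLF and F = MSLF / MSPE for data with repeated X values."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    l_xx = sum((x - x_bar) ** 2 for x in xs)
    l_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = l_xy / l_xx
    b0 = y_bar - b1 * x_bar

    groups = defaultdict(list)                # X value -> responses at that X
    for x, y in zip(xs, ys):
        groups[x].append(y)
    c = len(groups)

    sspe = 0.0
    sslf = 0.0
    for x, y_group in groups.items():
        group_mean = sum(y_group) / len(y_group)
        sspe += sum((y - group_mean) ** 2 for y in y_group)
        sslf += len(y_group) * (group_mean - (b0 + b1 * x)) ** 2

    mslf = sslf / (c - 2)
    mspe = sspe / (n - c)
    return {"sspe": sspe, "sslf": sslf, "f": mslf / mspe}


# Illustrative data with c = 3 distinct X values (assumed for this sketch only).
result = lack_of_fit_f([1.0, 1.0, 2.0, 2.0, 3.0, 3.0], [1.0, 3.0, 2.0, 4.0, 6.0, 8.0])
```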
4-3. Box-Cox Transformation of $Y$ to $Y^\lambda$ or $\ln Y$
$$Y^{(\lambda)}=\left\{\begin{aligned} &Y^\lambda,\qquad&&\lambda\neq 0,\\ &\ln Y,\qquad&&\lambda=0 \end{aligned}\right.$$
$$Y^{(\lambda)}_i=\beta_0+\beta_1X_i+\varepsilon_i$$
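The piecewise transformation is a one-liner in code. The sketch below only applies the transformation for a given $\lambda$; choosing $\lambda$ is not covered here. The name `box_cox` is a hypothetical helper, and requiring $Y>0$ is an assumption of this sketch.

```python
import math


def box_cox(y, lam):
    """Return Y^lam for lam != 0 and ln Y for lam == 0 (requires Y > 0)."""
    if y <= 0:
        raise ValueError("this sketch assumes Y > 0")
    return math.log(y) if lam == 0 else y ** lam


# Transform a made-up response vector with lam = 0.5 (square root).
transformed = [box_cox(y, 0.5) for y in [1.0, 4.0, 9.0]]
```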
5. Simultaneous PI & CI
$${\rm CI}_0=b_0\pm t_{1-\frac{\alpha}{2}}(n-2)s_{b_0},\qquad {\rm CI}_1=b_1\pm t_{1-\frac{\alpha}{2}}(n-2)s_{b_1}$$
$$\Pr(\beta_0\notin{\rm CI}_0)=\Pr(\beta_1\notin{\rm CI}_1)=\alpha$$
$$\Pr(\beta_0\in{\rm CI}_0\land \beta_1\in{\rm CI}_1)\geq 1-2\alpha$$
5-1. Joint CI of $\beta_0,\beta_1$
$$B:=t_{1-\frac{\alpha}{4}}(n-2)$$
$${\rm BonfCI}_0=b_0\pm Bs_{b_0},\qquad{\rm BonfCI}_1=b_1\pm Bs_{b_1}$$
$$\Pr(\beta_0\in{\rm BonfCI}_0\land \beta_1\in{\rm BonfCI}_1)\geq 1-\alpha$$
5-2. Simultaneous $Y_h$ CI of $\{X_{h_i}\}_{i=1}^g$
$$\hat Y_h\pm Us_{Y_h}$$
Bonferroni CI
$$U=B_\alpha(g)=t_{1-\frac{\alpha}{2g}}(n-2)$$
Working-Hotelling CI
$$U=W_\alpha=\sqrt{2F_{1-\alpha}(2,n-2)}$$
5-3. Simultaneous $Y_{\rm pred}$ PI of $\{X_{h_i}\}_{i=1}^g$
$$\hat Y_{\rm pred}\pm Us_{Y_{\rm pred}}$$
Bonferroni PI
$$U=B_\alpha(g)=t_{1-\frac{\alpha}{2g}}(n-2)$$
Scheffé PI
$$U=S_\alpha(g)=\sqrt{gF_{1-\alpha}(g,n-2)}$$
6. Regression Assuming $\beta_0=0$
$$Y_i=\beta_1X_i+\varepsilon_i$$
6-1. $b_1$
$$b_1=\frac{\sum X_iY_i}{\sum X_i^2}=\frac{\ell_{xy}}{\ell_{xx}}\Big|_{\bar X=\bar Y=0}\sim N\left(\beta_1,\frac{\sigma^2}{\ell_{xx}}\right)\Big|_{\bar X=0},\qquad df=n-1$$
6-2. $e_i$
$$e_i\sim N\left(0,\sigma^2\left(1-\frac{X_i^2}{\ell_{xx}}\right)\right)\Big|_{\bar X=0},\qquad {\rm Cov}(e_i,e_j)=\sigma^2\left(-\frac{X_iX_j}{\ell_{xx}}\right)\Big|_{\bar X=0}$$
6-3. $Y_{\rm pred}$
$$Y_{\rm pred}\sim N\left(\beta_1X_h,\sigma^2\left(1+\frac{X_h^2}{\ell_{xx}}\right)\right)\Big|_{\bar X=0}$$
6-4. ANOVA Table
ANOVA | ${\rm SS}$ | $df$ | ${\rm MS}$ | $E({\rm MS})$ |
---|---|---|---|---|
Regression | ${\rm SSRU}=\sum\hat Y_i^2$ | $1$ | ${\rm MSRU}={\rm SSRU}$ | $\sigma^2+\beta_1^2\ell_{xx}\mid_{\bar X=0}$ |
Error | ${\rm SSE}=\sum(Y_i-\hat Y_i)^2$ | $n-1$ | ${\rm MSE}=\frac{{\rm SSE}}{n-1}$ | $\sigma^2$ |
Total | ${\rm SSTOU}=\sum Y_i^2$ | $n$ |
6-5. Test of $\beta_1=0$
$$H_0:\beta_1=0\quad\text{vs}\quad H_1:\beta_1\neq 0$$
$$F:=\frac{{\rm MSRU}}{{\rm MSE}}\sim F(1,n-1),\qquad W=\Big\{F>F_{1-\alpha}(1,n-1)\Big\}$$
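A sketch of the through-origin fit and its $F$ statistic, with fitted values $b_1X_i$ and $df=n-1$ for the error. The function name `fit_through_origin` and the data are assumptions for illustration.

```python
def fit_through_origin(xs, ys):
    """Fit Y = b1 * X (no intercept) and return the uncorrected ANOVA pieces."""
    n = len(xs)
    sum_xx = sum(x * x for x in xs)
    b1 = sum(x * y for x, y in zip(xs, ys)) / sum_xx
    ssru = sum((b1 * x) ** 2 for x in xs)                  # sum of fitted^2
    sse = sum((y - b1 * x) ** 2 for x, y in zip(xs, ys))
    mse = sse / (n - 1)                                    # df = n - 1
    return {"b1": b1, "ssru": ssru, "sse": sse, "mse": mse, "f": ssru / mse}


# Illustrative data (assumed for this sketch only).
origin_fit = fit_through_origin([1.0, 2.0, 3.0], [2.0, 4.0, 7.0])
```

For this sample $\sum X_iY_i=31$ and $\sum X_i^2=14$, so $b_1=31/14$, and ${\rm SSRU}+{\rm SSE}$ should reproduce ${\rm SSTOU}=\sum Y_i^2=69$.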