Chapter 6: Multiple Linear Regression Analysis
Copyright © 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

Learning Objectives
• Understand the goals of multiple linear regression analysis
• Understand the "holding all other variables constant" condition in multiple linear regression analysis
• Understand the multiple linear regression assumptions required for OLS to be BLUE
• Interpret multiple linear regression output in Excel
• Assess the goodness-of-fit of the estimated sample regression function
6-2
Learning Objectives
• Perform hypothesis tests for the overall significance of the estimated sample regression function
• Perform hypothesis tests for the individual significance of an estimated slope coefficient
• Perform hypothesis tests for the joint significance of a subset of estimated slope coefficients
• Perform the Chow test for structural differences between two subsets of data
6-3
The Multiple Regression Model
Idea: examine the linear relationship between one dependent variable, y, and two or more independent variables, x1, x2, …, xk.
Population model (β0 is the y-intercept, β1, …, βk are the population slopes, and ε is the random error):
  y = β0 + β1x1 + β2x2 + … + βkxk + ε
Estimated multiple regression model (ŷ is the estimated, or predicted, value of y; β̂0 is the estimated intercept; β̂1, …, β̂k are the estimated slope coefficients):
  ŷ = β̂0 + β̂1x1 + β̂2x2 + … + β̂kxk
6-4
A Visual Depiction of the Estimated Sample Multiple Linear Regression Function
6-5
A Visual Depiction of the Predicted Value of y and the Calculated Residual for a Given Observation
6-6
A Visual Depiction of the Predicted Values of y and the Calculated Residuals for Multiple Observations
6-7
How Are the Multiple Linear Regression Estimates Obtained?
Minimize the sum of squared residuals:
  min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − β̂0 − β̂1x1,i − β̂2x2,i − … − β̂kxk,i)²
Unlike simple linear regression, there is no formula in summation notation for the intercept and slope coefficient estimates.
6-8
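Although there is no summation-notation formula, the minimization above is easy to carry out numerically. A minimal sketch with made-up data (the slides themselves use Excel's regression tool; numpy and the simulated data here are my additions for illustration):

```python
import numpy as np

# Made-up data: y = 3 + 1.5*x1 - 2*x2 + noise (illustrative, not the slides' data set)
rng = np.random.default_rng(42)
n = 30
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 5, size=n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(0, 0.5, size=n)

# Design matrix: a column of ones for the intercept, then the regressors
X = np.column_stack([np.ones(n), x1, x2])

# lstsq minimizes the sum of squared residuals over (beta0_hat, beta1_hat, beta2_hat)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
ssr = float(residuals @ residuals)   # the minimized sum of squared residuals
print(beta_hat.round(2), round(ssr, 2))
```

With enough data and little noise, the estimates land near the true coefficients (3, 1.5, −2).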
6-9
Understand the "Holding All Other Independent Variables Constant" Condition
The idea behind holding all other factors constant (or ceteris paribus) is that we want to isolate the effect of a specific x on the dependent variable without any other factors changing either that independent variable or the dependent variable.
6-10
A Venn Diagram of the Estimated Linear Relationship between y and x1, Assuming No Factors in the Error Term Affect x1
6-11
A Venn Diagram of the Estimated Multiple Linear Relationship between y, x1, and x2
6-12
A Comparison of the Estimated Marginal Effect of y and x1 in the Given Venn Diagrams
This is omitted variable bias: the bias in the estimated coefficient on x1 that results from x2 not being included in the model while x2 is related to both x1 and y.
6-13
When Is Omitted Variable Bias Not Present?
(1) If x2 is not related to y – then x2 is not in the error term and does not have to be held constant when x1 changes.
(2) If x2 is not related to x1 – then x2 will not change when x1 changes.
6-14
Understand the Multiple Linear Regression Assumptions Required for OLS to Be the Best Linear Unbiased Estimator
Assumptions required for OLS to be unbiased:
Assumption S1: The model is linear in the parameters.
Assumption S2: The data are collected through independent, random sampling.
Assumption S3: The data are not perfectly multicollinear.
Assumption S4: The error term has zero mean.
Assumption S5: The error term is uncorrelated with each independent variable and all functions of each independent variable.
Additional assumption required for OLS to be BLUE:
Assumption S6: The error term has constant variance.
Note that these assumptions are theoretical and typically cannot be proven or disproven.
6-15
Assumption S1: Linear in the Parameters
This assumption states that for OLS to be unbiased, the population model must be correctly specified as linear in the parameters:
  yᵢ = β0 + β1x1,i + β2x2,i + … + βkxk,i + εᵢ
6-16
When Is Assumption S1 Violated?
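The omitted variable bias described on slide 6-12 can be seen in a small simulation (the data and true coefficients here are made up for illustration and are not from the textbook): when x2 affects y and is correlated with x1, regressing y on x1 alone biases the slope on x1.

```python
import numpy as np

# Illustrative simulation: true model y = 1 + 2*x1 + 3*x2 + e,
# with x2 positively related to x1 (so omitting x2 biases the slope on x1).
rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# Short regression: x2 omitted (it hides in the error term)
Xs = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(Xs, y, rcond=None)[0]

# Long regression: x2 included and thereby held constant
Xl = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(Xl, y, rcond=None)[0]

# Omitting x2 pushes the slope on x1 toward 2 + 3*0.8 = 4.4
print(round(b_long[1], 1), round(b_short[1], 1))
```

The long regression recovers the true slope of 2 on x1, while the short regression is biased toward 2 + 3·0.8 = 4.4, matching the two conditions on slide 6-13: the bias vanishes if x2 does not affect y (the factor 3 is zero) or if x2 is unrelated to x1 (the factor 0.8 is zero).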
(1) The population regression model is non-linear in the parameters, e.g.
  yᵢ = β0 + (ln β1)x1,i + β2x2,i + … + βkxk,i + εᵢ
(2) The true population model is not specified correctly, e.g. the true model is
  yᵢ = β0 + β1(ln x1,i) + β2x2,i + … + βkxk,i + εᵢ
but the model on the previous slide is the one that is estimated.
6-17
Assumption S2: The Data Are Collected through Simple Random Sampling
This assumption states that for OLS to be unbiased, the data must be obtained through simple random sampling. This ensures that the observations are statistically independent of each other across the units of observation.
6-18
When Is Assumption S2 Violated?
(1) If the data are time-series data, such as GDP and interest rates for the US collected over time. In this circumstance, observations from this time period are likely related to observations in previous time periods.
(2) If there is some type of selection bias in the sampling – for example, if individuals opt into a job training program or into college, or if the response rate for a survey is low.
6-19
Assumption S3: The Data Are Not Perfectly Multicollinear
This assumption states that for OLS to be unbiased, no independent variable can take the same value for every observation, i.e. for j = 1, …, k
  Σᵢ (xj,i − x̄j)² ≠ 0
It also states that no independent variable is a linear combination of the other independent variables. This ensures that the slope estimators are defined. In practice this assumption is violated, for example, when the model falls into the dummy variable trap.
6-20
Assumption S4: The Error Term Has Zero Mean
This assumption states that for OLS to be unbiased, the average value of the population error term is zero:
  E(ε) = 0
This assumption will hold as long as an intercept is included in the model, because if the average value of the error term equals a value other than zero, the intercept will adjust accordingly.
6-21
Assumption S5: The Error Term Is Not Correlated with Each Independent Variable or Any Function of Each Independent Variable
This assumption states that for OLS to be unbiased, the error term is uncorrelated with each independent variable and all functions of each independent variable:
  E(ε | xj,i) = 0 for all j
This is read as "the expected value of ε given xj,i is equal to 0."
6-22
How to Determine if Assumption S5 Is Violated
(1) Think of all the factors that affect the dependent variable that are not specified in the model. For the salary vs. education example, variables in the error term include experience, ability, job type, gender, and many other factors.
(2) If any of these factors, say ability, is related to any of the independent variables, say education, then S5 is violated.
Note that the error term is never observed, so determining whether S5 is violated is only a thought experiment.
6-23
The Importance of S1 through S5
If assumptions S1 through S5 hold, then the OLS estimates are unbiased. These assumptions are less likely to be violated in multiple linear regression analysis than in simple linear regression analysis, but for non-experimental data (i.e. the type of data economists use) they almost always fail, and therefore the OLS estimates are typically biased.
6-24
Assumption S6: The Error Term Has Constant Variance
This assumption states that the error term has a constant variance, or in equation form
  Var(ε) = σ²
This is called homoskedasticity. If this assumption fails, then the error term is heteroskedastic, i.e. the error term has a non-constant variance:
  Var(εᵢ) = σᵢ²
6-25
How to Determine if Assumption S6 Is Violated
Create a scatter plot of y against each x and decide whether the points are scattered in a constant manner around the line. Heteroskedasticity does not have to look like the graph on the right on the next slide; there just has to be a non-constant distribution of the data points along the line.
Chapter 9 gives more in-depth coverage of this topic.
6-26
Visual Depiction of Homoskedasticity versus Heteroskedasticity
6-27
The Importance of S1 through S6
If assumptions S1 through S6 hold, then the OLS estimates are BLUE, the Best Linear Unbiased Estimators. Here "best" means minimum variance: among all linear unbiased estimators of the population slopes and population intercept, the OLS estimates have the lowest variance. As with simple linear regression analysis, in economics these assumptions rarely hold.
6-28
Interpret Multiple Linear Regression in Excel: Data Set
Model: housepriceᵢ = β0 + β1sqfeetᵢ + β2bedroomsᵢ + εᵢ
6-29
Scatter Diagrams
[Two scatter plots: "House Price vs. Square Feet" and "House Price vs. Bedrooms".] From these scatter diagrams it is evident that both square feet and bedrooms have a positive linear association with the price of a house.
6-30
Interpret Multiple Linear Regression in Excel: Regression Output
Estimated sample regression function:
  housepriceᵢ = 89,267.43 + 56.11·sqfeetᵢ + 30,606.62·bedroomsᵢ
6-31
Interpret Multiple Linear Regression in Excel: Interpreting the Output
Estimated sample regression function:
  housepriceᵢ = 89,267.43 + 56.11·sqfeetᵢ + 30,606.62·bedroomsᵢ
β̂0: On average, if square feet and bedrooms are 0, then the predicted house price is $89,267.43.
β̂1: On average, holding bedrooms constant, if square footage increases by one square foot then the price of the house increases by $56.11.
β̂2: On average, holding square footage constant, if the number of bedrooms increases by one then the price of the house increases by $30,606.62.
6-32
Interpret Multiple Linear Regression in Excel: Obtaining a Predicted Value
Estimated sample regression function:
  housepriceᵢ = 89,267.43 + 56.11·sqfeetᵢ + 30,606.62·bedroomsᵢ
Suppose we wish to predict the price of a house with 2,000 square feet and 3 bedrooms:
  housepriceᵢ = 89,267.43 + (56.11)(2,000) + (30,606.62)(3) = $293,309.71
The predicted price of the house is $293,309.71.
6-33
Assess the Goodness-of-Fit of the Sample Multiple Linear Regression Function: R²
  R² = ExplainedSS / TotalSS = 21,998,347,856 / 32,600,500,000 = 0.6748
The R² means that 67.48% of the variation in housing price can be explained by square feet and bedrooms.
6-34
Assess the Goodness-of-Fit of the Sample Multiple Linear Regression Function: Adjusted R²
  Adjusted R² = 1 − [UnexplainedSS / (n − k − 1)] / [TotalSS / (n − 1)]
              = 1 − [10,602,152,154 / (10 − 2 − 1)] / [32,600,500,000 / (10 − 1)] = 0.5819
The adjusted R² imposes a penalty for adding additional explanatory variables: in the numerator, as k goes up the adjusted R² goes down (if the UnexplainedSS is held constant).
6-35
Assess the Goodness-of-Fit of the Sample Multiple Linear Regression Function: Standard Error of the Regression
  s(y|x) = √(UnexplainedSS / (n − k − 1)) = √(10,602,152,154 / 7) = 38,917.77
The standard error of the regression can also be calculated by taking the square root of MSUnexplained.
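The goodness-of-fit numbers on slides 6-33 to 6-35 can all be reproduced from the ANOVA sums of squares in the Excel output. A quick check (Python is my addition; only the sums of squares and n, k come from the slides):

```python
import math

# ANOVA sums of squares from the Excel output (slides 6-33 to 6-35)
explained_ss = 21_998_347_856
unexplained_ss = 10_602_152_154
total_ss = 32_600_500_000
n, k = 10, 2  # 10 observations, 2 regressors (square feet, bedrooms)

r2 = explained_ss / total_ss
adj_r2 = 1 - (unexplained_ss / (n - k - 1)) / (total_ss / (n - 1))
ser = math.sqrt(unexplained_ss / (n - k - 1))  # standard error of the regression

print(round(r2, 4), round(adj_r2, 4), round(ser, 2))  # 0.6748 0.5819 38917.77
```

This confirms that using the UnexplainedSS (not the ExplainedSS) in the adjusted-R² numerator reproduces the slide's 0.5819.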
6-36
Perform Hypothesis Tests for the Overall Significance of the Sample Regression Function
• F-test for the overall significance of the model
• Shows whether there is a linear relationship between any of the independent variables, considered together, and the dependent variable y
• Uses the F test statistic
• Hypotheses:
  – H0: β1 = β2 = … = βk = 0 (no linear relationship)
  – H1: at least one βi ≠ 0 (at least one independent variable affects y)
6-37
F-Statistic for Overall Significance
Test statistic:
  F-Stat = (ExplainedSS / k) / (UnexplainedSS / (n − k − 1)) = MSExplained / MSUnexplained
where F has D1 = k (numerator) and D2 = n − k − 1 (denominator) degrees of freedom.
6-38
Rejection Rules for the F-Test for the Overall Significance of the Regression Model
Critical value: reject H0 if F-Stat > Fα, k, n−k−1
P-value: reject H0 if p-value < α (the p-value for this test is found under "Significance F" in the ANOVA table in Excel)
6-39
F-Test for Overall Significance
  F = MSExplained / MSUnexplained = 10,999,173,923 / 1,514,593,165 = 7.2621
with 2 and 7 degrees of freedom; the p-value for the F-test is 0.0196.
6-40
F-Test for Overall Significance
H0: β1 = β2 = 0; H1: β1 and β2 not both zero
α = .05; df1 = 2, df2 = 7
Rejection rule (critical value): reject H0 if F-Stat > F.05, 2, 7 = 4.737
Rejection rule (p-value): reject H0 if p-value < .05
Conclusion: Because 7.2621 > 4.737 (or alternatively because 0.0196 < .05), we reject H0 and conclude that at least one of square footage or bedrooms affects the price of a house.
6-41
Are Individual Independent Variables Significant?
• Use t-tests of the individual variable slopes
• Shows whether there is a linear relationship between the variable xi and y
• Hypotheses:
  – H0: βi = 0 (no linear relationship)
  – H1: βi ≠ 0 (a linear relationship does exist between xi and y)
6-42
Are Individual Independent Variables Significant?
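The overall F-statistic on slide 6-39 follows directly from the sums of squares used above. A small check (Python is my addition; the slide's MSExplained differs from ExplainedSS/2 by a few units of rounding in the Excel display):

```python
# F-statistic for overall significance, from the ANOVA sums of squares
explained_ss = 21_998_347_856
unexplained_ss = 10_602_152_154
n, k = 10, 2

ms_explained = explained_ss / k              # numerator mean square, df1 = k = 2
ms_unexplained = unexplained_ss / (n - k - 1)  # denominator mean square, df2 = 7
f_stat = ms_explained / ms_unexplained

print(round(f_stat, 4))  # about 7.2621, matching slide 6-39
```

Since 7.2621 exceeds the 5% critical value F(.05, 2, 7) = 4.737, the test rejects H0, as on slide 6-40.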
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (a linear relationship does exist between xi and y)
Three ways to test this hypothesis:
(1) Confidence interval
(2) Critical value
(3) p-value
6-43
Using a Confidence Interval to Test Individual Statistical Significance
H0: βi = 0 (no linear relationship between xi and y)
H1: βi ≠ 0 (linear relationship exists between xi and y)
  β̂i ± tα/2, n−k−1 · s(β̂i)
Reject H0 if 0 is not within the confidence interval. Here α is 1 minus the confidence level; the confidence level is usually 95%, so α = .05.
6-44
Confidence Interval Estimate for the Slope
Confidence interval for the population slope β1 (the effect of changes in square feet on house prices):
  β̂1 ± t.05/2, 10−2−1 · s(β̂1) = 56.1112 ± (2.36)(48.8592) = (−59.1965, 171.4189)
Decision: This confidence interval includes 0, so we fail to reject H0 and conclude that square feet does not have a statistically significant effect on the price of a house at the 5% level. (The interval differs slightly from the Excel output due to rounding.)
6-45
Using Critical Values to Test Individual Statistical Significance
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship exists between xi and y)
Test statistic:
  t-statistic = (β̂i − 0) / s(β̂i)
Rejection rule: reject H0 if |t-statistic| > tα/2, n−k−1
6-46
Using Critical Values to Test for Individual Significance of Square Feet (x1)
  t-statistic = (β̂1 − 0) / s(β̂1) = (56.1112 − 0) / 48.8592 = 1.1484;  t.025, 7 = 2.36
Rejection rule: reject H0 if |t-statistic| > 2.36
Decision: Because 1.1484 < 2.36, we fail to reject H0 and conclude that square feet does not have a statistically significant effect on the price of a house at the 5% level.
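The confidence-interval and critical-value methods on slides 6-44 and 6-46 use only the coefficient, its standard error, and the rounded critical value 2.36. A quick reproduction (Python is my addition):

```python
# Slope on square feet and its standard error, from the Excel output (slide 6-44)
b1, se_b1 = 56.1112, 48.8592
t_crit = 2.36  # t(.025, 7), rounded as on the slide

# (1) 95% confidence interval for beta1
lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1

# (2) t-statistic against H0: beta1 = 0
t_stat = (b1 - 0) / se_b1

print(round(lower, 4), round(upper, 4), round(t_stat, 4))
```

The interval (−59.1965, 171.4189) contains 0 and |1.1484| < 2.36, so both methods fail to reject H0, as the slides conclude.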
6-47
Using p-values to Test Individual Statistical Significance
H0: βi = 0 (no linear relationship)
H1: βi ≠ 0 (linear relationship exists between xi and y)
Test statistic:
  t-statistic = (β̂i − 0) / s(β̂i)
  p-value = 2 · P(Tn−k−1 ≥ |t-statistic|)
(Usually the p-value is found on the Excel output.)
Rejection rule: reject H0 if p-value < α
6-48
Using p-values to Test for Individual Significance of Square Feet (x1)
  t-statistic = (β̂1 − 0) / s(β̂1) = (56.1112 − 0) / 48.8592 = 1.1484;  p-value = 0.2885
Rejection rule: reject H0 if p-value < .05
Decision: Because 0.2885 > .05, we fail to reject H0 and conclude that square feet does not have a statistically significant effect on the price of a house at the 5% level.
6-49
Things to Note about the Different Methods for Tests of Individual Significance
(1) All three methods yield the same conclusions.
(2) To test for the individual significance of bedrooms instead of square footage, follow the same process but use the row below square footage in the Excel output.
Using any of the three methods, we see that bedrooms is also statistically insignificant at the 5% level.
6-50
What Is Multicollinearity?*
Multicollinearity is when two of the independent variables are highly linearly related. Note that multicollinearity is not perfect multicollinearity: perfect multicollinearity implies that the correlation coefficient between two independent variables is 1 in absolute value, while multicollinearity means that the correlation coefficient is high but not perfect.
*Note: This material is not covered in the textbook.
6-51
Venn Diagram Explanation of Multicollinearity
6-52
What Are the Implications of Multicollinearity?
Unlike with perfect multicollinearity, OLS estimates can still be obtained, and they are still unbiased. However, the standard errors are large because very little independent variation goes into the estimation of each of the slopes.
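A simple way to screen for the multicollinearity described on slides 6-50 to 6-52 is to compute the correlation between regressors (and, for a two-regressor model, the implied variance inflation factor). The data below are made up for illustration; the slides do not include a numerical multicollinearity example:

```python
import numpy as np

# Hypothetical regressors: x2 is built to be nearly a linear function of x1,
# so the pair is highly (but not perfectly) multicollinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=200)

r = np.corrcoef(x1, x2)[0, 1]      # pairwise correlation between the regressors
vif = 1.0 / (1.0 - r**2)           # variance inflation factor (two-regressor case)

print(round(r, 3), round(vif, 1))
```

A correlation near 1 in absolute value (but not exactly 1) produces a large VIF: OLS is still computable and unbiased, but the slope standard errors are inflated, which is exactly the implication on slide 6-52.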
6-53
Perform Hypothesis Tests for the Joint Significance of a Subset of Slope Coefficients
The original regression model is
  yᵢ = β0 + β1x1,i + β2x2,i + β3x3,i + εᵢ
After testing for individual significance, x2 and x3 are individually statistically insignificant at the 5% level. The researcher would like to know if x2 and x3 are jointly statistically significant.
6-54
Perform Hypothesis Tests for a Subset of Explanatory Variables
This is an F-test for joint statistical significance.
Hypotheses:
H0: β2 = β3 = 0 (no joint linear relationship)
H1: at least one of β2 or β3 explains y
Unrestricted model (the original model):
  yᵢ = β0 + β1x1,i + β2x2,i + β3x3,i + εᵢ
Restricted model (the model with the null hypothesis imposed, in this case β2 = β3 = 0):
  yᵢ = β0 + β1x1,i + εᵢ*
6-55
F-Statistic for Joint Significance
Test statistic:
  F-Statistic = [(UnexplainedSSrestricted − UnexplainedSSunrestricted) / q] / [UnexplainedSSunrestricted / (n − k − 1)]
or
  F-Statistic = [(R²unrestricted − R²restricted) / q] / [(1 − R²unrestricted) / (n − k − 1)]
where q is the number of restrictions (the number of equal signs in the null hypothesis, in this case 2).
6-56
Rejection Rule for the F-Test for Joint Significance
Critical value: reject H0 if F-Stat > Fα, q, n−k−1
For this test, it is necessary to run two regressions:
(1) the unrestricted regression
(2) the restricted regression
6-57
For the Housing Price Example
The original model is
  housepriceᵢ = β0 + β1lotsizeᵢ + β2sqfeetᵢ + β3bedroomsᵢ + εᵢ
The unexplained SS from this regression is UnexplainedSSunrestricted.
Using the p-values, lot size is individually statistically significant at the 5% level, but square feet and bedrooms are statistically insignificant at the 5% level.
6-58
Testing if Square Feet and Bedrooms Are Jointly Equal to 0
Hypotheses:
H0: β2 = β3 = 0 (no joint linear relationship)
H1: at least one of β2 or β3 explains y
Restricted model (the model with the null hypothesis imposed, in this case β2 = β3 = 0):
  housepriceᵢ = β0 + β1lotsizeᵢ + εᵢ*
The unexplained SS from this regression is UnexplainedSSrestricted.
6-59
F-Statistic for Joint Significance
Test statistic:
  F-Statistic = [(621,726,567 − 456,205,475) / 2] / [456,205,475 / 6] = 1.0885
Reject H0 if F-Stat > F.05, 2, 6 = 5.143
Decision: Because 1.0885 is not greater than 5.143, we fail to reject H0 and conclude that square feet and bedrooms do not jointly affect house price.
6-60
F-Statistic for Joint Significance Using R²
Test statistic:
  F-Statistic = [(0.9860 − 0.9809) / 2] / [(1 − 0.9860) / 6] ≈ 1.0885
Reject H0 if F-Stat > F.05, 2, 6 = 5.143
Decision: Because the F-statistic is not greater than 5.143, we fail to reject H0 and conclude that square feet and bedrooms do not jointly affect house price. Notice that, up to rounding of the R² values, we obtain the same F-statistic using the unexplained sums of squares as using R².
6-61
Chow Test
Use the Chow test to test whether there are statistical differences between two groups, such as men and women, those who have graduated from college and those who haven't, etc.
For the Chow test, run three regressions:
(1) the entire data set all together; its USS is UnexplainedSSrestricted
(2) one subset of the data (e.g. only the men); its USS is UnexplainedSS1
(3) the other subset of the data (e.g. only the women); its USS is UnexplainedSS2
6-62
The Hypotheses, Test Statistic, and Rejection Rule
H0: There are no differences between the two groups
H1: There is at least one difference between the two groups
  F-Statistic = [(USSrestricted − USS1 − USS2) / (k + 1)] / [(USS1 + USS2) / (2(n − k − 1))]
Rejection rule: reject H0 if F-Stat > Fα, k+1, 2(n−k−1)
If the null hypothesis is rejected, then we conclude that a difference exists between the two groups, in the intercepts, the slopes, or both.
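The joint F-statistic on slide 6-59 can be verified directly from the two unexplained sums of squares (Python is my addition; the Chow statistic on slide 6-62 has the same restricted-vs-unrestricted form, with the two subset regressions together playing the role of the unrestricted model):

```python
# Joint F-test for square feet and bedrooms, using the numbers on slide 6-59
uss_restricted = 621_726_567    # USS from the lot-size-only regression
uss_unrestricted = 456_205_475  # USS from the full regression
q = 2                           # number of restrictions (beta2 = beta3 = 0)
df_denom = 6                    # n - k - 1 for the unrestricted model

f_stat = ((uss_restricted - uss_unrestricted) / q) / (uss_unrestricted / df_denom)
print(round(f_stat, 4))  # about 1.0885, matching slide 6-59
```

Because 1.0885 is below the critical value F(.05, 2, 6) = 5.143, the test fails to reject the joint null, matching the slide's decision.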
6-63
Creating a Confidence Interval Around a Prediction in Multiple Linear Regression*
The formula for the confidence interval is
  ŷp ± tα/2, n−k−1 · s(ŷp)
where ŷp is the predicted value, tα/2, n−k−1 is the critical value from the t-table, and s(ŷp) is the standard error of the prediction. The only component that we don't yet know how to obtain is the standard error of the prediction.
*Note: This material is not covered in the textbook.
6-64
Finding the Standard Error of the Prediction
There is no straightforward formula for the standard error of the prediction like there is in simple linear regression. To find this standard error, we need to create new variables and run an additional regression: for each observation and for each independent variable, subtract off the value at which you are interested in predicting.
6-65
Original Regression Results
6-66
The Housing Price Example from Before
Estimated sample regression function:
  housepriceᵢ = 89,267.43 + 56.11·sqfeetᵢ + 30,606.62·bedroomsᵢ
Suppose we wish to predict the price of a house with 2,000 square feet and 3 bedrooms:
  housepriceᵢ = 89,267.43 + (56.11)(2,000) + (30,606.62)(3) = $293,309.71
Say we want to put a confidence interval around this prediction.
6-67
An Example of How to Find the Standard Error of the Prediction
Create two new variables in Excel by subtracting 2,000 from each square-feet observation and 3 from each bedrooms observation, and then run a regression with price as the dependent variable and the two new variables just created as the independent variables. The intercept of this new regression is the predicted value at (2,000 square feet, 3 bedrooms), and its standard error is the standard error of the prediction.
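The re-centering trick on slides 6-64 and 6-67 can be demonstrated numerically. The ten houses below are made up (the slides' actual data set is not reproduced here); the point is that after subtracting the prediction point from each regressor, the new intercept equals the prediction:

```python
import numpy as np

# Made-up housing data for illustration (not the slides' data set)
rng = np.random.default_rng(1)
sqfeet = rng.uniform(800, 3000, size=10)
bedrooms = np.array([2.0, 3.0, 4.0, 1.0, 5.0, 3.0, 2.0, 4.0, 3.0, 5.0])
price = 90_000 + 55 * sqfeet + 30_000 * bedrooms + rng.normal(0, 20_000, size=10)

# Original regression and the prediction at (2,000 sq ft, 3 bedrooms)
X = np.column_stack([np.ones(10), sqfeet, bedrooms])
b = np.linalg.lstsq(X, price, rcond=None)[0]
pred = b[0] + b[1] * 2000 + b[2] * 3

# Re-centered regression: subtract the prediction point from each regressor
Xc = np.column_stack([np.ones(10), sqfeet - 2000, bedrooms - 3])
bc = np.linalg.lstsq(Xc, price, rcond=None)[0]

# The centered intercept equals the prediction (slopes are unchanged)
print(np.isclose(bc[0], pred))
```

This is why the standard error reported for the intercept of the re-centered regression is exactly the standard error of the prediction that the confidence-interval formula needs.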
6-68
Example of Making the New Independent Variables
The dependent variable stays the same; the two independent variables are replaced by (sqfeet − 2,000) and (bedrooms − 3).
6-69
Excel Regression Results to Find a 95% Confidence Interval for a Mean Value
Predicted value: $293,309.71 (note this is the same value we found earlier)
Standard error of the mean prediction: $22,932.24
95% confidence interval for the mean: ($239,083.58, $347,535.84)
6-70
Excel Regression Results to Find a 95% Prediction Interval for an Individual Value
Predicted value: $293,309.71; critical value = 2.36
Standard error of an individual prediction: $22,932.24 + $38,917.77 = $61,850.01
95% prediction interval for an individual:
  293,309.71 ± (2.36)(61,850.01) = 293,309.71 ± 145,966.02 = (147,343.69, 439,275.73)
Notice how much bigger this interval is than the interval for the mean.
6-71
How to Test if Two Coefficient Estimates Are Equal*
Say the original regression model is
  yᵢ = β0 + β1x1,i + β2x2,i + β3x3,i + εᵢ
and you want to test if β1 is equal to β2:
H0: β1 = β2, or β1 − β2 = 0
H1: β1 ≠ β2, or β1 − β2 ≠ 0
This is a t-test:
  t-statistic = (β̂1 − β̂2 − 0) / s(β̂1 − β̂2)
but s(β̂1 − β̂2) is difficult to obtain in Excel.
*Note: This material is not covered in the textbook.
6-72
How to Obtain s(β̂1 − β̂2) in Excel
(1) Set β1 − β2 = θ and solve for β1: β1 = β2 + θ.
(2) Substitute β2 + θ for β1 in the regression model and collect terms:
  yᵢ = β0 + β1x1,i + β2x2,i + β3x3,i + εᵢ
  yᵢ = β0 + (β2 + θ)x1,i + β2x2,i + β3x3,i + εᵢ
  yᵢ = β0 + θx1,i + β2(x1,i + x2,i) + β3x3,i + εᵢ
(3) Create a new variable (x1,i + x2,i) and regress y on x1,i, (x1,i + x2,i), and x3,i. The t-statistic and the p-value for the test are in the row for x1,i.
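The reparameterization on slide 6-72 can be checked numerically: after regressing y on x1, (x1 + x2), and x3, the coefficient on x1 equals β̂1 − β̂2 from the original regression. The data below are made up for illustration (the slides carry this out in Excel on their housing data):

```python
import numpy as np

# Made-up data: y depends on three regressors (illustrative only)
rng = np.random.default_rng(2)
n = 50
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.5 * x2 - 1.0 * x3 + rng.normal(size=n)

# Original regression: y on x1, x2, x3
X_orig = np.column_stack([np.ones(n), x1, x2, x3])
b = np.linalg.lstsq(X_orig, y, rcond=None)[0]

# Reparameterized regression: y on x1, (x1 + x2), x3
X_new = np.column_stack([np.ones(n), x1, x1 + x2, x3])
theta = np.linalg.lstsq(X_new, y, rcond=None)[0][1]  # coefficient on x1

# theta equals b1_hat - b2_hat from the original regression
print(np.isclose(theta, b[1] - b[2]))
```

The identity is exact because the two design matrices span the same column space; Excel's reported standard error and p-value in the x1 row therefore give the test of H0: β1 = β2 directly.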
6-73
Original Regression
  housepriceᵢ = 89,238.27 + 32,416.39·bedroomsᵢ − 4,257.54·bathroomsᵢ + 61.61·sqfeetᵢ
Say we want to test whether the coefficients on bedrooms and bathrooms are equal.
Point estimate of θ: 32,416.39 − (−4,257.54) = $36,673.93
6-74
Create a New Variable (Bedrooms + Bathrooms)
The dependent variable stays the same; the regression now uses bedrooms, (bedrooms + bathrooms), and square feet as the independent variables.
6-75
Excel Regression Results for the Test of Equal Coefficients
Point estimate of β̂1 − β̂2: $36,673.93
Standard error of β̂1 − β̂2: $67,466.01
t-statistic for this test: 0.5436; p-value = 0.6063
We fail to reject H0 and conclude β1 = β2.
6-76