Tim DiPalo
import pandas as pd
from statsmodels.formula.api import ols as sm_ols
import numpy as np
import seaborn as sns
from statsmodels.iolib.summary2 import summary_col # nicer tables

Part 1: EDA

Insert cells as needed below to write a short EDA/data section that summarizes the data for someone who has never opened it before.

housing = pd.read_csv('input_data2/housing_train.csv')
housing
parcel v_MS_SubClass v_MS_Zoning v_Lot_Frontage v_Lot_Area v_Street v_Alley v_Lot_Shape v_Land_Contour v_Utilities ... v_Pool_Area v_Pool_QC v_Fence v_Misc_Feature v_Misc_Val v_Mo_Sold v_Yr_Sold v_Sale_Type v_Sale_Condition v_SalePrice
0 1056_528110080 20 RL 107.0 13891 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 New Partial 372402
1 1055_528108150 20 RL 98.0 12704 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 New Partial 317500
2 1053_528104050 20 RL 114.0 14803 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 New Partial 385000
3 2213_909275160 20 RL 126.0 13108 Pave NaN IR2 HLS AllPub ... 0 NaN NaN NaN 0 6 2007 WD Normal 153500
4 1051_528102030 20 RL 96.0 12444 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 11 2008 New Partial 394617
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1936 2524_534125210 190 RL 79.0 13110 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv NaN 0 7 2006 WD Normal 146500
1937 2846_909131125 190 RH NaN 7082 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 7 2006 WD Normal 160000
1938 2605_535382020 190 RL 60.0 10800 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2006 ConLD Normal 160000
1939 1516_909101180 190 RL 55.0 5687 Pave Grvl Reg Bnk AllPub ... 0 NaN NaN NaN 0 3 2008 WD Normal 135900
1940 1387_905200100 190 RL 60.0 12900 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 WD Alloca 95541

1941 rows × 81 columns

Essential Questions:

The unit of observation is a house/property

The time period is the years 2006 to 2008, with a fairly even distribution of observations between those years

The parcel variable is a unique identifier for this dataset

The dataset has a sample size of 1941 observations

Many of the variables in this dataset are categorical, mostly describing the type of property or of certain features. For example: v_MS_Zoning, v_Lot_Shape, and v_Sale_Condition.

Some variables in the dataset are continuous or discrete, with many giving the square footage of certain features of the property. For example: v_Lot_Area, v_Low_Qual_Fin_SF, and v_SalePrice. (A few quick checks behind these claims are sketched below.)
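A minimal sketch of how each claim can be verified (this code was not in the original notebook):

# Quick checks behind the "Essential Questions" answers above
print(housing.shape)                        # (1941, 81): sample size and column count
print(housing['parcel'].is_unique)          # True: parcel uniquely identifies each property
print(housing['v_Yr_Sold'].value_counts())  # roughly even split across 2006-2008
print(housing.dtypes.value_counts())        # mix of object (categorical) and numeric columns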

Potential Issues

Some variables have large amounts of missing data; however, I believe these often just indicate a property that doesn't have whatever feature that variable describes. For example, v_Fence has 1,576 missing values, but the documentation makes clear that properties without a fence are filled with NaN values.
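A quick way to surface these counts (a sketch; not from the original notebook):

# Columns with the most NaNs; per the data dictionary, many of these indicate
# "feature not present" (e.g., no fence) rather than truly missing data
print(housing.isna().sum().sort_values(ascending=False).head(10))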

I identified some outliers using a scatterplot of v_SalePrice against v_Lot_Area, one of the variables used in the regressions below; the plot is shown below.

The box plot of sale price by year also displays these outliers above the upper whiskers; it is shown below as well.

sns.scatterplot(x='v_SalePrice', y='v_Lot_Area', data=housing)
<AxesSubplot: xlabel='v_SalePrice', ylabel='v_Lot_Area'>

[Figure: scatterplot of v_Lot_Area against v_SalePrice]

sns.boxplot(x='v_Yr_Sold', y='v_SalePrice', data=housing)
<AxesSubplot: xlabel='v_Yr_Sold', ylabel='v_SalePrice'>

[Figure: box plots of v_SalePrice by v_Yr_Sold]

Data Visualization Takeaways

Continuous variables: Sale price and lot area - as seen above, the relationship appears roughly linear, though with a handful of extreme outliers

Categorical variables: Sale price and year sold - as seen above, the box plots show similar high-price outliers in every year
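To pin down the extreme observations flagged in both plots, one option (an illustrative sketch; not from the original notebook):

# The largest lots and the most expensive sales, which drive the outliers
print(housing[['v_SalePrice', 'v_Lot_Area']].nlargest(5, 'v_Lot_Area'))
print(housing[['v_SalePrice', 'v_Lot_Area']].nlargest(5, 'v_SalePrice'))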

Part 2: Running Regressions

Run these regressions on the RAW data, even if you found data issues that you think should be addressed.

Insert cells as needed below to run these regressions. Note that $i$ is indexing a given house, and $t$ indexes the year of sale.

  1. $\text{Sale Price}_{i,t} = \alpha + \beta_1 * \text{v_Lot_Area}$
  2. $\text{Sale Price}_{i,t} = \alpha + \beta_1 * log(\text{v_Lot_Area})$
  3. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * \text{v_Lot_Area}$
  4. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * log(\text{v_Lot_Area})$
  5. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * \text{v_Yr_Sold}$
  6. $log(\text{Sale Price}_{i,t}) = \alpha + \beta_1 * (\text{v_Yr_Sold==2007})+ \beta_2 * (\text{v_Yr_Sold==2008})$
  7. Choose your own adventure: Pick any five variables from the dataset that you think will generate good R2. Use them in a regression of $log(\text{Sale Price}_{i,t})$
    • Tip: You can transform/create these five variables however you want, even if it creates extra variables. For example: I’d count Model 6 above as only using one variable: v_Yr_Sold.
    • I got an R2 of 0.877 with just “5” variables. How close can you get? I won’t be shocked if someone beats that!

Bonus formatting trick: Instead of reporting all regressions separately, report all seven regressions in a single table using summary_col.

reg1 = sm_ols(
    "v_SalePrice ~ v_Lot_Area", data=housing
).fit()
reg1.summary()
OLS Regression Results
Dep. Variable: v_SalePrice R-squared: 0.067
Model: OLS Adj. R-squared: 0.066
Method: Least Squares F-statistic: 138.3
Date: Sun, 02 Apr 2023 Prob (F-statistic): 6.82e-31
Time: 16:13:19 Log-Likelihood: -24610.
No. Observations: 1941 AIC: 4.922e+04
Df Residuals: 1939 BIC: 4.924e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.548e+05 2911.591 53.163 0.000 1.49e+05 1.6e+05
v_Lot_Area 2.6489 0.225 11.760 0.000 2.207 3.091
Omnibus: 668.513 Durbin-Watson: 1.064
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3001.894
Skew: 1.595 Prob(JB): 0.00
Kurtosis: 8.191 Cond. No. 2.13e+04



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

reg2 = sm_ols(
    "v_SalePrice ~ np.log(v_Lot_Area)", data=housing
).fit()
reg2.summary()
OLS Regression Results
Dep. Variable: v_SalePrice R-squared: 0.128
Model: OLS Adj. R-squared: 0.128
Method: Least Squares F-statistic: 285.6
Date: Sun, 02 Apr 2023 Prob (F-statistic): 6.95e-60
Time: 16:13:19 Log-Likelihood: -24544.
No. Observations: 1941 AIC: 4.909e+04
Df Residuals: 1939 BIC: 4.910e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -3.279e+05 3.02e+04 -10.850 0.000 -3.87e+05 -2.69e+05
np.log(v_Lot_Area) 5.603e+04 3315.139 16.901 0.000 4.95e+04 6.25e+04
Omnibus: 650.067 Durbin-Watson: 1.042
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2623.687
Skew: 1.587 Prob(JB): 0.00
Kurtosis: 7.729 Cond. No. 164.



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

reg3 = sm_ols(
    "np.log(v_SalePrice) ~ v_Lot_Area", data=housing
).fit()
reg3.summary()
OLS Regression Results
Dep. Variable: np.log(v_SalePrice) R-squared: 0.065
Model: OLS Adj. R-squared: 0.064
Method: Least Squares F-statistic: 133.9
Date: Sun, 02 Apr 2023 Prob (F-statistic): 5.46e-30
Time: 16:13:19 Log-Likelihood: -927.19
No. Observations: 1941 AIC: 1858.
Df Residuals: 1939 BIC: 1870.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.8941 0.015 813.211 0.000 11.865 11.923
v_Lot_Area 1.309e-05 1.13e-06 11.571 0.000 1.09e-05 1.53e-05
Omnibus: 75.460 Durbin-Watson: 0.980
Prob(Omnibus): 0.000 Jarque-Bera (JB): 218.556
Skew: -0.066 Prob(JB): 3.48e-48
Kurtosis: 4.639 Cond. No. 2.13e+04



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

reg4 = sm_ols(
    "np.log(v_SalePrice) ~ np.log(v_Lot_Area)", data=housing
).fit()
reg4.summary()
OLS Regression Results
Dep. Variable: np.log(v_SalePrice) R-squared: 0.135
Model: OLS Adj. R-squared: 0.135
Method: Least Squares F-statistic: 302.5
Date: Sun, 02 Apr 2023 Prob (F-statistic): 4.38e-63
Time: 16:13:19 Log-Likelihood: -851.27
No. Observations: 1941 AIC: 1707.
Df Residuals: 1939 BIC: 1718.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.4051 0.151 62.253 0.000 9.109 9.701
np.log(v_Lot_Area) 0.2883 0.017 17.394 0.000 0.256 0.321
Omnibus: 84.067 Durbin-Watson: 0.955
Prob(Omnibus): 0.000 Jarque-Bera (JB): 255.283
Skew: -0.100 Prob(JB): 3.68e-56
Kurtosis: 4.765 Cond. No. 164.



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

reg5 = sm_ols(
    "np.log(v_SalePrice) ~ v_Yr_Sold", data=housing
).fit()
reg5.summary()
OLS Regression Results
Dep. Variable: np.log(v_SalePrice) R-squared: 0.000
Model: OLS Adj. R-squared: -0.000
Method: Least Squares F-statistic: 0.2003
Date: Sun, 02 Apr 2023 Prob (F-statistic): 0.655
Time: 16:13:19 Log-Likelihood: -991.88
No. Observations: 1941 AIC: 1988.
Df Residuals: 1939 BIC: 1999.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 22.2932 22.937 0.972 0.331 -22.690 67.277
v_Yr_Sold -0.0051 0.011 -0.448 0.655 -0.028 0.017
Omnibus: 55.641 Durbin-Watson: 0.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 131.833
Skew: 0.075 Prob(JB): 2.36e-29
Kurtosis: 4.268 Cond. No. 5.03e+06



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.03e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

reg6 = sm_ols(
    "np.log(v_SalePrice) ~ (v_Yr_Sold==2007) + (v_Yr_Sold==2008)", data=housing
).fit()
reg6.summary()
OLS Regression Results
Dep. Variable: np.log(v_SalePrice) R-squared: 0.001
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.394
Date: Sun, 02 Apr 2023 Prob (F-statistic): 0.248
Time: 16:13:19 Log-Likelihood: -990.59
No. Observations: 1941 AIC: 1987.
Df Residuals: 1938 BIC: 2004.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 12.0229 0.016 745.087 0.000 11.991 12.055
v_Yr_Sold == 2007[T.True] 0.0256 0.022 1.150 0.250 -0.018 0.069
v_Yr_Sold == 2008[T.True] -0.0103 0.023 -0.450 0.653 -0.055 0.035
Omnibus: 54.618 Durbin-Watson: 0.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 127.342
Skew: 0.080 Prob(JB): 2.23e-28
Kurtosis: 4.245 Cond. No. 3.79



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

reg7 = sm_ols(
    "np.log(v_SalePrice) ~ v_Neighborhood + v_MS_SubClass + v_Low_Qual_Fin_SF + v_Fireplaces + v_Year_Built", data=housing
).fit()
reg7.summary()
OLS Regression Results
Dep. Variable: np.log(v_SalePrice) R-squared: 0.684
Model: OLS Adj. R-squared: 0.679
Method: Least Squares F-statistic: 133.1
Date: Sun, 02 Apr 2023 Prob (F-statistic): 0.00
Time: 16:13:19 Log-Likelihood: 125.02
No. Observations: 1941 AIC: -186.0
Df Residuals: 1909 BIC: -7.779
Df Model: 31
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 4.5093 0.739 6.104 0.000 3.060 5.958
v_Neighborhood[T.Blueste] -0.2082 0.126 -1.656 0.098 -0.455 0.038
v_Neighborhood[T.BrDale] -0.3727 0.073 -5.087 0.000 -0.516 -0.229
v_Neighborhood[T.BrkSide] -0.2105 0.064 -3.285 0.001 -0.336 -0.085
v_Neighborhood[T.ClearCr] 0.0662 0.068 0.975 0.330 -0.067 0.199
v_Neighborhood[T.CollgCr] 0.0531 0.055 0.969 0.333 -0.054 0.160
v_Neighborhood[T.Crawfor] 0.1379 0.062 2.227 0.026 0.016 0.259
v_Neighborhood[T.Edwards] -0.2070 0.058 -3.557 0.000 -0.321 -0.093
v_Neighborhood[T.Gilbert] -0.0473 0.056 -0.843 0.399 -0.157 0.063
v_Neighborhood[T.Greens] -0.0720 0.115 -0.626 0.531 -0.297 0.153
v_Neighborhood[T.GrnHill] 0.4378 0.170 2.578 0.010 0.105 0.771
v_Neighborhood[T.IDOTRR] -0.3038 0.065 -4.688 0.000 -0.431 -0.177
v_Neighborhood[T.Landmrk] -0.1571 0.235 -0.669 0.503 -0.617 0.303
v_Neighborhood[T.MeadowV] -0.4835 0.070 -6.879 0.000 -0.621 -0.346
v_Neighborhood[T.Mitchel] -0.1097 0.059 -1.858 0.063 -0.226 0.006
v_Neighborhood[T.NAmes] -0.1327 0.056 -2.358 0.018 -0.243 -0.022
v_Neighborhood[T.NPkVill] -0.1467 0.093 -1.584 0.113 -0.328 0.035
v_Neighborhood[T.NWAmes] 0.0283 0.058 0.484 0.629 -0.086 0.143
v_Neighborhood[T.NoRidge] 0.4514 0.061 7.350 0.000 0.331 0.572
v_Neighborhood[T.NridgHt] 0.3878 0.056 6.917 0.000 0.278 0.498
v_Neighborhood[T.OldTown] -0.1142 0.062 -1.827 0.068 -0.237 0.008
v_Neighborhood[T.SWISU] -0.0695 0.073 -0.948 0.343 -0.213 0.074
v_Neighborhood[T.Sawyer] -0.1656 0.059 -2.823 0.005 -0.281 -0.051
v_Neighborhood[T.SawyerW] 0.0197 0.059 0.334 0.739 -0.096 0.136
v_Neighborhood[T.Somerst] 0.1561 0.055 2.814 0.005 0.047 0.265
v_Neighborhood[T.StoneBr] 0.4416 0.064 6.945 0.000 0.317 0.566
v_Neighborhood[T.Timber] 0.1678 0.061 2.735 0.006 0.047 0.288
v_Neighborhood[T.Veenker] 0.1494 0.073 2.042 0.041 0.006 0.293
v_MS_SubClass -0.0003 0.000 -2.180 0.029 -0.001 -3.08e-05
v_Low_Qual_Fin_SF 0.0003 0.000 2.086 0.037 1.57e-05 0.001
v_Fireplaces 0.1905 0.009 21.120 0.000 0.173 0.208
v_Year_Built 0.0038 0.000 10.269 0.000 0.003 0.004
Omnibus: 298.259 Durbin-Watson: 1.671
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2429.518
Skew: -0.467 Prob(JB): 0.00
Kurtosis: 8.401 Cond. No. 2.83e+05



Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.83e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

# Print a combined regression table (adapted from class notes)

# Include extra info (not just coefficients) in the table
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",
           'Adj R-squared' : lambda x: f"{x.rsquared_adj:.2f}",
           'No. observations' : lambda x: f"{int(x.nobs):d}"}

# summary_col combines several fitted regressions into one table
print('='*108)
print(summary_col(results=[reg1,reg2,reg3,reg4,reg5,reg6], # list the fitted result objects
                  float_format='%0.2f',
                  stars = True, # stars flag statistically significant coefficients
                  model_names=['1','2',' 3 ','4 ','5','6'], # model numbers from Part 2
                  info_dict=info_dict,
                  )
     )
============================================================================================================

======================================================================================
                               1             2          3        4       5       6    
--------------------------------------------------------------------------------------
Intercept                 154789.55*** -327915.80*** 11.89*** 9.41*** 22.29   12.02***
                          (2911.59)    (30221.35)    (0.01)   (0.15)  (22.94) (0.02)  
R-squared                 0.07         0.13          0.06     0.13    0.00    0.00    
R-squared Adj.            0.07         0.13          0.06     0.13    -0.00   0.00    
np.log(v_Lot_Area)                     56028.17***            0.29***                 
                                       (3315.14)              (0.02)                  
v_Lot_Area                2.65***                    0.00***                          
                          (0.23)                     (0.00)                           
v_Yr_Sold                                                             -0.01           
                                                                      (0.01)          
v_Yr_Sold == 2007[T.True]                                                     0.03    
                                                                              (0.02)  
v_Yr_Sold == 2008[T.True]                                                     -0.01   
                                                                              (0.02)  
R-squared                 0.07         0.13          0.06     0.13    0.00    0.00    
Adj R-squared             0.07         0.13          0.06     0.13    -0.00   0.00    
No. observations          1941         1941          1941     1941    1941    1941    
======================================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

Part 3: Regression interpretation

Insert cells as needed below to answer these questions. Note that $i$ is indexing a given house, and $t$ indexes the year of sale.

  1. If you didn’t use the summary_col trick, list $\beta_1$ for Models 1-6 to make it easier on your graders.
  2. Interpret $\beta_1$ in Model 2.
  3. Interpret $\beta_1$ in Model 3.
    • HINT: You might need to print out more decimal places. Show at least 2 non-zero digits.
  4. Of models 1-4, which do you think best explains the data and why?
  5. Interpret $\beta_1$ In Model 5
  6. Interpret $\alpha$ in Model 6
  7. Interpret $\beta_1$ in Model 6
  8. Why is the R2 of Model 6 higher than the R2 of Model 5?
  9. What variables did you include in Model 7?
  10. What is the R2 of your Model 7?
  11. Speculate (not graded): Could you use the specification of Model 6 in a predictive regression?
  12. Speculate (not graded): Could you use the specification of Model 5 in a predictive regression?

Question 1:

β1 for Models 1-6 can be read from the combined summary table printed above.

Question 2:

In Model 2, a 1% increase in lot area is associated with an increase in sale price of about $560 (56,028 × 0.01), on average.

Question 3:

In Model 3, a one-square-foot increase in lot area is associated with an increase in sale price of about 0.0013% (β1 = 0.00001309, and 100 × β1 ≈ 0.0013), on average.
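Since the combined table rounds Model 3's slope to 0.00, a quick sketch (not in the original notebook) to print more digits:

# Show the Model 3 coefficient with more decimal places
print(f"{reg3.params['v_Lot_Area']:.8f}")  # 0.00001309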

Question 4:

I think that Model 4 best explains the data because it has the highest R2 and because taking the log of both the sale price and the lot area makes the most sense for this model. Since Model 4 has the highest R2 of these models, it explains more of the variation in sale price than the other models. Additionally, based on the EDA I performed above, taking the log of both variables helps normalize their distributions, reducing the effect of the outliers I identified in Part 1.

Question 5:

When the year sold increases by 1, the sale price of the home decreases by about 0.51% on average, although this coefficient is not statistically significant (p = 0.655).
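As a check (a sketch; not in the original notebook), the exact percentage change implied by the log-level coefficient:

# Exact percent change implied by a log-level coefficient: 100*(e^beta - 1)
print(f"{np.expm1(reg5.params['v_Yr_Sold']) * 100:.2f}%")  # about -0.51%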

Question 6:

α is the average log sale price for houses sold in 2006, the omitted year: 12.02, which corresponds to a typical 2006 sale price of roughly $166,500 (see the quick check below).
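To back out that dollar figure (a sketch; not in the original notebook):

# Exponentiate the Model 6 intercept to get the implied typical 2006 sale price
print(np.exp(reg6.params['Intercept']))  # roughly 166,500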

Question 7:

β1 = 0.026 means that houses sold in 2007 sold for about 2.6% more than houses sold in 2006 (the omitted year) on average, although this difference is not statistically significant (p = 0.250).

Question 8:

The R2 of Model 6 is higher than the R2 of Model 5 because the year dummies estimate a separate average sale price for each year, while Model 5 forces a single linear trend across 2006-2008. Since average prices did not move linearly over these years (relative to 2006 they rose slightly in 2007 and fell in 2008), the more flexible dummy specification captures more of the variation in sale price.
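A quick way to see this (an illustrative sketch; not from the original notebook):

# Average log sale price by year: the pattern is not linear in time,
# so year dummies (Model 6) fit better than a single linear trend (Model 5)
print(np.log(housing['v_SalePrice']).groupby(housing['v_Yr_Sold']).mean())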

Question 9:

The variables I included were: v_Neighborhood, v_MS_SubClass, v_Low_Qual_Fin_SF, v_Fireplaces, and v_Year_Built.

Question 10:

The R2 of my Model 7 is 0.684.
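This can be pulled directly from the fitted model (a sketch; not in the original notebook):

# R-squared of Model 7
print(f"{reg7.rsquared:.3f}")  # 0.684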

Question 11:

Model 6 only compares average sale prices in 2007 and 2008 to those in the omitted year, 2006. Since this model is specific to those years, I don't think it could be used in a predictive regression. Additionally, the estimates could be somewhat irregular due to the 2008 housing crisis, which is another reason it wouldn't be good to use for prediction.

Question 12:

Model 5 captures the general relationship between sale price and year sold, so I think you could use it for a predictive regression, but it's not ideal. It would definitely be better than Model 6, but it only takes into account three years, one of them being 2008, which could cause irregularities in the model due to the housing crisis. Ideally, a model trained on more years of data would be better for a predictive regression.