In this post, we present helpful tools for approaching a large dataset for analysis, as well as ways to present your findings once you’ve finished. In particular, we will introduce the DataExplorer and stargazer packages for EDA and presentation respectively, applied to the House Prices: Advanced Regression Techniques dataset found on Kaggle. For additional tips on how to approach analyzing and visualizing a new dataset, see my previous posts on general steps for analysis and interactive visualization in R. If you are interested in DataFest specific tips as well, view this presentation created for Undergraduate Researchers Interested in Data on our Github page

First Steps

First we load relevant libraries and read in the data, then check the structure of it to determine variable types and dimensions.

library(DataExplorer)
library(tidyverse)

house <- read.csv("C:/Users/Evan/Downloads/train.csv")
#check structure of data
glimpse(house)
## Observations: 1,460
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,...
## $ MSZoning      <fctr> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL,...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, ...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10...
## $ Street        <fctr> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave,...
## $ Alley         <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ LotShape      <fctr> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Re...
## $ LandContour   <fctr> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lv...
## $ Utilities     <fctr> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub,...
## $ LotConfig     <fctr> Inside, FR2, Inside, Corner, FR2, Inside, Insid...
## $ LandSlope     <fctr> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gt...
## $ Neighborhood  <fctr> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mi...
## $ Condition1    <fctr> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN...
## $ Condition2    <fctr> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm,...
## $ BldgType      <fctr> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam,...
## $ HouseStyle    <fctr> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin,...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, ...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, ...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, ...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, ...
## $ RoofStyle     <fctr> Gable, Gable, Gable, Gable, Gable, Gable, Gable...
## $ RoofMatl      <fctr> CompShg, CompShg, CompShg, CompShg, CompShg, Co...
## $ Exterior1st   <fctr> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vi...
## $ Exterior2nd   <fctr> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vi...
## $ MasVnrType    <fctr> BrkFace, None, BrkFace, None, BrkFace, None, St...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, ...
## $ ExterQual     <fctr> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex,...
## $ ExterCond     <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA,...
## $ Foundation    <fctr> PConc, CBlock, PConc, BrkTil, PConc, Wood, PCon...
## $ BsmtQual      <fctr> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex,...
## $ BsmtCond      <fctr> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA,...
## $ BsmtExposure  <fctr> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No,...
## $ BsmtFinType1  <fctr> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GL...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,...
## $ BsmtFinType2  <fctr> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Un...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,...
## $ Heating       <fctr> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA,...
## $ HeatingQC     <fctr> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex,...
## $ CentralAir    <fctr> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ Electrical    <fctr> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, ...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, ...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, ...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, ...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, ...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, ...
## $ KitchenQual   <fctr> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex,...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,...
## $ Functional    <fctr> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, T...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, ...
## $ FireplaceQu   <fctr> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd,...
## $ GarageType    <fctr> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd,...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, ...
## $ GarageFinish  <fctr> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RF...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, ...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205...
## $ GarageQual    <fctr> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA,...
## $ GarageCond    <fctr> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA,...
## $ PavedDrive    <fctr> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, ...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ PoolQC        <fctr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ Fence         <fctr> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, ...
## $ MiscFeature   <fctr> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA,...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, ...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, ...
## $ SaleType      <fctr> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New...
## $ SaleCondition <fctr> Normal, Normal, Normal, Abnorml, Normal, Normal...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, ...

DataExplorer / EDA

Using DataExplorer we can easily check for missing values within the dataset, as well as plot both discrete and continuous variables to get a sense of their distributions at a glance.

#check for missing values
plot_missing(house)

#plot discrete variables
plot_bar(house)

#plot continuous variables
plot_histogram(house)

We can also create correlation matrices based on discrete and continuous values to see the relation of variables to each other, and particularly the relationships between the response variable SalePrice and others.

plot_correlation(house, type="continuous")

Be aware that with large datasets like this one, it can be easy to create overwhelming looking plots that don’t give much information, so subset the data to only include hypothesized variables of interest first.

plot_correlation(house, type="discrete")

Once missing values are identified, they can also be easily replaced, which could be preferable to simply removing those rows with NAs in some situations. The following commands require your data to first be converted into a data table before executing them.

library(data.table)
house.dt <- data.table(house)
set_missing(house.dt, list(0L, "unknown"))#replaces continuous with 0, factor with "unknown")
plot_bar(house.dt$PoolQC)

For factors with many levels, but sparse categories among many of them, we can also group the sparse categories into a single value to focus more heavily on the more represented values of interest.

plot_bar(house.dt$Exterior1st) #without grouping

group_category(data = house.dt, feature = "Exterior1st", 
  threshold = 0.2, update = TRUE)#set update to TRUE for changes to reflect in original data
plot_bar(house.dt$Exterior1st)

We can also knit an html document containing many of the summary statistics generated previously using only a signel line of code - very helpful!

create_report(house)

Read more about DataExplorer and its functionality here.

Stargazer / Presenting

Stargazer is an R package that creates LaTeX code from your tables, making them look very presentable in a paper or presentation about your analysis. In this example, we group houses into bins of low, medium, and high prices, then output summary statistics about each category using existing variables in the dataset, as well as new ones created ourselves. To use these tables, take the code outputted from the stargazer() function and put it into your RMarkdown file as a text chunk.

library(stargazer)
my_table = house %>%
  mutate(price_cat = cut(SalePrice, breaks = c(0, 100000, 250000, Inf),
            labels = c("low", "medium","high")),
  num_bathrm = BsmtFullBath + BsmtHalfBath + FullBath + HalfBath) %>% 
  group_by(price_cat) %>% 
  summarise(avg_area = mean(LotArea), avg_qual = mean(OverallQual), 
            avg_cond = mean(OverallCond), avg_bath = mean(num_bathrm),
            avg_yearsold = mean(YrSold), avg_price = mean(SalePrice)) %>% 
  data.frame() #stargazer doesn't play nice with dplyr's tbl_df
stargazer(my_table, summary=F, type="html", title="Housing Summaries by Price")
Housing Summaries by Price
price_cat avg_area avg_qual avg_cond avg_bath avg_yearsold avg_price
1 low 7,110.350 4.398 4.992 1.431 2,008.008 82,527.300
2 medium 9,932.099 5.907 5.682 2.376 2,007.805 163,453.400
3 high 15,465.650 8.055 5.355 3.281 2,007.760 326,849.300

stargazer can display the outputs of model summaries as well! We create a linear model using a few features identified as being strongly correlated with housing price and acheive an R^2 value of .75 using only a few features of the dataset.

house.mod <- house %>% 
  select(SalePrice, OverallQual, YearBuilt, GrLivArea, FullBath, GarageCars)
  
mod <- lm(SalePrice~., house.mod)
stargazer(mod, title = "Linear Regression Model Summary", type="html")
Linear Regression Model Summary
Dependent variable:
SalePrice
OverallQual 23,025.430***
(1,170.602)
YearBuilt 420.344***
(48.593)
GrLivArea 61.631***
(3.011)
FullBath -9,171.480***
(2,719.325)
GarageCars 16,272.200***
(1,879.780)
Constant -895,928.800***
(92,293.480)
Observations 1,460
R2 0.752
Adjusted R2 0.751
Residual Std. Error 39,640.010 (df = 1454)
F Statistic 881.189*** (df = 5; 1454)
Note: p<0.1; p<0.05; p<0.01