library('tidyverse') # Wrapper for many convenient libraries
library('modeldata') # Contains data for the assignment
library('skimr') # Quickly summarise data
library('gt') # Render "great tables"
library('leaflet') # Make interactive maps
library('glmnet') # Loading penalized regression library
library('rsample') # Easy sampling and splitting
library('corrplot') # Quick correlation plots
Penalized Regression | Sacramento Housing Data
Assignment Questions
Name
What is your name? Include all team members if submitting as a group.
Adam Bushman [u6049169]; no group members.
Perspective
From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)
I am a real estate broker, working with clients eager to invest in residential real estate. These clients want their capital to stretch as far as possible; said differently, they’re looking for good deals (highest value per dollar spent) on each property. My job as the broker is to identify high value properties and guide them to submitting competitive yet high ROI offers.
Question
What is your question?
My newest client is looking for properties in the Sacramento, CA market. I’m far less familiar with Sacramento, as I normally operate in the Bay Area (San Francisco, Oakland, etc). Therefore, I must rely on a more data-driven approach to finding high value properties.
Can I develop a solid predictive model for housing prices in the Sacramento area, where a delta in the predicted price and listed sales price may indicate potential value?
Data set
Describe your dataset(s) including URL (if available)
Data sourced from Posit via their {modeldata} package, accessed October 4th, 2024. The data was originally sourced from Insurity SpatialKey. Their original description of the data is as follows:
The Sacramento real estate transactions file is a list of 985 real estate transactions in the Sacramento area reported over a five-day period, as reported by the Sacramento Bee.
The data set includes attributes relative to location of the property and common specification measures.
Predictor(s) and target(s)
What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).
The dependent variable (target) natural to this situation is price. Because the client wants to find properties whose worth exceeds that of the sales price, this is the right choice. The independent variables (predictors) are city, zip, beds, baths, sqft, type, latitude, and longitude.
For a complete description of each feature and their data types, navigate to the data dictionary.
Model reasonability
How are your variables suitable for your analysis method?
The variables chosen above are the extent of the dataset (all features used). The analysis method chosen is a penalized regression elastic net model.
These variables and the analysis method are a natural pairing thanks to:
- Meeting the required assumptions
  - Linear relationship between predictors and target
  - Normally distributed residuals
  - Homoscedasticity of residuals (consistent variance)
- Presence of multicollinearity
  - Penalized regression resolves multicollinearity thanks to the applied penalty
- Feature reduction
  - While this dataset only has 8 predictors, the unique values of factor data types will expand the scope:
    - city has 37 unique values
    - zip has 68 unique values
    - type has 3 unique values
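These penalties come from the elastic net objective; in {glmnet}'s parameterization (shown here as a sketch, with \(\alpha\) mixing the ridge and lasso penalties):

\[
\min_{\beta_0,\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^\top\beta\right)^2 + \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]
\]

Setting \(\alpha = 0\) gives ridge regression and \(\alpha = 1\) gives the lasso (glmnet's default when \(\alpha\) is not supplied).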
Conclusions
What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?
Model effectiveness
A penalized regression model achieved the objective of a) leveraging a data set ill-suited for a traditional Ordinary Least Squares (OLS) regression model and b) achieving a smaller model with nearly the same performance.
Throughout the training workflow, the cross-validated Mean Squared Error (MSE) was found to be ~0.11, while the testing workflow resulted in an MSE of ~0.12.
High value properties
As a broker, the resulting predictions of price compared to the original “sales” price were a tremendous aid in pinpointing high value properties for the client. Given my lack of familiarity with the Sacramento, CA market, this proved an invaluable first step in a) learning the market and b) accelerating the process.
Navigate to a list of the top-10 high value properties uncovered by the model.
Touring the properties
As a broker, the next step would be to get the client to the properties. We’re already saving time and money by prioritizing properties that best align to the interest of the client. To best prepare for the tours, I would want to get a better sense for absolute and relative location.
Navigate to a map of high value properties, an ideal tool for expediting the process.
Assumptions
What are your assumptions and limitations? Did you include any robustness checks?
Assumptions made throughout the analysis
- Understood that this is a sample of housing prices
- Assumed that these are prices of “historically sold” properties in the area
- Assumed that the client and broker (myself) define “high-value” as a delta between predicted price and sales price
- Understood that “value” may exist in other areas, not just price (i.e. time to closing, offer terms, etc.)
Assignment Workflow
Analysis Prep
Loading packages
Loading the data
We’ll start off by referencing the “Sacramento Housing Data” for the assignment from the {modeldata} package.
sac_raw <- modeldata::Sacramento # Data for the assignment
With it loaded into the session, let’s get a sense for what we’re working with.
Data set inspection
Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.
Posit, the company behind {modeldata}, did not include a data dictionary; however, the variables are sufficiently self-explanatory. Below is a derived data dictionary:
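One way to assemble sac_data_dict is a hand-built tibble (a sketch; the exact construction used here is an assumption):

```r
sac_data_dict <- tibble::tribble( # Row-wise tibble: one row per variable
  ~variable,   ~datatype,   ~description,
  "City",      "<factor>",  "City/community/suburb of Sacramento, CA",
  "Zip",       "<factor>",  "Zip code of Sacramento, CA",
  "Beds",      "<integer>", "Number of bedrooms",
  "Baths",     "<double>",  "Number of bathrooms",
  "SqFt",      "<integer>", "Square footage of the property",
  "Type",      "<factor>",  "Type of property",
  "Price",     "<integer>", "Sale price of the property",
  "Latitude",  "<double>",  "Latitude coordinate value",
  "Longitude", "<double>",  "Longitude coordinate value"
)
```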
gt(sac_data_dict) %>% # Create a "great tables" (gt) object
cols_label( # Rename some columns
variable = "Variable Name",
datatype = "Data Type",
description = "Variable Description"
) %>%
tab_options(
column_labels.background.color = "#d9230f", # Apply red header background
column_labels.font.weight = "bold", # Bold headers
row.striping.background = '#FFFFFF' # Remove row striping
)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| City | <factor> | City/community/suburb of Sacramento, CA |
| Zip | <factor> | Zip code of Sacramento, CA |
| Beds | <integer> | Number of bedrooms |
| Baths | <double> | Number of bathrooms |
| SqFt | <integer> | Square footage of the property |
| Type | <factor> | Type of property |
| Price | <integer> | Sale price of the property |
| Latitude | <double> | Latitude coordinate value |
| Longitude | <double> | Longitude coordinate value |
Using the {skimr} package, we can get a comprehensive summary of the data.
skim(sac_raw)
| Name | sac_raw |
| Number of rows | 932 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| city | 0 | 1 | FALSE | 37 | SAC: 438, ELK: 114, ROS: 48, CIT: 35 |
| zip | 0 | 1 | FALSE | 68 | z95: 61, z95: 45, z95: 44, z95: 37 |
| type | 0 | 1 | FALSE | 3 | Res: 866, Con: 53, Mul: 13 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| beds | 0 | 1 | 3.28 | 0.89 | 1.00 | 3.00 | 3.00 | 4.00 | 8.00 | ▃▇▆▁▁ |
| baths | 0 | 1 | 2.05 | 0.72 | 1.00 | 2.00 | 2.00 | 2.00 | 5.00 | ▂▇▂▁▁ |
| sqft | 0 | 1 | 1680.32 | 726.27 | 484.00 | 1166.75 | 1470.00 | 1953.50 | 4878.00 | ▇▇▂▁▁ |
| price | 0 | 1 | 246661.58 | 131126.86 | 30000.00 | 156000.00 | 220000.00 | 305000.00 | 884790.00 | ▇▇▂▁▁ |
| latitude | 0 | 1 | 38.59 | 0.13 | 38.24 | 38.48 | 38.62 | 38.69 | 39.02 | ▁▆▇▂▁ |
| longitude | 0 | 1 | -121.36 | 0.14 | -121.55 | -121.45 | -121.38 | -121.31 | -120.60 | ▇▅▁▁▁ |
Initial observations include:
- With nearly 1K rows, we have a good sample size to work with
- We have no missing values; this means we won’t have to eliminate records, features, or determine the best way to impute values
- We do have some skewed distributions, most with long right tails; we’d likely benefit from some logarithmic transformations
- The categorical/factor variables city and zip naturally have many distinct values; this could very well be where penalized regression shines in feature reduction
- Latitude and longitude aren’t variables for which we’d expect a one-unit increase/decrease to have an underlying, consistent relationship with price
Simple Exploratory Data Analysis
Multicollinearity
One of the primary problems solved by penalized regression is multicollinearity (predictors that are highly correlated to each other). Let’s generate a correlation matrix for the numeric predictors:
sac_corr <-
sac_raw %>%
select(where(is.numeric)) %>% # Select numeric variables
cor() # Run pearson correlation
corrplot.mixed(sac_corr, order = 'AOE') # Correlation plot
Understandably, the non-latitude/longitude variables are highly correlated. The more square feet a home has, the higher the price, since cost is directly tied to size. Similar intuition applies to baths and beds.
Fortunately, penalized regression will handle this without a problem.
It’s important to check other assumptions, namely:
- Linear relationship between each predictor and outcome
- Normally distributed residuals
- Homoscedasticity of residuals
We can check all three of these by running a basic linear model and generating plots therefrom:
test_mod <- lm(price ~ ., data = sac_raw) # Test linear model
Linear relationship & Normally distributed residuals
Because we’re using a parametric, linear test with penalized regression, we need some semblance of a linear relationship. The below plots give enough indication there are linear relationships worth capturing in a model.
plot(sac_raw) # Scatter plots
Additionally, we need the residuals to approximate the normal distribution. The below plot indicates that we are sufficiently achieving both. There is more deviation from the line on the right tail, which we’ve already pointed out can be mitigated with a logarithmic transform where appropriate.
plot(test_mod, which = 2) # Normally distributed residuals visualization
Homoscedasticity of residuals
We need the residuals to be fairly consistent in their distribution across fitted values. We don’t want to be seeing a cone shape, where the distribution is high on one end and narrow on the other. The below plot shows pretty consistent distribution of points along the y axis for all the values on the x axis.
plot(test_mod, which = 3) # Homoscedasticity of residuals visualization
Preprocessing
Data cleaning / Feature Engineering
The data set is already complete and nicely formatted. Little, if anything, is needed on this front. We’ve mentioned previously that two variables in particular looked to have right-skewed distributions: price and sqft. Let’s take a peek at these for confirmation.
First, we’ll alter the data to a long format (this will make it easier to plot).
sac_alt <-
sac_raw %>% # Use the raw data
select(price, sqft) %>% # Select only price, sqft
pivot_longer( # Pivot to long format
cols = everything(),
names_to = "feature",
values_to = "value"
)
Now we’ll create a function for plotting a “raincloud” visualization:
plot_raincloud <- function(data, val) {
ggplot(data, aes(x = "1", y = {{ val }})) + # Setup ggplot object
ggdist::stat_halfeye( # Distribution plot
adjust = 0.5,
width = 0.6,
.width = 0,
justification = -0.2,
point_color = NA
) +
geom_boxplot( # Boxplot
width = 0.15,
outlier.shape = NA
) +
gghalves::geom_half_point( # Individual observations
side = "l",
range_scale = 0.4,
alpha = 0.2
) +
facet_wrap(~feature, nrow = 1, scales = "free") + # Two visualizations, side-by-side
theme_minimal() + # Theme styling
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
panel.grid.major.x = element_blank(),
strip.background = element_rect(fill = "#BE0000", color = NA),
strip.text = element_text(color = "white", face = "bold")
)
}
plot_raincloud(sac_alt, value) # Generate a raincloud plot
We do see the long, skewed tails. While not egregious, ideally we’d see a more normal distribution. Would a logarithmic transform help?
sac_alt <- sac_alt %>% mutate(value_log = log(value)) # Add a logarithmic transform
plot_raincloud(sac_alt, value_log) # Generate another raincloud plot
That’s much better! Let’s end by creating a clean version of our data, featuring the logarithmic transforms:
sac_clean <-
sac_raw %>% # Use the original data
mutate( # Create natural log versions of the variables
sqft_log = log(sqft),
price_log = log(price)
) %>%
select(-c(sqft, price)) # Remove non-log versions
Using these logged versions will make it more difficult to interpret the model. However, the scenario prompting this analytical objective is mostly around finding potential value among investment properties. Therefore, our goal is less about interpreting the effects of predictors on price and instead on the gap in predicted and actual price (a signal of potential value).
We need only remember to convert the logged predictions back to their normal scale for use in communicating to the client.
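As a quick sketch of that conversion (pred_log here is a hypothetical vector of logged predictions):

```r
pred_log <- c(11.68, 11.98, 12.01) # Hypothetical logged predictions
pred_usd <- exp(pred_log)          # exp() undoes the natural log, returning dollar values
```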
Model resources
Splitting for training/testing sets
As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.
set.seed(819) # Set reproducible seed
split_obj <- initial_split(sac_clean, prop = 0.75) # Split object
sac_train <- training(split_obj) # Split for training
sac_test <- testing(split_obj) # Split for testing
Training and tuning
Fit a model
What we want to do is use cv.glmnet and specify the cross-validation therein.
nFolds <- 10
foldid <- sample(rep(seq(nFolds), length.out = nrow(sac_train))) # Randomize folds
model_fit <- cv.glmnet( # Run elastic net
x = sac_train %>% select(-price_log) %>% data.matrix(), # Predictors
y = sac_train$price_log, # Target
family = "gaussian", # Regression specification
type.measure = "mse", # Measure of interest
nfolds = nFolds, # Number of cross-validation folds
foldid = foldid # Reproducible folds
)
Output results shared below may differ slightly from the descriptions thereafter due to the random fold assignment in cv.glmnet.
model_fit
Call: cv.glmnet(x = sac_train %>% select(-price_log) %>% data.matrix(), y = sac_train$price_log, type.measure = "mse", nfolds = nFolds, foldid = foldid, family = "gaussian")
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.00078 68 0.1016 0.008399 8
1se 0.05164 23 0.1099 0.007299 3
The fitted model description gives us some good info:
- We see two values of \(\lambda\): one corresponding to the minimum (min) mean squared error and the other (1se) corresponding to within one standard error of the minimum
- The index shows how far along the lambda sequence each value was achieved
- Measure is our mean squared error (MSE)
- SE is the standard error
- Nonzero tells us how many features/predictors remained in the model given the chosen lambda
It’s interesting to note that a nearly traditional OLS model (tiny lambda) gave the lowest MSE, but a simpler model (thanks to a higher lambda) is nearly as good at a ~0.11 MSE.
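Both \(\lambda\) values can be pulled straight off the fit object if needed downstream (a sketch against cv.glmnet's documented return values):

```r
model_fit$lambda.min # Lambda achieving the minimum cross-validated MSE
model_fit$lambda.1se # Largest lambda within one standard error of that minimum
```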
Evaluating tuning
Let’s see if we can’t plot these results:
plot(model_fit)
min is indicated by the far left vertical line while 1se is indicated by the far right vertical line.
We’re also able to look at the coefficients:
coef(model_fit, s = "lambda.1se")
9 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 35.7815602609
city .
zip -0.0005237118
beds .
baths .
type .
latitude .
longitude 0.2458808877
sqft_log 0.8659262743
At first blush, sqft_log and zip make a lot of sense given size and location are some of the most natural drivers of price. The absence of type and inclusion of longitude is curious.
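For comparison, the lambda.min fit retains every predictor; viewing the two sets of coefficients side by side is a one-liner each (a sketch):

```r
coef(model_fit, s = "lambda.min") # Near-OLS fit: all predictors nonzero
coef(model_fit, s = "lambda.1se") # Simpler fit: only zip, longitude, sqft_log survive
```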
Final model
Predict on testing
The next step is to take the previous model (1se), trained on the entire training data, and use it to predict testing data values. Why 1se and not min? We achieve a far simpler model with little impact to MSE.
Let’s setup the prediction:
sac_pred <- predict( # Make predictions
model_fit, # Original fit from above
newx = sac_test %>% # Using the testing data we haven't worked with yet
select(-price_log) %>%
data.matrix(),
s = "lambda.1se" # Use the penalty within 1 standard error
)
head(sac_pred)
   lambda.1se
[1,] 11.68184
[2,] 11.97621
[3,] 12.00757
[4,] 11.76178
[5,] 11.99953
[6,] 12.14830
We now want to calculate the mean squared error.
Evaluating metrics
Ideally, we would be in the neighborhood of the values estimated from our cross-validation. Let’s compile the predictions with the original values and do the necessary transformations.
sac_test_r <-
sac_test %>%
mutate( # Transform data back into natural scale
.pred = sac_pred[,1], # Predicted value (log)
.pred_n = exp(.pred), # Predicted value (natural)
sqft = exp(sqft_log), # Sqft (natural)
price = exp(price_log), # Sales price (natural)
diff = .pred_n - price # Difference in price (natural)
)
mean((sac_test_r$.pred - sac_test_r$price_log)^2) # Calculate mean squared error on log scale
[1] 0.123031
We are, in fact, achieving about the same MSE with the testing data (0.12) as we did with the training (0.11).
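Because the client thinks in dollars, it can also help to report error on the natural scale; a sketch using the back-transformed columns created above:

```r
rmse_usd <- sqrt(mean((sac_test_r$.pred_n - sac_test_r$price)^2)) # RMSE in dollars
mae_usd  <- mean(abs(sac_test_r$.pred_n - sac_test_r$price))      # MAE in dollars
```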
Results
High value properties
The whole goal of this modeling exercise was to find high value properties, where our predicted price exceeds the sales price.
Let’s generate a table of the top 10 properties in modeled value:
sac_test_r %>% # Using the test data
arrange(desc(diff)) %>% # Sort by the difference
select(city:type, sqft, price, .pred_n, diff) %>% # Select relevant columns
head(10) %>% # Top-10 property values
gt() %>% # Create a "great tables" object
cols_label( # Add column labels
city = "City",
zip = "Zip Code",
beds = "Bedroom No",
baths = "Bathroom No",
type = "Property Type",
sqft = "Square Feet",
price = "Sales Price",
.pred_n = "Predicted Price",
diff = "Modeled Value"
) %>%
fmt_number( # Format sqft for decimals
columns = c(sqft),
decimals = 0
) %>%
fmt_currency( # Format pricing data for currency
columns = c(price, .pred_n, diff),
decimals = 0,
suffixing = TRUE
) %>%
tab_options( # Format the column headers
column_labels.background.color = "#BE0000"
)
| City | Zip Code | Bedroom No | Bathroom No | Property Type | Square Feet | Sales Price | Predicted Price | Modeled Value |
|---|---|---|---|---|---|---|---|---|
| SACRAMENTO | z95828 | 4 | 2.0 | Residential | 1,512 | $57K | $208K | $151K |
| SACRAMENTO | z95828 | 4 | 4.0 | Multi_Family | 1,995 | $120K | $263K | $143K |
| PLACERVILLE | z95667 | 2 | 2.0 | Residential | 2,750 | $270K | $409K | $139K |
| RIO_LINDA | z95673 | 1 | 2.0 | Residential | 1,000 | $30K | $145K | $115K |
| SACRAMENTO | z95828 | 3 | 2.0 | Residential | 2,161 | $170K | $284K | $114K |
| SACRAMENTO | z95838 | 4 | 3.0 | Residential | 1,890 | $138K | $248K | $110K |
| ELK_GROVE | z95758 | 5 | 3.0 | Residential | 3,468 | $320K | $425K | $105K |
| ANTELOPE | z95843 | 4 | 3.0 | Residential | 2,652 | $240K | $336K | $97K |
| GALT | z95632 | 5 | 3.5 | Residential | 3,499 | $355K | $450K | $95K |
| ELK_GROVE | z95758 | 5 | 3.0 | Residential | 2,790 | $258K | $352K | $94K |
In the case of the very first property, our model suggests the property has a median worth of $208K, yet it was sold at $57K. It’s a 1,512 square foot property with 4 bedrooms and 2 baths; at first blush, that does seem like great value.
Plotting high value properties
A common next step would be to map the high value properties. We can do that with the {leaflet} package:
sac_map <-
leaflet(height = 800, width = 800) %>% # Create a leaflet map object
addTiles() %>% # Setup tile layer
setView(lng = -121.478851, lat = 38.575764, zoom = 10) # Localize zoom to Sacramento, CA
sac_map <-
sac_map %>%
addProviderTiles("CartoDB.Positron") %>% # Use a grayscale map theme
addCircleMarkers( # Add a circle point for every property
data = sac_test_r %>% arrange(desc(diff)) %>% head(30), # Top-30, high value properties
lng = ~longitude, # Mapped values
lat = ~latitude, # Mapped values
radius = 6,
color = "#BE0000",
fillOpacity = 0.5
)
sac_map