Penalized Regression | Sacramento Housing Data

Leveraging penalized regression in R to predict housing prices in Sacramento, CA

Author

Adam Bushman

Published

October 8, 2024

Assignment Questions

Name

What is your name? Include all team members if submitting as a group.

Adam Bushman [u6049169]; no group members.

Perspective

From what perspective ar you conducting the analysis? (Who are you? / Who are you working for?)

I am a real estate broker, working with clients eager to invest in residential real estate. These clients want their capital to stretch as far as possible; said differently, they’re looking for good deals (highest value per dollar spent) on each property. My job as the broker is to identify high value properties and guide them to submitting competitive yet high ROI offers.

Question

What is your question?

My newest client is looking for properties in the Sacramento, CA market. I’m far less familiar with Sacramento, as I normally operate in the Bay Area (San Francisco, Oakland, etc). Therefore, I must rely on a more data-driven approach to finding high value properties.

Can I develop a solid predictive model for housing prices in the Sacramento area, where a delta in the predicted price and listed sales price may indicate potential value?

Data set

Describe your dataset(s) including URL (if available)

Data sourced from Posit via their {modeldata} package, accessed October 4th, 2024. The data was originally sourced from Insurity Spacial Key. Their original description of the datais as follows:

The Sacramento real estate transactions file is a list of 985 real estate transactions in the Sacramento area reported over a five-day period, as reported by the Sacramento Bee.

The data set includes attributes relative to location of the property and common specification measures.

Predictor(s) and target(s)

What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).

The dependent variable (target) natural to this situation is price. Because the client wants to find properties whose worth exceeds that of the sales price, this is the right choice. The independent variables (predictors) are city, zip, beds, baths, sqft, type, latitude, and longitude.

For a complete description of each feature and their data types, navigate to the data dictionary.

Model resonability

How are your variables suitable for your analysis method?

The variables chosen above are the extent of the dataset (all features used). The analysis method chosen is a penalized regression elastic net model.

These variables and the analysis method are a natural pairing thanks to:

Meeting the required assumptions
- Linear relationship between predictors and target
- Normally distributed residuals
- Homoscedasticity of residuals (balanced values)
Presence of multicollinearity
- Penalized regression resolves multicollinearity thanks to the applied penalty
Feature reduction
- While this dataset only has 8 predictors, the unique values of factor data types will expand the scope
  - city has 37 unique values
  - zip has 68 unique values
  - type has 3 unique values

Conclusions

What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

Model effectiveness

A penalized regression model achieved the objective of a) leveraging a data set ill-suited for a traditual Ordinary Least Squares (OLS) regression model and b) achiving a smaller model with nearly the same performance.

Throughout the training workflow, the measurement for Mean Squared Error was found to be ~0.11 while the testing workflow resulted in an MSE of ~0.12.

High value properties

As a broker, the resulting predicitons of price compared to the original “sales” price were a tremendous aid in pinpointing high value properties for the client. Given my lack of familiarity with the Sacramento, CA market, this proved an invaluable first step in a) learning the market and b) accelerating the process.

Navigate to a list of the top-10 high value properties uncovered by the model.

Touring the properties

As a broker, the next step would be to get the client to the properties. We’re already saving time and money by prioritizing properties that best align to the interest of the client. To best prepare for the tours, I would want to get a better sense for absolute and relative location.

Navigate to a map of high value properties, an ideal tool for expediting the process.

Assumptions

What are your assumptions and limitations? Did you include any robustness checks?

Assumptions made throughout the analysis

Understood that this is a sample of housing prices
Assumed that these are prices of “historically sold” properties in the area
Assumed that the client and broker (myself) define “high-value” as a delta between predicted price and sales price
Understood that “value” may exist in other areas, not just price (i.e. time to closing, offer terms, etc.)

Assignment Workflow

Analysis Prep

Loading packages

library('tidyverse')        # Wrapper for many convenient libraries
library('modeldata')        # Contains data for the assignment
library('skimr')            # Quickly summarise data
library('gt')               # Render "great tables"
library('leaflet')          # Make interactive maps

library('glmnet')           # Loading penalized regression library
library('rsample')          # Easy sampling and splitting
library('corrplot')         # Quick correlation plots

Loading the data

We’ll start off by referencing the “Sacramento Housing Data” for the assignment from the {modeldata} package.

sac_raw <- modeldata::Sacramento        # Data for the assignment

With it loaded into the session, let’s get a sense for what we’re working with.

Data set inspection

Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.

Posit, the company behind {modeldata}, did not include a data dictionary; however, the variables are sufficiently self-explanatory. Below is a derived data dictionary:

gt(sac_data_dict) %>%                                   # Create a "great tables" (gt) object
    cols_label(                                         # Rename some columns
        variable = "Variable Name", 
        datatype = "Data Type", 
        description = "Variable Description"
    ) %>%
    tab_options(
        column_labels.background.color = "#d9230f",     # Apply red header background
        column_labels.font.weight = "bold",             # Bold headers
        row.striping.background = '#FFFFFF'             # Remove row striping
    )

Variable Name	Data Type	Variable Description
City	<factor>	City/community/suburb of Sacramento, CA
Zip	<factor>	Zip code of Sacramento, CA
Beds	<integer>	Number of bedrooms
Baths	<double>	Number of bathrooms
SqFt	<integer>	Square footage of the property
Type	<factor>	Type of property
Price	<factor>	Sale price of the property
Latitude	<double>	Latitude coordinate value
Longitude	<double>	Longitude coordinate value

Using the {skimr} package, we can get a comprehensive summary of the data.

skim(sac_raw)

Data summary
Name	sac_raw
Number of rows	932
Number of columns	9
_______________________
Column type frequency:
factor	3
numeric	6
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
city	1	FALSE	37	SAC: 438, ELK: 114, ROS: 48, CIT: 35
zip	1	FALSE	68	z95: 61, z95: 45, z95: 44, z95: 37
type	1	FALSE	3	Res: 866, Con: 53, Mul: 13

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
beds	1	3.28	0.89	1.00	3.00	3.00	4.00	8.00	▃▇▆▁▁
baths	1	2.05	0.72	1.00	2.00	2.00	2.00	5.00	▂▇▂▁▁
sqft	1	1680.32	726.27	484.00	1166.75	1470.00	1953.50	4878.00	▇▇▂▁▁
price	1	246661.58	131126.86	30000.00	156000.00	220000.00	305000.00	884790.00	▇▇▂▁▁
latitude	1	38.59	0.13	38.24	38.48	38.62	38.69	39.02	▁▆▇▂▁
longitude	1	-121.36	0.14	-121.55	-121.45	-121.38	-121.31	-120.60	▇▅▁▁▁

Initial observations include:

With nearly 1K rows, we have a good sample size to work with
We have no missing values; this means we won’t have to eliminate records, features, or determine the best way to impute values
We so have some skewed distributions, most long, right tails; we’d likely benefit from some logarithmic transformations
The categorical/factor variables city and zip naturally have many distinct values. This could very well be where penalized regression shines in feature reduction
Latitude and longitude aren’t variables we’d expect a 1 unit increase/decrease to have some underlying, consistent relationship with price

Simple Exploratory Data Analysis

Multicollinearity

One of the primary problems solved by penalized regression is multicollinearity (predictors that are highly correlated to each other). Let’s generate a correlation matrix for the numeric predictors:

sac_corr <- 
    sac_raw %>%
    select(where(is.numeric)) %>%               # Select numeric variables
    cor()                                       # Run pearson correlation

corrplot.mixed(sac_corr, order = 'AOE')         # Correlation plot

Understandably, the non-latitude/longitude variables are highly correlated. The more square feet of a home, naturally the price will go up since its cost is directly associated with the size. Similar intuition applies to baths and beds.

Fortunately, penalized regression will handle this without a problem.

It’s important to check other assumptions, namely:

Linear relationship between each predictor and outcome
Normally distributed residuals
Homoscedasticity of residuals

We can check all three of these by running a basic linear model and generating plots therefrom:

test_mod <- lm(price ~ ., data = sac_raw)       # Test linear model

Linear relationship & Normally distributed residuals

Because we’re using a parametric, linear test with penalized regression, we need some semblance of a linear relationship. The below plots give enough indication there are linear relationships worth capturing in a model.

plot(sac_raw)       # Scatter plots

Additionally, we need the residuals to approximate the normal distribution. The below plot indicates that we are sufficiently achieving both. There is more deviation from the line on the right tail, which we’ve already pointed out can be mitigated with a logarithmic transform where appropriate.

plot(test_mod, which = 2)           # Normally distributed residuals visualization

Homoscedasticity of residuals

We need the residuals to be fairly consistent in their distribution across fitted values. We don’t want to be seeing a cone shape, where the distribution is high on one end and narrow on the other. The below plot shows pretty consistent distribution of points along the y axis for all the values on the x axis.

plot(test_mod, which = 3)           # Homoscedasticity of residuals visualization

Preprocessing

Data cleaning / Feature Engineering

The data set is already complete and nicely formatted. Little, if anything, is needed on this front. We’ve mentioned previously that two variables, in particular looked to have right-skewed distributions: price and sqft. Let’s take a peek at these for confirmation.

First, we’ve alter the data for a long format (will make it easier to plot).

sac_alt <- 
    sac_raw %>%                     # Use the raw data
    select(price, sqft) %>%         # Select only price, sqft
    pivot_longer(                   # Pivot to long format
        cols = everything(), 
        names_to = "feature", 
        values_to = "value"
    )

Now we’ll create a function for plotting a “raincloud” visualization:

plot_raincloud <- function(data, val) {
    ggplot(data, aes(x = "1", y = {{ val }})) +                                 # Setup ggplot object
    ggdist::stat_halfeye(                                                       # Distribution plot
        adjust = 0.5, 
        width = 0.6, 
        .width = 0, 
        justification = -0.2, 
        point_color = NA
    ) +
    geom_boxplot(                                                               # Boxplot
        width = 0.15, 
        outlier.shape = NA
    ) +
    gghalves::geom_half_point(                                                  # Individual observations
        side = "l", 
        range_scale = 0.4, 
        alpha = 0.2
    ) + 
    facet_wrap(~feature, nrow = 1, scales = "free") +                           # Two visualizations, side-by-side
    theme_minimal() +                                                           # Theme styling
    theme(
        axis.title.x = element_blank(), 
        axis.title.y = element_blank(), 
        axis.text.x = element_blank(), 
        panel.grid.major.x = element_blank(), 
        strip.background = element_rect(fill = "#BE0000", color = NA), 
        strip.text = element_text(color = "white", face = "bold")
    )
}

plot_raincloud(sac_alt, value)          # Generate a raincloud plot

We do see the long, skewed tails. While not egregious, ideally we’d see a more normal distribution Would a logarithmic transform help?

sac_alt <- sac_alt %>% mutate(value_log = log(value))           # Add a logarithmic transform

plot_raincloud(sac_alt, value_log)                              # Genearte another raincloud plot

That’s much better! Let’s end by creating a clean version of our data, featuring the logarithmic transforms:

sac_clean <-
    sac_raw %>%                                                 # Use the original data
    mutate(                                                     # Create natural log versions of the variables
        sqft_log = log(sqft), 
        price_log = log(price)
    ) %>%
    select(-c(sqft, price))                                     # Remove non-log versions

Using these logged versions will make it more difficult to interpret the model. However, the scenario prompting this analytical objective is mostly around finding potential value among investment properties. Therefore, our goal is less about interpreting the effects of predictors on price and instead on the gap in predicted and actual price (a signal of potential value).

We need only remember to convert the logged predictions back to their normal scale for use in communicating to the client.

Model resources

Splitting for training/testing sets

As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.

set.seed(819)                                                   # Set reproducible seed
split_obj <- initial_split(sac_clean, prop = 0.75)              # Split object

sac_train <- training(split_obj)                                # Split for training
sac_test <- testing(split_obj)                                  # Split for testing

Training and tuning

Fit a model

What we want to do is use cv.glmnet and specify the cross-validation therein.

nFolds <- 10
foldid <- sample(rep(seq(nFolds), length.out = nrow(sac_train)))        # Randomize folds

model_fit <- cv.glmnet(                                                 # Run elastic net
    x = sac_train %>% select(-price_log) %>% data.matrix(),             # Predictors
    y = sac_train$price_log,                                            # Target
    family = "gaussian",                                                # Regression specification
    type.measure = "mse",                                               # Measure of interest
    nfolds = nFolds,                                                    # Number of cross-validation folds
    foldid = foldid                                                     # Reproducible folds
)

Note

Output results shared below may differ slightly from the descriptions thereafter due to the nature of cv.glmnet.

model_fit


Call:  cv.glmnet(x = sac_train %>% select(-price_log) %>% data.matrix(),      y = sac_train$price_log, type.measure = "mse", nfolds = nFolds,      foldid = foldid, family = "gaussian") 

Measure: Mean-Squared Error 

     Lambda Index Measure       SE Nonzero
min 0.00078    68  0.1016 0.008399       8
1se 0.05164    23  0.1099 0.007299       3

The fitted model description gives us some good info:

We see two values of $\lambda$, one corresponding to the minimum (min) mean squared error and the other one (1se) corresponding to within one standard error of the minimum
The index shows low far along the process each value was achieved
Measure is our means squared error (MSE)
SE is the standard error
Nonzero tells us how many features/predictors remained in the model given the chosen lambda

It’s interesting to note that a near traditional OLS model gave the lowest MSE but a simpler model (thanks to a higher lambda) is nearly as good at a ~0.11 MSE.

Evaluating tuning

Let’s see if we can’t plot these results:

plot(model_fit)

min is indicated by the far left vertical line while 1se is indicated by the far right vertical line.

We’re also able to look at the coefficients:

coef(model_fit, s = "lambda.1se")

9 x 1 sparse Matrix of class "dgCMatrix"
                       s1
(Intercept) 35.7815602609
city         .           
zip         -0.0005237118
beds         .           
baths        .           
type         .           
latitude     .           
longitude    0.2458808877
sqft_log     0.8659262743

At first blush, sqft_log and zip make a lot of sense given size and location are some of the most natural drivers of price. The absense of type and inclusion of longitude is curious.

Final model

Predict on testing

The next step is to train the previous model (1se) with the entire training data and then use it to predict training data values. Why 1se and not min? We achieve a far simpler model with little impact to MSE.

Let’s setup the prediction:

sac_pred <- predict(                            # Make predictions
    model_fit,                                  # Original fit from above
    newx = sac_test %>%                         # Using the testing data we haven't worked with yet
        select(-price_log) %>% 
        data.matrix(),
    s = "lambda.1se"                            # Use the penalty within 1 standard error
)

head(sac_pred)

     lambda.1se
[1,]   11.68184
[2,]   11.97621
[3,]   12.00757
[4,]   11.76178
[5,]   11.99953
[6,]   12.14830

We now want to calculate the mean squared error.

Evaluating metrics

Ideally, we would be in the neighborhood of the values estimated from our cross-validation. Let’s compile the predictions with the original values and do the necessary transformations.

sac_test_r <-                                   
    sac_test %>%
    mutate(                                                     # Transform data back into natural scale
        .pred = sac_pred[,1],                                   # Predicted value (log)
        .pred_n = exp(.pred),                                   # Predicted value (natural)
        sqft = exp(sqft_log),                                   # Sqft (natural)
        price = exp(price_log),                                 # Sales price (natural)
        diff = .pred_n - price                                  # Difference in price (natural)
    )


mean((sac_test_r$.pred - sac_test_r$price_log)^2)               # Calculate mean squared error on log scale

[1] 0.123031

We are, in fact, achieving about the same MSE with the testing data (_{0.12) as we did with the training (}0.11).

Results

High value properties

The whole goal of this modeling exercise was to find high value properties, where our predicted price exceeds the sales price.

Let’s generate a table of the top 10 properties in modeled value:

sac_test_r %>%                                                          # Using the test data
    arrange(desc(diff)) %>%                                             # Sort by the difference
    select(city:type, sqft, price, .pred_n, diff) %>%                   # Select relevant columns
    head(10) %>%                                                        # Top-10 property values
    gt() %>%                                                            # Create a "great tables" object
    cols_label(                                                         # Add column labes
        city = "City", 
        zip = "Zip Code", 
        beds = "Bedroom No", 
        baths = "Bathroom No", 
        type = "Property Type", 
        sqft = "Square Feet", 
        price = "Sales Price", 
        .pred_n = "Predicted Price", 
        diff = "Modeled Value"
    ) %>%
    fmt_number(                                                         # Format sqft for decimals
        columns = c(sqft), 
        decimals = 0
    ) %>%
    fmt_currency(                                                       # Format pricing data for currency
        columns = c(price, .pred_n, diff), 
        decimals = 0, 
        suffixing = TRUE
    ) %>%
    tab_options(                                                        # Format the column headers
        column_labels.background.color = "#BE0000"
    )

City	Zip Code	Bedroom No	Bathroom No	Property Type	Square Feet	Sales Price	Predicted Price	Modeled Value
SACRAMENTO	z95828	4	2.0	Residential	1,512	$57K	$208K	$151K
SACRAMENTO	z95828	4	4.0	Multi_Family	1,995	$120K	$263K	$143K
PLACERVILLE	z95667	2	2.0	Residential	2,750	$270K	$409K	$139K
RIO_LINDA	z95673	1	2.0	Residential	1,000	$30K	$145K	$115K
SACRAMENTO	z95828	3	2.0	Residential	2,161	$170K	$284K	$114K
SACRAMENTO	z95838	4	3.0	Residential	1,890	$138K	$248K	$110K
ELK_GROVE	z95758	5	3.0	Residential	3,468	$320K	$425K	$105K
ANTELOPE	z95843	4	3.0	Residential	2,652	$240K	$336K	$97K
GALT	z95632	5	3.5	Residential	3,499	$355K	$450K	$95K
ELK_GROVE	z95758	5	3.0	Residential	2,790	$258K	$352K	$94K

In the case of the very first property, our model suggests the property has a median worth of $208K but was sold at $57K. 1,512 square foot property with 4 bedrooms and 2 baths; at first blush, that does seem like great value.

Plotting high value properties

A common next step would be to map the high value properties. We can do that with the {leaflet} package:

sac_map <-                                                              
    leaflet(height = 800, width = 800) %>%                              # Create a leaflet map object
    addTiles() %>%                                                      # Setup tile layer
    setView(lng = -121.478851, lat = 38.575764, zoom = 10)              # Localize zoom to Sacramento, CA

sac_map <- 
    sac_map %>%
    addProviderTiles("CartoDB.Positron") %>%                            # Use a grayscale map theme
    addCircleMarkers(                                                   # Add a circle point for every property
        data = sac_test_r %>% arrange(desc(diff)) %>% head(30),         # Top-30, high value properties
        lng = ~longitude,                                               # Mapped values
        lat = ~latitude,                                                # Mapped values
        radius = 6, 
        color = "#BE0000", 
        fillOpacity = 0.5
    )

sac_map