Support Vector Machine | Hotel Rates

Leveraging support vector machines in R to predict hotel rates

Author

Adam Bushman

Published

October 22, 2024

Assignment Questions

Name

What is your name? Include all team members if submitting as a group.

Adam Bushman [u6049169]; no group members.

Perspective

From what perspective ar you conducting the analysis? (Who are you? / Who are you working for?)

I am a data consultant for a hotel chain. They are interested in standardizing pricing strategies and have recruited me to help. They are hoping for a predictive model for nightly room rates. They want to test drive the approach using a single hotel they suspect has had some inconsistent pricing strategies.

Question

What is your question?

Can a predictive model be generated on historical nightly room rates such that 1) underlying patterns are accounted for and 2) unwanted variance is “smoothed” for better consistency across locations within the chain.

The goal would be to settle on a model that fairly explains the variance but isn’t sensitive enough to the suspected anomalies of single hotel location.

Data set

Describe your dataset(s) including URL (if available)

The data set is sourced from Posit via their {modeldata} package. It turns out the original source was a publication in Antonio, de Almeida, and Nunes (2019) and represents historical rates a single hotel (the “Resort Hotel”) in Lisbon, Portugal.

Predictor(s) and target(s)

What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).

Little was described about the data here in Posit’s reference. However, a comprehensive dictionary was found that wholistically described the variables.

The data set features 28 columns and just over 15.4K rows. There are 9 features of type factor, 18 of type numeric, and 1 of type date. A comprehensive table of variable names, data types, and descriptions can be found here.

The dependent variable (target) aligning with the use case is avg_price_per_room. The independent variables included all but arrival_date (much of the seasonality seen in dates is captured via other features, such as near_christmas and stays_in_weekend_nights).

Model resonability

How are your variables suitable for your analysis method?

The variables chosen are suitable for the analysis method of a Support Vector Machine thanks to the following:

Many features; while the data set only technically includes 28 columns, there are some categorical fields that contain a high number of unique values. While note explicitly performed
Generalization; SVM’s do a good job at avoiding overfitting provided the data is pre-processed properly

The model may not be the ultimate best choice given other considerations:

SVM for classification; generally, Support Vector Machines are used for classification problems but can be deployed in regression scenarios, though less popular
- While potentially unorthodox, I was interested in the viability of Support Vector Regression
Many records; using kernel tricks will be computationally intensive with the entire data set

Conclusions

What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

The SVM regression model predicted average nightly hotel rates/prices balancing results of \(RMSE\) and \(R^2\) in cross-validation tuning for cost and sigma on the training data.

Based on the tuning results visualized in this graph, a custom model was fitted to the training data with custom cost and sigma values.

The resulting model is adequate, with performance similar to what was found in cross-validated examples (see resulting \(RMSE\) and \(R^2\) figures here).

It is anticipated the hotel chain will be sufficiently encouraged with this first step and, upon review of the below assumptions, is prepared to invest additional time to improving the model.

Assumptions

What are your assumptions and limitations? Did you include any robustness checks?

SVM regression

I wanted to explore the viability of SVM’s in a regression setting. While traditionally applied to classification problems, they can be applied to regression situations and I wanted to explore that.

I have a sense now that it can be a fair alternative to the OLS family of regression should the data set fail to meet its rigorous assumptions. However, it does have a cost, namely

Computation (the models took a long time to run and I had to reduce the data set dimensions)
Iterpretability (there really is no way to peel back the curtain and evaluate individual relationships)
Performance (the results left some to be desired)
- Model evaluation via \(RMSE\) and \(R^2\)

Chosen values for cross-valdiated hyperparameters

I chose to use the following values for SVM hyperparameters

sigma: 0.01, 0.10, 0.50
cost: 0.05, 1.00, 3.00

In one sense, these are arbitrary; I had no rules of thumb, per se. However, I try to cover a reasonably wide range of possible values. The idea is to explore the performance of hyperparameter values across differing thresholds and methodically discover good ranges.

Not transforming target variable

Due to time and complexity, I opted to not center, scale, and transform the target variable. This certainly affected the quality of the predictions. With more time this is a critical next step to boosting performance.

Assignment Workflow

Analysis Prep

Loading packages

library('tidyverse')        # Wrapper for many convenient libraries
library('modeldata')        # Contains data for the assignment
library('skimr')            # Quickly summarise data
library('gt')               # Render "great tables"

library('e1071')           # Loading library for SVM
library('rsample')
library('caret')

Loading the data

We’ll start off by referencing the “Hotel Rates” data for the assignment that we’re sourcing from the {modeldata} package.

hotel_raw <- modeldata::hotel_rates        # Data for the assignment

With it loaded into the session, let’s get a sense for what we’re working with.

Data set inspection

Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.

Posit, the company behind {modeldata}, did not include a data dictionary; however, the variables are sufficiently self-explanatory. Below is a derived data dictionary:

gt(hotel_dict) %>%                                   # Create a "great tables" (gt) object
    cols_label(                                         # Rename some columns
        variable = "Variable Name", 
        datatype = "Data Type", 
        description = "Variable Description"
    ) %>%
    tab_options(
        column_labels.background.color = "#d9230f",     # Apply red header background
        column_labels.font.weight = "bold",             # Bold headers
        row.striping.background = '#FFFFFF'             # Remove row striping
    )

Variable Name	Data Type	Variable Description
avg_price_per_room	<double>	Sum of all lodging transactions divided by total number of staying nights
lead_time	<double>	Number of days that elapsed between the entering date of the booking into the PMS and the arrival date
stays_in_weekend_nights	<double>	Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
stays_in_week_nights	<double>	Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
adults	<double>	Number of adults
children	<double>	Number of children
babies	<double>	Number of babies
meal	<factor>	Type of meal booked
country	<factor>	Country of origin
market_segment	<factor>	Market segment designation: travel agents (TA), tour operators (TO), etc.
distribution_channel	<factor>	Booking distribution channel: travel agents (TA), tour operators (TO), etc.
is_repeated_guest	<double>	Value indicating if the booking name was from a repeated guest (1) or not (0)
previous_cancellations	<double>	Number of previous bookings that were cancelled by the customer prior to the current booking
previous_bookings_not_canceled	<double>	Number of previous bookings not cancelled by the customer prior to the current booking
reserved_room_type	<factor>	Code of room type reserved
assigned_room_type	<factor>	Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type
booking_changes	<double>	Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation
agent	<factor>	ID of the travel agency that made the booking
company	<factor>	ID of the company/entity that made the booking or responsible for paying the booking
days_in_waiting_list	<double>	Number of days the booking was in the waiting list before it was confirmed to the customer
customer_type	<factor>	Type of booking, assuming one of four categories: contract, group, transient, transient-party
required_car_parking_spaces	<double>	Number of car parking spaces required by the customer
total_of_special_requests	<double>	Number of special requests made by the customer (e.g. twin bed or high floor)
arrival_date	<date>
arrival_date_num	<double>	Date of arrival for hotel stay
near_christmas	<double>	Flag for hotel stays near christmas
near_new_years	<double>	Flag for hotel stays near new years
historical_adr	<double>	Historical average daily rate/price per room

Using the {skimr} package, we can get a comprehensive summary of the data.

skim(hotel_raw)

Data summary
Name	hotel_raw
Number of rows	15402
Number of columns	28
_______________________
Column type frequency:
Date	1
factor	9
numeric	18
________________________
Group variables	None

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
arrival_date	0	1	2016-07-02	2017-08-31	2017-02-05	426

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
meal	1	FALSE	4	bed: 11767, bre: 3047, no_: 480, bre: 108
country	1	FALSE	99	prt: 4869, gbr: 3465, esp: 1449, irl: 934
market_segment	1	FALSE	5	onl: 6742, dir: 3076, off: 2895, gro: 1789
distribution_channel	1	FALSE	3	ta_: 10710, dir: 3361, cor: 1331, und: 0
reserved_room_type	1	FALSE	8	a: 8571, d: 3058, e: 2046, f: 552
assigned_room_type	1	FALSE	9	a: 6046, d: 4216, e: 2274, c: 974
agent	1	FALSE	125	dev: 4659, not: 3443, ale: 1484, cha: 931
company	1	FALSE	167	not: 14012, par: 388, lin: 72, ber: 71
customer_type	1	FALSE	4	tra: 11001, tra: 3432, con: 807, gro: 162

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
avg_price_per_room	1	104.90	65.53	19.00	55.00	82.00	142.00	426.25	▇▃▂▁▁
lead_time	1	90.03	101.58	0.00	7.00	45.00	154.00	542.00	▇▂▁▁▁
stays_in_weekend_nights	1	1.19	1.16	0.00	0.00	1.00	2.00	19.00	▇▁▁▁▁
stays_in_week_nights	1	3.12	2.48	0.00	1.00	3.00	5.00	50.00	▇▁▁▁▁
adults	1	1.85	0.46	0.00	2.00	2.00	2.00	4.00	▁▂▇▁▁
children	1	0.12	0.42	0.00	0.00	0.00	0.00	3.00	▇▁▁▁▁
babies	1	0.02	0.13	0.00	0.00	0.00	0.00	2.00	▇▁▁▁▁
is_repeated_guest	1	0.07	0.25	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
previous_cancellations	1	0.01	0.13	0.00	0.00	0.00	0.00	5.00	▇▁▁▁▁
previous_bookings_not_canceled	1	0.24	1.37	0.00	0.00	0.00	0.00	30.00	▇▁▁▁▁
booking_changes	1	0.36	0.77	0.00	0.00	0.00	0.00	10.00	▇▁▁▁▁
days_in_waiting_list	1	0.47	7.13	0.00	0.00	0.00	0.00	185.00	▇▁▁▁▁
required_car_parking_spaces	1	0.20	0.41	0.00	0.00	0.00	0.00	8.00	▇▁▁▁▁
total_of_special_requests	1	0.76	0.87	0.00	0.00	1.00	1.00	5.00	▇▂▁▁▁
arrival_date_num	1	2017.08	0.33	2016.50	2016.80	2017.10	2017.37	2017.66	▇▇▇▇▇
near_christmas	1	0.01	0.08	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
near_new_years	1	0.01	0.10	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
historical_adr	1	86.34	39.35	41.96	52.27	71.35	116.23	167.49	▇▃▂▂▂

Initial observations include:

With over 15K rows, we may be on the brink of too much data suitable for a support vector machine model; we may want to randomly sample for this set
We have no missing values; this means we won’t have to eliminate records, features, or determine the best way to impute values
We do have some skewed distributions, most long, right tails; we’d likely benefit from some logarithmic transformations and natural scaling
An SVM model should handle many features well; we have some factor data types with many unique values
- country, agent, company

Simple Exploratory Data Analysis

Preprocessing

Data cleaning / Feature Engineering

As mentioned, we’d be wise to transform and scale numeric variables that are highly skewed to the right. We’ll do this for all but the target variable.

We can do this fairly easy with the following code:

hotel_clean <- 
    hotel_raw |>
    mutate(arrival_date_char = factor(as.character(arrival_date_num))) |>
    mutate(across(
        where(is.numeric) & !c(avg_price_per_room), 
        ~as.numeric(scale(log(. + 0.01 * max(.))))
    )) |>
    select(-arrival_date_num)

glimpse(hotel_clean)

Rows: 15,402
Columns: 28
$ avg_price_per_room             <dbl> 110.00, 74.00, 81.90, 81.00, 112.20, 90…
$ lead_time                      <dbl> 1.2443569, 1.3335593, 1.2648218, 1.2293…
$ stays_in_weekend_nights        <dbl> -1.2994717, 0.8451441, 0.8451441, 0.845…
$ stays_in_week_nights           <dbl> -0.9026065, 0.8961269, 0.8961269, 0.896…
$ adults                         <dbl> 0.3910441, 0.3910441, 0.3910441, 0.3910…
$ children                       <dbl> 2.9770184, -0.3066203, -0.3066203, -0.3…
$ babies                         <dbl> -0.1270955, -0.1270955, -0.1270955, -0.…
$ meal                           <fct> bed_and_breakfast, bed_and_breakfast, b…
$ country                        <fct> prt, aus, gbr, prt, gbr, null, prt, esp…
$ market_segment                 <fct> online_travel_agent, offline_travel_age…
$ distribution_channel           <fct> ta_to, ta_to, ta_to, ta_to, ta_to, ta_t…
$ is_repeated_guest              <dbl> -0.2653233, -0.2653233, -0.2653233, -0.…
$ previous_cancellations         <dbl> -0.07902931, -0.07902931, -0.07902931, …
$ previous_bookings_not_canceled <dbl> -0.2704553, -0.2704553, -0.2704553, -0.…
$ reserved_room_type             <fct> a, a, a, a, a, a, f, e, h, a, a, g, a, …
$ assigned_room_type             <fct> c, a, c, a, a, a, f, f, h, e, e, g, e, …
$ booking_changes                <dbl> -0.5519376, -0.5519376, 1.5205029, -0.5…
$ agent                          <fct> devin_rivera_borrego, lia_nauth, jawhar…
$ company                        <fct> not_applicable, not_applicable, not_app…
$ days_in_waiting_list           <dbl> -0.07480674, -0.07480674, -0.07480674, …
$ customer_type                  <fct> transient, transient_party, transient, …
$ required_car_parking_spaces    <dbl> 2.0269690, -0.4916827, -0.4916827, -0.4…
$ total_of_special_requests      <dbl> 0.7696991, -1.0460414, -1.0460414, 1.16…
$ arrival_date                   <date> 2016-07-02, 2016-07-02, 2016-07-02, 20…
$ near_christmas                 <dbl> -0.07877758, -0.07877758, -0.07877758, …
$ near_new_years                 <dbl> -0.1043769, -0.1043769, -0.1043769, -0.…
$ historical_adr                 <dbl> 0.6706595, 0.6706595, 0.6706595, 0.6706…
$ arrival_date_char              <fct> 2016.5, 2016.5, 2016.5, 2016.5, 2016.5,…

Now we have columns that are 1) logarithmically transformed to approximate the normal distribution and 2) scaled for normality (mean = 0, stdev = 1).

We don’t have any constants, doubles, or bijections in the data, so no need to account for those.

Model resources

Splitting for training/testing sets

As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.

set.seed(819)                                                               # Set reproducible seed
hotel_sampl <- sample_n(hotel_clean, ceiling(nrow(hotel_clean) * 0.25))     # Take a quarter of the data randomly
split_obj <- initial_split(hotel_sampl, prop = 0.75)                        # Split object

hotel_train <- training(split_obj)                                          # Split for training
hotel_test <- testing(split_obj)                                            # Split for testing

Note

To expedite the computational process of cross-validating an SVM model, the data was trimmed by 75% via random sampling prior to training/testing splits.

Training and tuning

Cross validation tuning

What we want to do is setup some cross validation for the hyperparameters needed by SVM, specifically sigma and cost. We can do that with {caret} and its trainControl() function:

fit_summary <- function(data, lev = NULL, model = NULL) {
  RMSE <- sqrt(mean((data$obs - data$pred)^2))
  R2 <- cor(data$obs, data$pred)^2
  
  out <- c(rmse = RMSE, rsq = R2)
  return(out)
}

fit_control <- trainControl(
    method = "repeatedcv", 
    number = 3, 
    repeats = 3, 
    summaryFunction = fit_summary
)

fit_grid <- expand.grid(
    sigma = c(0.01, 0.1, 0.5), 
    C = c(0.05, 1, 3)
)

Next, we define an SVM model using these tuning and cross validation settings.

svm_fit <- train(
    avg_price_per_room ~ . -arrival_date, 
    data = hotel_train, 
    method = "svmRadial", 
    trControl = fit_control, 
    metric = "Rsquared", 
    preProcess = NULL, 
    tuneGrid = fit_grid
)

svm_fit

Support Vector Machines with Radial Basis Function Kernel 

2888 samples
  27 predictor

No pre-processing
Resampling: Cross-Validated (3 fold, repeated 3 times) 
Summary of sample sizes: 1926, 1924, 1926, 1924, 1926, 1926, ... 
Resampling results across tuning parameters:

  sigma  C     rmse      rsq      
  0.01   0.05  68.25766  0.5809770
  0.01   1.00  51.34404  0.6992191
  0.01   3.00  38.31511  0.7745584
  0.10   0.05  68.47585  0.5024344
  0.10   1.00  55.24348  0.5995189
  0.10   3.00  43.78522  0.6811902
  0.50   0.05  69.37848  0.2032568
  0.50   1.00  69.02086  0.2117694
  0.50   3.00  68.25338  0.2181781

rmse was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.5 and C = 0.05.

The fitted model description gives us some good info:

We see the values for sigma and cost (C) that we had defined in our tuning grid
We see metric values we defined in the train function

We can use this information to make judgements about the best hyper parameters.

Evaluating tuning

Let’s use the produced information to generate a plot. We want to see what happens to \(RMSE\) and \(R^2\) given the changes made to sigma and C.

First, let’s pivot the data to make plotting easier:

results_df <- svm_fit$results |>
    select(-c(rmseSD, rsqSD)) |>
    pivot_longer(
        cols = c("rmse", "rsq"),
        names_to = "metric",
        values_to = "metric_val"
    ) |>
    pivot_longer(
        cols = c("sigma", "C"),
        names_to = "hyperparameter",
        values_to = "hyperparameter_val"
    )

results_df

# A tibble: 36 × 4
   metric metric_val hyperparameter hyperparameter_val
   <chr>       <dbl> <chr>                       <dbl>
 1 rmse       68.3   sigma                        0.01
 2 rmse       68.3   C                            0.05
 3 rsq         0.581 sigma                        0.01
 4 rsq         0.581 C                            0.05
 5 rmse       51.3   sigma                        0.01
 6 rmse       51.3   C                            1   
 7 rsq         0.699 sigma                        0.01
 8 rsq         0.699 C                            1   
 9 rmse       38.3   sigma                        0.01
10 rmse       38.3   C                            3   
# ℹ 26 more rows

Now let’s generate a faceted plot:

ggplot(
    results_df, 
    aes(hyperparameter_val, metric_val)
) +
    geom_line() +
    geom_point() +
    facet_grid(metric~hyperparameter, scales = "free") +
    theme_minimal() +                                                           # Theme styling
    theme(
        panel.background = element_rect(color = "#707271"), 
        axis.title.x = element_blank(), 
        axis.title.y = element_blank(), 
        panel.grid.major.x = element_blank(), 
        strip.background = element_rect(fill = "#BE0000"), 
        strip.text = element_text(color = "white", face = "bold")
    )

We can look at this plot top to bottom, left to right.

Hyperparameter “C”
- \(RMSE\) oscilates high to low as as cost goes up
- \(R^2\) also jumps up and down as cost goes up
- No clear indication of a threshold being crossed with cost
Sigma
- \(RMSE\) starts low with sigma being low and then spikes (remember, we want this error low)
- \(R^2\) does the opposite, starting high and the tanking (remember, we want this value high)
- Clearly, we want Sigma to be low

We’ll proceed to use a custom model that balances both of these well: \(C = 0.05\) and \(sigma = 0.01\).

Final model

Define final model

We’ll create a custom SVM model using the hyperparameters evaluated above.

svm_custom <- svm(
    avg_price_per_room ~ . -arrival_date, 
    data = hotel_train, 
    cost = 0.05, 
    sigma = 0.01
)

Predict on testing

Let’s proceed to predict values using the tuned model from above but on the testing data set.

svm_pred <- predict(svm_custom, newdata = hotel_test)

Let’s combine these values back with our original prices and print 10 random values:

model_results <- 
    hotel_test |>
    select(actual_rate = avg_price_per_room) |>
    mutate(pred_rate = svm_pred)

model_results |>
    sample_n(10)

# A tibble: 10 × 2
   actual_rate pred_rate
         <dbl>     <dbl>
 1        87        85.1
 2        80        72.9
 3       177.      113. 
 4       217.      119. 
 5        85.5      85.0
 6        45        58.7
 7       149       104. 
 8        70.8      88.0
 9        68        71.3
10        50        65.0

Results

We can calculate the final \(RMSE\) and \(R^2\) on the predictions:

list(
    RMSE = sqrt(mean((model_results$actual_rate - model_results$pred_rate)^2)), 
    R2 = cor(model_results$actual_rate, model_results$pred_rate)^2
)

$RMSE
[1] 52.73644

$R2
[1] 0.6750478