library('tidyverse') # Wrapper for many convenient libraries
library('modeldata') # Contains data for the assignment
library('skimr') # Quickly summarise data
library('gt') # Render "great tables"
library('e1071') # Loading library for SVM
library('rsample') # Data splitting utilities
library('caret') # Model training and tuning

Support Vector Machine | Hotel Rates
Assignment Questions
Name
What is your name? Include all team members if submitting as a group.
Adam Bushman [u6049169]; no group members.
Perspective
From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)
I am a data consultant for a hotel chain. They are interested in standardizing pricing strategies and have recruited me to help. They are hoping for a predictive model for nightly room rates. They want to test drive the approach using a single hotel they suspect has had some inconsistent pricing strategies.
Question
What is your question?
Can a predictive model be generated from historical nightly room rates such that 1) underlying patterns are accounted for and 2) unwanted variance is “smoothed” for better consistency across locations within the chain?
The goal would be to settle on a model that fairly explains the variance but isn’t so sensitive that it chases the suspected anomalies of a single hotel location.
Data set
Describe your dataset(s) including URL (if available)
The data set is sourced from Posit via their {modeldata} package. The original source is a publication by Antonio, de Almeida, and Nunes (2019), and the data represent historical rates at a single hotel (the “Resort Hotel”) in Lisbon, Portugal.
Predictor(s) and target(s)
What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).
Little was described about the data in Posit’s reference. However, a comprehensive dictionary was found that holistically described the variables.
The data set features 28 columns and just over 15.4K rows. There are 9 features of type factor, 18 of type numeric, and 1 of type date. A comprehensive table of variable names, data types, and descriptions can be found here.
The dependent variable (target) aligning with the use case is avg_price_per_room. The independent variables included all but arrival_date (much of the seasonality seen in dates is captured via other features, such as near_christmas and stays_in_weekend_nights).
Model reasonability
How are your variables suitable for your analysis method?
The variables chosen are suitable for the analysis method of a Support Vector Machine thanks to the following:
- Many features; while the data set technically includes only 28 columns, some categorical fields contain a high number of unique values, so the effective feature space (once dummy-encoded) is much larger
- Generalization; SVMs do a good job of avoiding overfitting provided the data is pre-processed properly
The model may not be the ultimate best choice given other considerations:
- SVM for classification; Support Vector Machines are generally used for classification problems but can be deployed in regression scenarios, though this is less common
- While potentially unorthodox, I was interested in the viability of Support Vector Regression
- Many records; kernel tricks are computationally intensive with the entire data set
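The regression variant hinges on the epsilon-insensitive loss, which ignores errors that fall inside a tolerance tube around the prediction. A minimal base-R sketch (illustrative only, not part of the assignment pipeline):

```r
# Epsilon-insensitive loss, the piece that turns SVMs into regressors:
# residuals inside the tolerance tube of width epsilon cost nothing,
# and only larger errors are penalized (linearly)
eps_loss <- function(residual, epsilon = 0.1) {
  pmax(0, abs(residual) - epsilon)
}

eps_loss(c(-0.05, 0.08, 0.30)) # first two fall inside the tube: 0, 0, 0.2
```

This tolerance is what gives SVR its “smoothing” character: small deviations from the fitted function are simply not penalized.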
Conclusions
What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?
The SVM regression model predicted average nightly hotel rates/prices, balancing \(RMSE\) and \(R^2\) results during cross-validated tuning of cost and sigma on the training data.
Based on the tuning results visualized in this graph, a custom model was fitted to the training data with custom cost and sigma values.
The resulting model is adequate, with performance similar to what was found in cross-validated examples (see resulting \(RMSE\) and \(R^2\) figures here).
It is anticipated the hotel chain will be sufficiently encouraged by this first step and, upon review of the below assumptions, will be prepared to invest additional time in improving the model.
Assumptions
What are your assumptions and limitations? Did you include any robustness checks?
SVM regression
I wanted to explore the viability of SVMs in a regression setting. While traditionally applied to classification problems, they can also handle regression, and I wanted to test that here.
I have a sense now that it can be a fair alternative to the OLS family of regression should the data set fail to meet its rigorous assumptions. However, it does come at a cost, namely:
- Computation (the models took a long time to run and I had to reduce the data set dimensions)
- Interpretability (there really is no way to peel back the curtain and evaluate individual relationships)
- Performance (the results left something to be desired)
- Model evaluation (limited to aggregate metrics like \(RMSE\) and \(R^2\) rather than coefficient-level inference)
Chosen values for cross-validated hyperparameters
I chose to use the following values for SVM hyperparameters
- sigma: 0.01, 0.10, 0.50
- cost: 0.05, 1.00, 3.00
In one sense, these are arbitrary; I had no rules of thumb, per se. However, I tried to cover a reasonably wide range of possible values. The idea is to explore the performance of hyperparameter values across differing thresholds and methodically discover good ranges.
Not transforming target variable
Due to time and complexity, I opted not to center, scale, or transform the target variable. This certainly affected the quality of the predictions. With more time, this is a critical next step for boosting performance.
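As a sketch of that next step (with a hypothetical `price` column standing in for the target), the target could be log-transformed and standardized before fitting, with predictions inverted back to the price scale afterward:

```r
# Hypothetical sketch: transform the target before fitting, then invert
# the transform on predictions. Assumes a strictly positive price column.
train <- data.frame(price = c(55, 82, 142, 60, 95))

log_price <- log(train$price)  # compress the long right tail
z_price   <- scale(log_price)  # standardize (mean 0, sd 1)

mu  <- attr(z_price, "scaled:center")
sig <- attr(z_price, "scaled:scale")

# Apply to model predictions made on the transformed scale:
back_transform <- function(pred_z) exp(pred_z * sig + mu)

# Round-trip check: inverting the transform recovers the original rates
all.equal(back_transform(as.numeric(z_price)), train$price)
```

The round-trip check confirms the inverse transform is exact, so model quality can be compared on the original dollar scale.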
Assignment Workflow
Analysis Prep
Loading packages
Loading the data
We’ll start off by referencing the “Hotel Rates” data for the assignment that we’re sourcing from the {modeldata} package.
hotel_raw <- modeldata::hotel_rates # Data for the assignment

With it loaded into the session, let’s get a sense for what we’re working with.
Data set inspection
Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.
Posit, the company behind {modeldata}, did not include a data dictionary; however, the variables are sufficiently self-explanatory. Below is a derived data dictionary:
gt(hotel_dict) %>% # Create a "great tables" (gt) object
cols_label( # Rename some columns
variable = "Variable Name",
datatype = "Data Type",
description = "Variable Description"
) %>%
tab_options(
column_labels.background.color = "#d9230f", # Apply red header background
column_labels.font.weight = "bold", # Bold headers
row.striping.background = '#FFFFFF' # Remove row striping
)

| Variable Name | Data Type | Variable Description |
|---|---|---|
| avg_price_per_room | <double> | Sum of all lodging transactions divided by total number of staying nights |
| lead_time | <double> | Number of days that elapsed between the entering date of the booking into the PMS and the arrival date |
| stays_in_weekend_nights | <double> | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | <double> | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | <double> | Number of adults |
| children | <double> | Number of children |
| babies | <double> | Number of babies |
| meal | <factor> | Type of meal booked |
| country | <factor> | Country of origin |
| market_segment | <factor> | Market segment designation: travel agents (TA), tour operators (TO), etc. |
| distribution_channel | <factor> | Booking distribution channel: travel agents (TA), tour operators (TO), etc. |
| is_repeated_guest | <double> | Value indicating if the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | <double> | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | <double> | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | <factor> | Code of room type reserved |
| assigned_room_type | <factor> | Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type |
| booking_changes | <double> | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation |
| agent | <factor> | ID of the travel agency that made the booking |
| company | <factor> | ID of the company/entity that made the booking or responsible for paying the booking |
| days_in_waiting_list | <double> | Number of days the booking was in the waiting list before it was confirmed to the customer |
| customer_type | <factor> | Type of booking, assuming one of four categories: contract, group, transient, transient-party |
| required_car_parking_spaces | <double> | Number of car parking spaces required by the customer |
| total_of_special_requests | <double> | Number of special requests made by the customer (e.g. twin bed or high floor) |
| arrival_date | <date> | Date of arrival for hotel stay |
| arrival_date_num | <double> | Numeric (decimal-year) encoding of the arrival date |
| near_christmas | <double> | Flag for hotel stays near Christmas |
| near_new_years | <double> | Flag for hotel stays near New Year's |
| historical_adr | <double> | Historical average daily rate/price per room |
Using the {skimr} package, we can get a comprehensive summary of the data.
skim(hotel_raw)

| Name | hotel_raw |
| Number of rows | 15402 |
| Number of columns | 28 |
| _______________________ | |
| Column type frequency: | |
| Date | 1 |
| factor | 9 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| arrival_date | 0 | 1 | 2016-07-02 | 2017-08-31 | 2017-02-05 | 426 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| meal | 0 | 1 | FALSE | 4 | bed: 11767, bre: 3047, no_: 480, bre: 108 |
| country | 0 | 1 | FALSE | 99 | prt: 4869, gbr: 3465, esp: 1449, irl: 934 |
| market_segment | 0 | 1 | FALSE | 5 | onl: 6742, dir: 3076, off: 2895, gro: 1789 |
| distribution_channel | 0 | 1 | FALSE | 3 | ta_: 10710, dir: 3361, cor: 1331, und: 0 |
| reserved_room_type | 0 | 1 | FALSE | 8 | a: 8571, d: 3058, e: 2046, f: 552 |
| assigned_room_type | 0 | 1 | FALSE | 9 | a: 6046, d: 4216, e: 2274, c: 974 |
| agent | 0 | 1 | FALSE | 125 | dev: 4659, not: 3443, ale: 1484, cha: 931 |
| company | 0 | 1 | FALSE | 167 | not: 14012, par: 388, lin: 72, ber: 71 |
| customer_type | 0 | 1 | FALSE | 4 | tra: 11001, tra: 3432, con: 807, gro: 162 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| avg_price_per_room | 0 | 1 | 104.90 | 65.53 | 19.00 | 55.00 | 82.00 | 142.00 | 426.25 | ▇▃▂▁▁ |
| lead_time | 0 | 1 | 90.03 | 101.58 | 0.00 | 7.00 | 45.00 | 154.00 | 542.00 | ▇▂▁▁▁ |
| stays_in_weekend_nights | 0 | 1 | 1.19 | 1.16 | 0.00 | 0.00 | 1.00 | 2.00 | 19.00 | ▇▁▁▁▁ |
| stays_in_week_nights | 0 | 1 | 3.12 | 2.48 | 0.00 | 1.00 | 3.00 | 5.00 | 50.00 | ▇▁▁▁▁ |
| adults | 0 | 1 | 1.85 | 0.46 | 0.00 | 2.00 | 2.00 | 2.00 | 4.00 | ▁▂▇▁▁ |
| children | 0 | 1 | 0.12 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| babies | 0 | 1 | 0.02 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| is_repeated_guest | 0 | 1 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| previous_cancellations | 0 | 1 | 0.01 | 0.13 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 | ▇▁▁▁▁ |
| previous_bookings_not_canceled | 0 | 1 | 0.24 | 1.37 | 0.00 | 0.00 | 0.00 | 0.00 | 30.00 | ▇▁▁▁▁ |
| booking_changes | 0 | 1 | 0.36 | 0.77 | 0.00 | 0.00 | 0.00 | 0.00 | 10.00 | ▇▁▁▁▁ |
| days_in_waiting_list | 0 | 1 | 0.47 | 7.13 | 0.00 | 0.00 | 0.00 | 0.00 | 185.00 | ▇▁▁▁▁ |
| required_car_parking_spaces | 0 | 1 | 0.20 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 8.00 | ▇▁▁▁▁ |
| total_of_special_requests | 0 | 1 | 0.76 | 0.87 | 0.00 | 0.00 | 1.00 | 1.00 | 5.00 | ▇▂▁▁▁ |
| arrival_date_num | 0 | 1 | 2017.08 | 0.33 | 2016.50 | 2016.80 | 2017.10 | 2017.37 | 2017.66 | ▇▇▇▇▇ |
| near_christmas | 0 | 1 | 0.01 | 0.08 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| near_new_years | 0 | 1 | 0.01 | 0.10 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| historical_adr | 0 | 1 | 86.34 | 39.35 | 41.96 | 52.27 | 71.35 | 116.23 | 167.49 | ▇▃▂▂▂ |
Initial observations include:
- With over 15K rows, we may be near the practical upper limit for a support vector machine model; we may want to randomly sample from this set
- We have no missing values; this means we won’t have to eliminate records, features, or determine the best way to impute values
- We do have some skewed distributions, most with long right tails; we’d likely benefit from some logarithmic transformations and scaling
- An SVM model should handle many features well; we have some factor data types with many unique values (country, agent, company)
Simple Exploratory Data Analysis
Preprocessing
Data cleaning / Feature Engineering
As mentioned, we’d be wise to transform and scale numeric variables that are highly skewed to the right. We’ll do this for all but the target variable.
We can do this fairly easily with the following code:
hotel_clean <-
hotel_raw |>
mutate(arrival_date_char = factor(as.character(arrival_date_num))) |>
mutate(across(
where(is.numeric) & !c(avg_price_per_room),
~as.numeric(scale(log(. + 0.01 * max(.))))
)) |>
select(-arrival_date_num)

glimpse(hotel_clean)

Rows: 15,402
Columns: 28
$ avg_price_per_room <dbl> 110.00, 74.00, 81.90, 81.00, 112.20, 90…
$ lead_time <dbl> 1.2443569, 1.3335593, 1.2648218, 1.2293…
$ stays_in_weekend_nights <dbl> -1.2994717, 0.8451441, 0.8451441, 0.845…
$ stays_in_week_nights <dbl> -0.9026065, 0.8961269, 0.8961269, 0.896…
$ adults <dbl> 0.3910441, 0.3910441, 0.3910441, 0.3910…
$ children <dbl> 2.9770184, -0.3066203, -0.3066203, -0.3…
$ babies <dbl> -0.1270955, -0.1270955, -0.1270955, -0.…
$ meal <fct> bed_and_breakfast, bed_and_breakfast, b…
$ country <fct> prt, aus, gbr, prt, gbr, null, prt, esp…
$ market_segment <fct> online_travel_agent, offline_travel_age…
$ distribution_channel <fct> ta_to, ta_to, ta_to, ta_to, ta_to, ta_t…
$ is_repeated_guest <dbl> -0.2653233, -0.2653233, -0.2653233, -0.…
$ previous_cancellations <dbl> -0.07902931, -0.07902931, -0.07902931, …
$ previous_bookings_not_canceled <dbl> -0.2704553, -0.2704553, -0.2704553, -0.…
$ reserved_room_type <fct> a, a, a, a, a, a, f, e, h, a, a, g, a, …
$ assigned_room_type <fct> c, a, c, a, a, a, f, f, h, e, e, g, e, …
$ booking_changes <dbl> -0.5519376, -0.5519376, 1.5205029, -0.5…
$ agent <fct> devin_rivera_borrego, lia_nauth, jawhar…
$ company <fct> not_applicable, not_applicable, not_app…
$ days_in_waiting_list <dbl> -0.07480674, -0.07480674, -0.07480674, …
$ customer_type <fct> transient, transient_party, transient, …
$ required_car_parking_spaces <dbl> 2.0269690, -0.4916827, -0.4916827, -0.4…
$ total_of_special_requests <dbl> 0.7696991, -1.0460414, -1.0460414, 1.16…
$ arrival_date <date> 2016-07-02, 2016-07-02, 2016-07-02, 20…
$ near_christmas <dbl> -0.07877758, -0.07877758, -0.07877758, …
$ near_new_years <dbl> -0.1043769, -0.1043769, -0.1043769, -0.…
$ historical_adr <dbl> 0.6706595, 0.6706595, 0.6706595, 0.6706…
$ arrival_date_char <fct> 2016.5, 2016.5, 2016.5, 2016.5, 2016.5,…
Now we have columns that are 1) logarithmically transformed to approximate a normal distribution and 2) standardized (mean = 0, standard deviation = 1).
We don’t have any constants, duplicated columns, or bijections in the data, so no need to account for those.
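A quick sanity check of the recipe above (on a toy vector, since the full data set isn’t needed to demonstrate it): every transformed column should come out with mean ≈ 0 and standard deviation ≈ 1.

```r
# Same transform recipe as in hotel_clean, applied to a toy column
x <- c(0, 7, 45, 154, 542)                      # lead_time-like values
z <- as.numeric(scale(log(x + 0.01 * max(x))))  # log with offset, then standardize

round(mean(z), 6) # 0
round(sd(z), 6)   # 1
```

The small offset (`0.01 * max(.)`) keeps the log defined for zero-valued counts, which several of these columns contain.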
Model resources
Splitting for training/testing sets
As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.
set.seed(819) # Set reproducible seed
hotel_sampl <- sample_n(hotel_clean, ceiling(nrow(hotel_clean) * 0.25)) # Take a quarter of the data randomly
split_obj <- initial_split(hotel_sampl, prop = 0.75) # Split object
hotel_train <- training(split_obj) # Split for training
hotel_test <- testing(split_obj) # Split for testing

To expedite the computational process of cross-validating an SVM model, the data was trimmed by 75% via random sampling prior to training/testing splits.
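For readers without {rsample}, the same sample-then-split arithmetic can be sketched in base R on row indices; the resulting set sizes match the workflow above.

```r
# Base-R sketch of the sampling + splitting arithmetic
set.seed(819)
n       <- 15402                          # rows in hotel_clean
sampled <- sample(n, ceiling(n * 0.25))   # quarter of the data
train_i <- sample(sampled, floor(length(sampled) * 0.75))
test_i  <- setdiff(sampled, train_i)

length(train_i) # 2888 training rows
length(test_i)  # 963 testing rows
```

The 2,888 training rows line up with the sample count reported in the fitted model output later on.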
Training and tuning
Cross validation tuning
What we want to do is set up some cross-validation for the hyperparameters needed by SVM, specifically sigma and cost. We can do that with {caret} and its trainControl() function:
fit_summary <- function(data, lev = NULL, model = NULL) {
RMSE <- sqrt(mean((data$obs - data$pred)^2))
R2 <- cor(data$obs, data$pred)^2
out <- c(rmse = RMSE, rsq = R2)
return(out)
}
fit_control <- trainControl(
method = "repeatedcv",
number = 3,
repeats = 3,
summaryFunction = fit_summary
)
fit_grid <- expand.grid(
sigma = c(0.01, 0.1, 0.5),
C = c(0.05, 1, 3)
)

Next, we define an SVM model using these tuning and cross validation settings.
svm_fit <- train(
avg_price_per_room ~ . -arrival_date,
data = hotel_train,
method = "svmRadial",
trControl = fit_control,
metric = "Rsquared",
preProcess = NULL,
tuneGrid = fit_grid
)svm_fitSupport Vector Machines with Radial Basis Function Kernel
2888 samples
27 predictor
No pre-processing
Resampling: Cross-Validated (3 fold, repeated 3 times)
Summary of sample sizes: 1926, 1924, 1926, 1924, 1926, 1926, ...
Resampling results across tuning parameters:
sigma C rmse rsq
0.01 0.05 68.25766 0.5809770
0.01 1.00 51.34404 0.6992191
0.01 3.00 38.31511 0.7745584
0.10 0.05 68.47585 0.5024344
0.10 1.00 55.24348 0.5995189
0.10 3.00 43.78522 0.6811902
0.50 0.05 69.37848 0.2032568
0.50 1.00 69.02086 0.2117694
0.50 3.00 68.25338 0.2181781
rmse was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.5 and C = 0.05.
The fitted model description gives us some good info:
- We see the values for sigma and cost (C) that we had defined in our tuning grid
- We see metric values we defined in the train function
We can use this information to make judgments about the best hyperparameters.
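Note that caret’s automatic selection above picked the *largest* rmse (an artifact of supplying a custom summaryFunction without also setting `maximize = FALSE` in train()), so inspecting the table ourselves is worthwhile. A sketch, with the resampling results re-keyed by hand from the output above:

```r
# Resampling results copied from the svm_fit output
results <- data.frame(
  sigma = rep(c(0.01, 0.10, 0.50), each = 3),
  C     = rep(c(0.05, 1.00, 3.00), times = 3),
  rmse  = c(68.26, 51.34, 38.32, 68.48, 55.24, 43.79, 69.38, 69.02, 68.25),
  rsq   = c(0.581, 0.699, 0.775, 0.502, 0.600, 0.681, 0.203, 0.212, 0.218)
)

# Lowest cross-validated RMSE sits at sigma = 0.01, C = 3; a smaller C
# remains a defensible choice when a smoother, less sensitive model is the goal
results[which.min(results$rmse), c("sigma", "C", "rmse", "rsq")]
```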
Evaluating tuning
Let’s use the produced information to generate a plot. We want to see what happens to \(RMSE\) and \(R^2\) given the changes made to sigma and C.
First, let’s pivot the data to make plotting easier:
results_df <- svm_fit$results |>
select(-c(rmseSD, rsqSD)) |>
pivot_longer(
cols = c("rmse", "rsq"),
names_to = "metric",
values_to = "metric_val"
) |>
pivot_longer(
cols = c("sigma", "C"),
names_to = "hyperparameter",
values_to = "hyperparameter_val"
)
results_df

# A tibble: 36 × 4
metric metric_val hyperparameter hyperparameter_val
<chr> <dbl> <chr> <dbl>
1 rmse 68.3 sigma 0.01
2 rmse 68.3 C 0.05
3 rsq 0.581 sigma 0.01
4 rsq 0.581 C 0.05
5 rmse 51.3 sigma 0.01
6 rmse 51.3 C 1
7 rsq 0.699 sigma 0.01
8 rsq 0.699 C 1
9 rmse 38.3 sigma 0.01
10 rmse 38.3 C 3
# ℹ 26 more rows
Now let’s generate a faceted plot:
ggplot(
results_df,
aes(hyperparameter_val, metric_val)
) +
geom_line() +
geom_point() +
facet_grid(metric~hyperparameter, scales = "free") +
theme_minimal() + # Theme styling
theme(
panel.background = element_rect(color = "#707271"),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.major.x = element_blank(),
strip.background = element_rect(fill = "#BE0000"),
strip.text = element_text(color = "white", face = "bold")
)

We can look at this plot top to bottom, left to right.
- Hyperparameter “C”
- \(RMSE\) oscillates from high to low as cost goes up
- \(R^2\) also jumps up and down as cost goes up
- No clear indication of a threshold being crossed with cost
- Sigma
- \(RMSE\) starts low with sigma being low and then spikes (remember, we want this error low)
- \(R^2\) does the opposite, starting high and then tanking (remember, we want this value high)
- Clearly, we want Sigma to be low
We’ll proceed to use a custom model that balances both of these well: \(C = 0.05\) and \(sigma = 0.01\).
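One terminology note before fitting: caret’s "svmRadial" comes from {kernlab}, whose RBF kernel is written exp(-sigma * ||x - y||^2), while {e1071} writes its radial kernel as exp(-gamma * ||u - v||^2). The parameterizations coincide, so kernlab’s sigma carries over numerically as e1071’s gamma. A small check:

```r
# The two packages' radial-basis kernels, written out explicitly
rbf_kernlab <- function(x, y, sigma) exp(-sigma * sum((x - y)^2))
rbf_e1071   <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

x <- c(1, 2); y <- c(2, 4)
all.equal(rbf_kernlab(x, y, sigma = 0.01), rbf_e1071(x, y, gamma = 0.01)) # TRUE
```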
Final model
Define final model
We’ll create a custom SVM model using the hyperparameters evaluated above.
svm_custom <- svm(
avg_price_per_room ~ . -arrival_date,
data = hotel_train,
cost = 0.05,
gamma = 0.01 # {e1071} exposes the RBF kernel width as gamma, not sigma
)

Predict on testing
Let’s proceed to predict values using the tuned model from above but on the testing data set.
svm_pred <- predict(svm_custom, newdata = hotel_test)

Let’s combine these values back with our original prices and print 10 random values:
model_results <-
hotel_test |>
select(actual_rate = avg_price_per_room) |>
mutate(pred_rate = svm_pred)
model_results |>
sample_n(10)

# A tibble: 10 × 2
actual_rate pred_rate
<dbl> <dbl>
1 87 85.1
2 80 72.9
3 177. 113.
4 217. 119.
5 85.5 85.0
6 45 58.7
7 149 104.
8 70.8 88.0
9 68 71.3
10 50 65.0
Results
We can calculate the final \(RMSE\) and \(R^2\) on the predictions:
list(
RMSE = sqrt(mean((model_results$actual_rate - model_results$pred_rate)^2)),
R2 = cor(model_results$actual_rate, model_results$pred_rate)^2
)

$RMSE
[1] 52.73644
$R2
[1] 0.6750478
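These two metrics recur throughout the workflow; wrapping them in a small helper (a sketch; caret’s postResample() offers similar functionality) keeps evaluation consistent:

```r
# Helper computing the two evaluation metrics used throughout
rmse_r2 <- function(actual, pred) {
  c(RMSE = sqrt(mean((actual - pred)^2)),
    R2   = cor(actual, pred)^2)
}

# Usage on toy rates; perfect predictions give RMSE = 0 and R2 = 1
rmse_r2(actual = c(87, 80, 45), pred = c(87, 80, 45))
```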