library('tidyverse') # Wrapper for many convenient libraries
library('modeldata') # Contains data for the assignment
library('skimr') # Quickly summarise data
library('gt') # Render "great tables"
library('xgboost') # Package for leveraging boosted trees
library('tidymodels') # Wrapper for convenient modeling framework
library('vip') # Generating important feature plots
Boosted Trees | Food Delivery Times
Assignment Questions
Name
What is your name? Include all team members if submitting as a group.
Adam Bushman [u6049169]; no group members.
Perspective
From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)
I am a former data professional chasing a dream of owning a restaurant. While menu curation and delighting customers with my cuisine is a passion, the experience in my former career continually prompts me to measure activities in the restaurant and analyze them for decision making. What’s most got me curious of late is the delivery program.
Much of the value a restaurant proposes to customers is the experience of being waited on, dining with friends or family, and enjoying a beautifully presented and exquisitely tasting dish.
The proposition changes in dramatic fashion with take-out. Now a rewarding experience is first and foremost a quick delivery. Managing expectations is the name of the game in delivery. Even 5 minutes of underestimation could result in a negative review.
Question
What is your question?
Can a predictive model be generated to 1) accurately predict time from order to delivery and 2) shed light on situations not suitable for delivery? This would be valuable information as we work to make the delivery program as successful as our dine-in experience.
Data set
Describe your dataset(s) including URL (if available)
The data set is sourced from Posit via their {modeldata} package. The dataset itself is named Food Delivery Time Data. There is no mention of the original source of the data, nor is there any mention of it being synthetically generated.
Predictor(s) and target(s)
What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).
The dataset is nicely aligned with my needs as a restaurateur looking to optimize the delivery program. The dependent (target) variable in this dataset is time_to_delivery, defined as the time from initial order to receiving the food.
Independent (predictor) variables come in two flavors (pardon the pun):
- The dataset contains three (3) context-related variables: hour (of the day the order was received), day (of the week the order was received), and distance, the approximate number of miles between the restaurant and the delivery location.
- Lastly, there exist 27 item_## variables that measure the quantity of the corresponding menu item included in the order.
There are just over 10,000 rows in the data. Additional information (including data types) can be found by referencing the data dictionary.
Model reasonability
How are your variables suitable for your analysis method?
The variables are suitable for a boosted trees approach given none have obvious, direct relationships with time_to_delivery. We can see in the EDA portion that many of the features possess complex relationships that would be tough to extract with the OLS family of models.
Boosted trees are also ideal given we have a fair volume of predictors (30) and given our target variable isn’t normally distributed and doesn’t relate linearly to the predictors.
Conclusions
What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?
I leveraged a boosted trees regression model using XGBoost in the {tidymodels} framework. I dynamically tuned 4 hyperparameters (mtry, trees, tree_depth, and learning_rate).
The best model produced some excellent performance metrics: \(R^2=0.902\) and \(RMSE=2.22\) on the test set. These are great indications of a robust model for predicting time_to_delivery.
Another important motivation was to identify areas of operations for which to focus. We isolated the important features of the model.
While boosted trees are an opaque model that won’t give us detailed relationship estimates, we were able to say that the hour of day and the weekend days were the two areas to stress operational efficiency for controlling delivery times.
Assumptions
What are your assumptions and limitations? Did you include any robustness checks?
We made the assumption that this problem was an ideal one for boosted trees. As a restaurateur, achieving a high-performing model was of interest. Given the non-parametric nature of the data, found in exploratory data analysis, this aligns well.
However, my interest in explaining relationships between predictors and target is not well addressed by an opaque model like boosted trees. We did look into important features of the model and gained some simple insights, but weren’t able to describe the complex interplay.
We did ensure high data quality and guarded against overfitting with extensive hyperparameter tuning and cross-validation.
Assignment Workflow
Analysis Prep
Loading packages
Loading the data
We’ll start off by referencing the “Food Delivery Times” data for the assignment that we’re sourcing from the {modeldata} package.
delivery_raw <- modeldata::deliveries # Data for the assignment
With it loaded into the session, let’s get a sense for what we’re working with.
Data set inspection
Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.
The following data dictionary was included in the description of the dataset:
gt(delivery_dict) %>% # Create a "great tables" (gt) object
cols_label( # Rename some columns
variable = "Variable Name",
datatype = "Data Type",
description = "Variable Description"
) %>%
tab_options(
column_labels.background.color = "#d9230f", # Apply red header background
column_labels.font.weight = "bold", # Bold headers
row.striping.background = '#FFFFFF' # Remove row striping
)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| time_to_delivery | <numeric> | Time from the initial order to receiving the food in minutes |
| hour | <numeric> | The time, in decimal hours, of the order |
| day | <factor> | The day of the week for the order |
| distance | <numeric> | The approximate distance in miles between the restaurant and the delivery location. |
| item_## | <integer> | A set of 27 predictors that count the number of distinct menu items in the order |
Using the {skimr} package, we can get a comprehensive summary of the data.
skim(delivery_raw)
| Name | delivery_raw |
| Number of rows | 10012 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 30 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| day | 0 | 1 | FALSE | 7 | Sat: 2231, Fri: 2133, Thu: 1697, Sun: 1432 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| time_to_delivery | 0 | 1 | 26.14 | 6.88 | 12.31 | 21.07 | 26.02 | 30.40 | 60.54 | ▅▇▂▁▁ |
| hour | 0 | 1 | 16.28 | 2.44 | 11.03 | 14.48 | 16.57 | 18.22 | 20.93 | ▃▅▇▇▅ |
| distance | 0 | 1 | 2.99 | 1.00 | 1.75 | 2.29 | 2.69 | 3.38 | 11.38 | ▇▂▁▁▁ |
| item_01 | 0 | 1 | 0.09 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| item_02 | 0 | 1 | 0.20 | 0.44 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_03 | 0 | 1 | 0.18 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | ▇▂▁▁▁ |
| item_04 | 0 | 1 | 0.23 | 0.47 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_05 | 0 | 1 | 0.15 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_06 | 0 | 1 | 0.17 | 0.41 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_07 | 0 | 1 | 0.18 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 4.00 | ▇▂▁▁▁ |
| item_08 | 0 | 1 | 0.20 | 0.45 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_09 | 0 | 1 | 0.16 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_10 | 0 | 1 | 0.08 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_11 | 0 | 1 | 0.21 | 0.46 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_12 | 0 | 1 | 0.12 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| item_13 | 0 | 1 | 0.11 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_14 | 0 | 1 | 0.12 | 0.34 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_15 | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_16 | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_17 | 0 | 1 | 0.09 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_18 | 0 | 1 | 0.09 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| item_19 | 0 | 1 | 0.04 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| item_20 | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_21 | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_22 | 0 | 1 | 0.18 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_23 | 0 | 1 | 0.09 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | ▇▁▁▁▁ |
| item_24 | 0 | 1 | 0.21 | 0.45 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▂▁▁▁ |
| item_25 | 0 | 1 | 0.11 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_26 | 0 | 1 | 0.14 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
| item_27 | 0 | 1 | 0.10 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 3.00 | ▇▁▁▁▁ |
Initial observations include:
- We have a wealth of data
  - Over 10,000 rows and 31 columns
- All the data are complete
  - No missing values in any of the columns
- All variables look to be of the right data type
- As to be expected, the order counts have some skewness to them
  - Most of the time, an item is not ordered (i.e. 0); the highest order counts of an item are 3 or 4
- Time to delivery and distance are also right-skewed
  - Since we are using a tree-based approach, we shouldn’t need to worry about approximating a normal distribution
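As a quick sanity check on the skew observation, we could compute a simple moment-based skewness for the continuous columns. This is a hedged sketch: the helper skew_val is our own (the third standardized moment), not from any loaded package.

```r
# Hypothetical helper: third standardized moment as a rough skewness measure
skew_val <- function(x) mean((x - mean(x))^3) / sd(x)^3

delivery_raw |>
  summarise(across(c(time_to_delivery, distance), skew_val))
# Positive values would confirm the long right tail seen in the histograms
```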
Simple Exploratory Data Analysis
Let’s do some additional exploration of the data.
We saw that time to delivery ranged from ~12 mins to ~61. You’d have to imagine some items take longer to prepare than others. Let’s generate a quick table to evaluate just that.
delivery_raw |>
pivot_longer(
cols = tidyr::starts_with("item"),
names_to = "item",
values_to = "quantity"
) |>
filter(quantity != 0) |>
group_by(item) |>
summarise(stats = list(summary(time_to_delivery))) |>
tidyr::unnest_wider(stats) |>
mutate(Range = Max. - Min.) |>
arrange(desc(Range)) |>
gt() |>
cols_label( # Rename some columns
item = "Menu Item",
`1st Qu.` = "1st Quartile",
`3rd Qu.` = "3rd Quartile"
) %>%
tab_options(
column_labels.background.color = "#d9230f", # Apply red header background
column_labels.font.weight = "bold", # Bold headers
row.striping.background = '#FFFFFF' # Remove row striping
)
| Menu Item | Min. | 1st Quartile | Median | Mean | 3rd Quartile | Max. | Range |
|---|---|---|---|---|---|---|---|
| item_04 | 12.4243 | 21.04100 | 26.24230 | 26.34440 | 30.66820 | 60.5413 | 48.1170 |
| item_24 | 12.5772 | 21.57020 | 26.52810 | 26.67812 | 31.02150 | 60.5151 | 47.9379 |
| item_08 | 12.6531 | 21.25125 | 26.15300 | 26.27765 | 30.51350 | 60.5413 | 47.8882 |
| item_22 | 12.6531 | 21.33063 | 26.51490 | 26.59577 | 30.96657 | 60.5151 | 47.8620 |
| item_25 | 12.6531 | 21.19508 | 26.26600 | 26.38065 | 30.48848 | 60.5151 | 47.8620 |
| item_14 | 12.7070 | 21.21465 | 26.36260 | 26.28226 | 30.54808 | 60.5151 | 47.8081 |
| item_27 | 12.4243 | 22.18740 | 26.70665 | 27.05776 | 31.16780 | 59.6949 | 47.2706 |
| item_07 | 12.4783 | 21.63160 | 26.34930 | 26.62898 | 30.97460 | 59.6949 | 47.2166 |
| item_11 | 12.6531 | 21.15215 | 26.25080 | 26.33654 | 30.72222 | 59.6949 | 47.0418 |
| item_15 | 13.1673 | 22.13648 | 26.87465 | 26.94593 | 31.65355 | 59.6949 | 46.5276 |
| item_10 | 13.1800 | 21.32140 | 26.49450 | 27.62746 | 32.61770 | 59.6949 | 46.5149 |
| item_02 | 12.5772 | 21.11925 | 26.50460 | 26.52747 | 30.83420 | 58.9796 | 46.4024 |
| item_23 | 12.9038 | 21.60370 | 26.67610 | 26.81352 | 30.95390 | 58.9796 | 46.0758 |
| item_03 | 12.9205 | 21.10970 | 26.12910 | 26.33570 | 30.48390 | 58.2942 | 45.3737 |
| item_13 | 12.9479 | 21.46220 | 26.45620 | 26.53547 | 30.71245 | 58.2942 | 45.3463 |
| item_12 | 12.3121 | 22.13505 | 26.67990 | 26.84016 | 31.11745 | 56.7099 | 44.3978 |
| item_01 | 13.1225 | 22.37620 | 27.53270 | 27.80509 | 32.09680 | 57.2644 | 44.1419 |
| item_09 | 12.8637 | 21.18978 | 26.32805 | 26.44626 | 30.73500 | 56.7099 | 43.8462 |
| item_26 | 12.9038 | 22.10530 | 26.79300 | 26.97761 | 31.36960 | 56.7099 | 43.8061 |
| item_18 | 12.9479 | 21.53610 | 26.46420 | 26.47133 | 30.67960 | 56.7200 | 43.7721 |
| item_06 | 12.5772 | 21.55695 | 26.26435 | 26.50349 | 30.70245 | 55.6706 | 43.0934 |
| item_17 | 13.1171 | 21.75330 | 26.65790 | 26.79584 | 30.99150 | 56.0992 | 42.9821 |
| item_20 | 13.1521 | 21.13478 | 26.37165 | 26.41644 | 30.98742 | 55.5648 | 42.4127 |
| item_05 | 12.6531 | 21.04755 | 26.13525 | 26.19520 | 30.70595 | 54.2924 | 41.6393 |
| item_16 | 13.4684 | 21.45665 | 26.20950 | 26.54157 | 30.41558 | 54.2924 | 40.8240 |
| item_21 | 13.2936 | 21.20717 | 26.36830 | 26.40288 | 30.70838 | 52.5625 | 39.2689 |
| item_19 | 13.0344 | 20.34940 | 25.29405 | 25.31554 | 29.67482 | 47.7540 | 34.7196 |
This table gives us some great info. Most menu items, in a best-case scenario, take about the same amount of time to deliver (all of the min and 1st quartile values are about the same). There are no drastic worst-case scenarios, as the longest delivery times are all about the same. There are a handful of menu items, most notably item_19, item_21, item_16, and item_05, that have a notably narrower range of delivery times. These menu items are far less sensitive.
Another thing we can do is look at the total quantity of an order compared to delivery time. This we can do in pretty short order:
delivery_raw |>
mutate(total_qty = rowSums(across(item_01:item_27))) |>
ggplot(aes(total_qty, time_to_delivery)) +
geom_jitter(color = '#BE0000') +
labs(
title = "Relationship between order quantity and delivery time",
x = "Total Quantity",
y = "Delivery Time"
) +
theme_minimal() +
theme(
plot.title.position = "plot"
)
If you squint hard, maybe you can see a relationship. Ultimately, it’s confounded by distance. We could try the same thing but “bin” distance into ranges. Let’s give that a go:
delivery_raw |>
mutate(
total_qty = rowSums(across(item_01:item_27)),
distance_bin = cut(distance, 6)
) |>
ggplot(aes(total_qty, time_to_delivery)) +
geom_jitter(color = '#707271') +
facet_wrap(~distance_bin, ncol = 2) +
labs(
title = "Relationship between order quantity and delivery time",
x = "Total Quantity",
y = "Delivery Time"
) +
theme_minimal() +
theme(
plot.title.position = "plot",
panel.background = element_rect(color = "#707271"),
strip.background = element_rect(fill = "#BE0000"),
strip.text = element_text(color = "white", face = "bold")
)
Interesting. It doesn’t appear the total order quantity has a strong relationship to delivery time, even when controlling for distance. Let’s do this same thing but for day of the week:
delivery_raw |>
mutate(
total_qty = rowSums(across(item_01:item_27))
) |>
ggplot(aes(total_qty, time_to_delivery)) +
geom_jitter(color = '#707271') +
facet_wrap(~day, nrow = 2) +
labs(
title = "Relationship between order quantity and delivery time",
x = "Total Quantity",
y = "Delivery Time"
) +
theme_minimal() +
theme(
plot.title.position = "plot",
panel.background = element_rect(color = "#707271"),
strip.background = element_rect(fill = "#BE0000"),
strip.text = element_text(color = "white", face = "bold")
)
Very interesting. It seems that the range of delivery time is day-dependent, but even controlling for that confounder, we aren’t seeing any hint of a relationship between order quantity and delivery time. As a restaurateur, this is somewhat curious since larger orders feel like they are more intensive to cook and package.
Conclusion
We’re so far not picking up on any potential relationships. This is shaping up as an ideal problem for a boosted tree.
Preprocessing
Data cleaning / Feature Engineering
As we’ve seen, there isn’t really any cleaning or feature engineering to do on the dataset. Tree-based algorithms don’t necessitate transformations for normal distributions or linear/polynomial relationships.
Previous to the EDA work, it was hypothesized that creating a total_qty feature would be valuable. Based on our previous results, that seems to be an insignificant feature. We won’t worry about that for now and just proceed to the next step.
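For reference, if total_qty were ever revisited, it could be folded into the preprocessing itself. A hedged sketch using step_mutate() from {recipes} (which applies dplyr-style mutations inside a recipe), shown on the raw data since splitting happens in the next section:

```r
# Sketch only -- not part of the workflow used below
recipe(time_to_delivery ~ ., data = delivery_raw) |>
  step_mutate(total_qty = rowSums(across(starts_with("item")))) |>
  step_dummy(day)
```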
We will need to convert the day variable from factor to dummy/one-hot-encoded variables. However, we’ll save that for the “recipe” step since {tidymodels} offers some real handy approaches to this.
Model resources
Splitting for training/testing sets
As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.
set.seed(814) # Set reproducible seed
split_obj <- initial_split(delivery_raw, prop = 0.75) # Split object
delivery_train <- training(split_obj) # Split for training
delivery_test <- testing(split_obj) # Split for testing
Recipe
We’re going to practice using the {tidymodels} framework for this modeling exercise. There are a lot of neat features we can make use of, first of which is a “recipe”. As a restaurateur, this always made sense to me. Just like a dish will call for ingredients and assembly steps, so does a modeling recipe.
delivery_rec <- recipe(time_to_delivery ~ ., delivery_train) |>
  step_dummy(day)
The recipe now knows what our modeling formula is and is prepped for use on the training data. Next, we’ll set up some cross validation.
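To sanity-check what the recipe produces, we could prep it and bake the training set. A quick sketch: prep() estimates any step parameters and bake(new_data = NULL) returns the processed training data.

```r
delivery_rec |>
  prep() |>                # Estimate the recipe steps on the training data
  bake(new_data = NULL) |> # Return the processed training set
  glimpse()                # day should now appear as a set of dummy columns
```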
Cross validation tuning
Cross validation is important to ensure the result of a single model isn’t an outlier. We’d hate to get a singular amazing result by chance that didn’t align with the true range of performance.
Another important reason for cross validation, in particular for a boosted tree model, is choosing hyperparameters. While running models on each fold of the data, we’ll simultaneously be searching for ideal hyperparameters that give the best performance. Hyperparameters such as:
- Number of trees
- The depth of trees
- The learning rate
- Etc.
Let’s first create our cross validation folds. We set v = 5 because we want 5 folds:
delivery_folds <- rsample::vfold_cv(delivery_train, v = 5)
Model details
Now we’ll start defining the model and the hyperparameters to tune.
delivery_modl <- boost_tree(
trees = tune(),
tree_depth = tune(),
learn_rate = tune(),
mtry = tune()
) |>
set_engine("xgboost", verbosity = 0) |>
set_mode("regression")
Next, we’ll set up the tuning.
Hyperparameter tuning
We’ll set up a grid that keeps track of the unique combinations of these hyperparameters. As we run a model on the cross-validation folds, we’ll use those combinations in the model. The resulting performance of each model will be saved. At the end, we’ll get a sense for the combination giving the best results.
delivery_tune_grid <- grid_regular(
trees(),
tree_depth(),
learn_rate(),
mtry(c(1, ceiling(sqrt(30)))),
levels = 4
)
head(delivery_tune_grid)
# A tibble: 6 × 4
trees tree_depth learn_rate mtry
<int> <int> <dbl> <int>
1 1 1 0.0000000001 1
2 667 1 0.0000000001 1
3 1333 1 0.0000000001 1
4 2000 1 0.0000000001 1
5 1 5 0.0000000001 1
6 667 5 0.0000000001 1
tail(delivery_tune_grid)
# A tibble: 6 × 4
trees tree_depth learn_rate mtry
<int> <int> <dbl> <int>
1 1333 10 0.1 6
2 2000 10 0.1 6
3 1 15 0.1 6
4 667 15 0.1 6
5 1333 15 0.1 6
6 2000 15 0.1 6
Model training
Workflow configuration
We next define a workflow that will take all these pieces and run them cohesively. We have a 1) recipe, 2) cross validation folds, 3) model, and 4) a hyperparameter tuning grid. Let’s integrate them.
delivery_wflw <-
workflow() |>
add_model(delivery_modl) |>
  add_recipe(delivery_rec)
Next, we’ll tune the model like so (we’ll set up parallel processing to make the tuning process go quicker):
doParallel::registerDoParallel(cores = 10)
set.seed(2015)
delivery_tune <-
delivery_wflw |>
tune_grid(
resamples = delivery_folds,
grid = delivery_tune_grid,
metrics = metric_set(rmse)
)
The model will calculate Root Mean Squared Error (\(RMSE\)), which will be used to determine the “best model” based on the hyperparameters used.
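For reference, over \(n\) observations with predictions \(\hat{y}_i\) and observed values \(y_i\), the metric is defined as:

```latex
RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Since it is the square root of the mean squared error, it is reported in the same units as the target (here, minutes), which makes it easy to interpret for delivery times.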
Tuning Evaluation
results_tune <- collect_metrics(delivery_tune) |> arrange(mean)
head(results_tune)
# A tibble: 6 × 10
mtry trees tree_depth learn_rate .metric .estimator mean n std_err
<int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 6 1333 5 0.1 rmse standard 2.05 5 0.0235
2 6 667 5 0.1 rmse standard 2.06 5 0.0219
3 6 2000 5 0.1 rmse standard 2.06 5 0.0227
4 4 1333 5 0.1 rmse standard 2.08 5 0.0262
5 4 2000 5 0.1 rmse standard 2.08 5 0.0249
6 4 667 5 0.1 rmse standard 2.12 5 0.0305
# ℹ 1 more variable: .config <chr>
tail(results_tune)
# A tibble: 6 × 10
mtry trees tree_depth learn_rate .metric .estimator mean n std_err
<int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 4 1333 15 0.0000000001 rmse standard 26.5 5 0.0383
2 4 2000 15 0.0000000001 rmse standard 26.5 5 0.0383
3 6 1 15 0.0000000001 rmse standard 26.5 5 0.0383
4 6 667 15 0.0000000001 rmse standard 26.5 5 0.0383
5 6 1333 15 0.0000000001 rmse standard 26.5 5 0.0383
6 6 2000 15 0.0000000001 rmse standard 26.5 5 0.0383
# ℹ 1 more variable: .config <chr>
We see the best models have an \(RMSE\) of approximately 2 while the worst models are in the 26 range. We can create a visualization that puts this all together:
ggplot(
results_tune |> pivot_longer(
c(mtry, trees, tree_depth, learn_rate)
) |> mutate(mean_bin = cut(mean, 10)),
aes(x = value, y = mean, color = mean_bin)
) +
geom_jitter(size = 5, alpha = 0.25) +
facet_wrap(~name, scales = "free") +
labs(
title = "Boosted trees hyperparameter values",
x = "Hyperparameter Value",
y = "RMSE",
color = "RMSE Group"
) +
theme_minimal() +
theme(
plot.title.position = "plot",
panel.background = element_rect(color = "#707271"),
strip.background = element_rect(fill = "#BE0000"),
strip.text = element_text(color = "white", face = "bold"),
legend.position = "top",
legend.justification.left = "top"
)
Each box corresponds to a hyperparameter. The x-axis tracks the hyperparameter value while the y-axis tracks \(RMSE\).
We’ll proceed to use the best model: \(mtry = 6\), \(trees = 1333\), \(tree\_depth = 5\), and \(learn\_rate = 0.1\).
Final model
Define final model
We can pull the final model by selecting the “best” based on RMSE:
best_modl <- delivery_tune |> select_best(metric = "rmse")
With that, we finalize our workflow:
final_wflw <- delivery_wflw |> finalize_workflow(best_modl)
Predict on testing
Let’s proceed to predict values using the tuned model from above but on the testing data set.
final_fit <- final_wflw |> last_fit(split = split_obj)
We can now capture performance measures from the final fit:
final_fit |> collect_metrics()
# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 2.22 Preprocessor1_Model1
2 rsq standard 0.902 Preprocessor1_Model1
We’re capturing an \(R^2\) of 0.902, meaning our model is explaining 90% of the variance on test data. This is shaping up to be a strong predictive model for time_to_delivery.
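To visualize this fit, we could plot predictions against observed values on the test set. A hedged sketch: collect_predictions() pulls the held-out predictions from the last_fit() object above.

```r
collect_predictions(final_fit) |>
  ggplot(aes(time_to_delivery, .pred)) +
  geom_point(alpha = 0.3, color = "#BE0000") +
  geom_abline(linetype = "dashed") + # Perfect-prediction reference line
  labs(
    title = "Predicted vs. observed delivery times (test set)",
    x = "Observed (minutes)",
    y = "Predicted (minutes)"
  ) +
  theme_minimal()
```

Points hugging the dashed line would corroborate the strong \(R^2\) and \(RMSE\) figures.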
Let’s move on to seeing what features are most important to predicting time_to_delivery.
Results
We already know that there are complex relationships, and given boosted trees are a pretty opaque model, we won’t be able to zero in on these relationships in a powerful way.
However, we can generate a plot of important features that could be helpful in my quest to learn which areas of restaurant operations would set up the delivery program for success.
final_wflw |>
fit(data = delivery_train) |>
extract_fit_parsnip() |>
vip(geom = "col", aesthetics = list(fill = "#BE0000")) +
labs(
title = "Important Features of Boosted Trees"
) +
theme_minimal() +
theme(
plot.title.position = "plot"
)
It’s clear from this plot that I should focus on operations depending on the hour of the day and on weekends.
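If we wanted the underlying importance scores as a table rather than a plot, {vip} also provides vi(). A sketch, assuming the same fitted workflow as above:

```r
final_wflw |>
  fit(data = delivery_train) |>
  extract_fit_parsnip() |>
  vip::vi() # A tibble of Variable / Importance, sorted descending
```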