library('tidyverse') # Wrapper for many convenient libraries
library('tidymodels') # Wrapper for modeling libraries
library('modeldata') # Contains data for the assignment
library('skimr') # Quickly summarise data
library('gt') # Render "great tables"
library('nnet') # Loading neural network library
library('VIM') # Nearest neighbors imputation
library('themis') # Library for generating a balanced data set ROSE
library('caret') # Miscellaneous modeling functionsNeural Network | Palmer Penguin Data
Assignment Questions
Name
What is your name? Include all team members if submitting as a group.
Adam Bushman [u6049169]; no group members.
Perspective
From what perspective ar you conducting the analysis? (Who are you? / Who are you working for?)
I am a researcher working to classify some penguin specie observations as part of a research project on the heals of the Palmer Archipelago expedition from which the source data was collected.
Question
What is your question?
Using the source data from the Palmer Archipelago, can our research group develop an accurate neural network model for classifying species of penguin?
This is important to our research group in efforts to personalize protection efforts by species.
Data set
Describe your dataset(s) including URL (if available)
Data sourced from Posit via their {modeldata} package, accessed November 25th, 2024. The data was originally sourced from {palmerpenguins} package. Described as “A data set from Gorman, Williams, and Fraser (2014) containing measurements from different types of penguins.” Full citation below:
Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.
Predictor(s) and target(s)
What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).
The dependent variable (target) for this use-case to this situation is species, a categorical feature. The remaining variables are suitable as independent variables (predictors):
- Numeric:
bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g - Categorical:
islandandsex
For a complete description of each feature, navigate to the data dictionary.
Model resonability
How are your variables suitable for your analysis method?
The variables chosen above are the extent of the dataset (all features used). The analysis method is for a neural network classification model. These variables and the analysis method are a natural pairing thanks to:
- Multiclass classification
- There are likely to be some complex relationships to explain differences in
species
- There are likely to be some complex relationships to explain differences in
- Imbalanced data
- While we can and did test our results with balanced data, a neural network is more insensitive to this issue
Conclusions
What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?
Model effectiveness
The model was highly effective at classifying unknown penguin species, achieving near perfect accuracy and AUC values.
We performed extensive hyperparameter tuning and cross-validation testing. We found multiple combinations yielded similar results.
Assumptions
What are your assumptions and limitations? Did you include any robustness checks?
Assumptions made throughout the analysis
- First assumption made was that neural network is the “proper” choice; likely there are simpler models that provide other benefits that could be just as robust. However, we’re looking to practice.
- Next assumption was that imputation of missing values was feasible. Because we know these are like-species and the values can’t actually be missing or non-existent, it felt reasonable to impute via kNN.
Robustness checks
- We performed extensive resampling over a long hyperparameter tuning grid and multiple cross validation folds.
- We balanced the data using SMOTE to see if results improved/changed. They did not, therefore confidence in results was increased.
- Though not completed in this workbook, another way to check robustness would have been classification via a random forest model. Assuming we got highly performant AUC and F-measure values, we would be extra confident in the results.
Assignment Workflow
Analysis Prep
Loading packages
Loading the data
We’ll start off by referencing the “Palmer Penguin” data for the assignment from the {modeldata} package.
peng_raw <- modeldata::penguins # Data for the assignmentWith it loaded into the session, let’s get a sense for what we’re working with.
Data set inspection
Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.
Posit, the company behind {modeldata}, did not include a data dictionary; however, the source package, {palmerpenguins}, does feature a data dictionary. It is included below:
gt(peng_data_dict) %>% # Create a "great tables" (gt) object
cols_label( # Rename some columns
variable = "Variable Name",
datatype = "Data Type",
description = "Variable Description"
) %>%
tab_options(
column_labels.background.color = "#d9230f", # Apply red header background
column_labels.font.weight = "bold", # Bold headers
row.striping.background = '#FFFFFF' # Remove row striping
)| Variable Name | Data Type | Variable Description |
|---|---|---|
| species | <factor> | Penguin species (Adelie, Chinstrap, Gentoo) |
| island | <factor> | Island in Palmer Archipelago, Antartica (biscoe, Dream, or Torgerssen) |
| bill_length_mm | <numeric> | Bill length (millimeters) |
| bill_depth_mm | <numeric> | Bill depth (millimeters) |
| flipper_length_mm | <integer> | Flipper length (millimeters) |
| body_mass_g | <integer> | Body mass (grams) |
| sex | <factor> | Penguin sex (female, male) |
Using the {skimr} package, we can get a comprehensive summary of the data.
skim(peng_raw)| Name | peng_raw |
| Number of rows | 344 |
| Number of columns | 7 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
Initial observations include:
- We have 344 rows and 7 columns; while a fairly small dataset, our neural network should be able to fit pretty fast
- We have a couple missing values, namely
sex(11 missing observations) and all of our numerics (2 missing observations); this means we’ll have to do some imputation - We won’t really have any skewed distributions
- All our variables appear to be in the right data type
- There’s not a significant imbalance in classes
sexis about evenspeciesandislandhave roughly 45-35-20 balancing each; one robustness check we can make later is to balancespeciesin the training data
Preprocessing
Data cleaning
We’ve already mentioned the need to impute some values, specifically with sex, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
Given the narrow scope of this data and some finite categories like species and island, we should be able to impute using nearest neighbors instead of defaulting to the “mean” (in the case of the numerics) or “majority class” (in the case of sex). We’ll use the {VIM} package and its kNN() function:
peng_imputed <- kNN(
peng_raw,
variable = c("sex", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
k = 5
)What we’ve said here is we’re going to impute missing values for the 5 variables using the 5 nearest neighbors to each missing value. Let’s double check peng_imputed has no missing values:
sapply(peng_imputed, function(x) sum(is.na(x))) species island bill_length_mm
0 0 0
bill_depth_mm flipper_length_mm body_mass_g
0 0 0
sex sex_imp bill_length_mm_imp
0 0 0
bill_depth_mm_imp flipper_length_mm_imp body_mass_g_imp
0 0 0
Perfect! Now, let’s peak at the missing values for column:
get_imputed <- function(x) {
peng_imputed |>
filter({{ x }}) |>
select(-c(sex_imp:body_mass_g_imp))
}print(get_imputed(sex_imp)) species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 41.8 19.4 195 4300
2 Adelie Torgersen 34.1 18.1 193 3475
3 Adelie Torgersen 42.0 20.2 190 4250
4 Adelie Torgersen 37.8 17.1 186 3300
5 Adelie Torgersen 37.8 17.3 180 3700
6 Adelie Dream 37.5 18.9 179 2975
7 Gentoo Biscoe 44.5 14.3 216 4100
8 Gentoo Biscoe 46.2 14.4 214 4650
9 Gentoo Biscoe 47.3 13.8 216 4725
10 Gentoo Biscoe 44.5 15.7 217 4875
11 Gentoo Biscoe 52.1 16.0 222 5550
sex
1 male
2 female
3 male
4 female
5 female
6 female
7 female
8 female
9 female
10 male
11 male
Looks like this predicted 4 male and 7 female. These missing values were all from two species and mostly two islands. A nearest neighbor approach in the data was a good idea since the rate of missing values wasn’t consistent across all groups and distributions.
print(get_imputed(bill_length_mm_imp)) species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1 Adelie Torgersen 41.8 19.4 195 4300
2 Gentoo Biscoe 52.1 16.0 222 5550
sex
1 male
2 male
We imputed two missing values for bill_length_mm. These observations were from completely diferent species and islands. The mean value we may have imputed is 43.9, which likely would have lost some relationship value. The conditional mean by species at 38.8 and 47.5 do confirm our spread is about right. Unknown if that would have been a better value. We could test a model’s sensitivity to this imputation later.
We could do the same for each of the remaining instances for which we imputed data. But this is satisfactory enough. Now, we’ll just remove the flags and save the variables:
peng_clean <- peng_imputed |> select(-c(sex_imp:body_mass_g_imp))Balancing
We don’t have “significant” imbalance in our target class, but a 45-35-20 isn’t ideal either. It’s possible a neural network won’t be sensitive to the imbalance. We’ll test this as part of our robustness checks so we’ll generate another set of data balanced with Synthetic Minority Oversampling Technique or SMOTE.
We’ll save this for later when we generate a “recipe”. We’ll use the {themis} package via step_smote().
Model resources
Splitting for training/testing sets
As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.
split_obj <- initial_split(peng_clean, prop = 0.75) # Split object
peng_train <- training(split_obj) # Split for training
peng_test <- testing(split_obj) # Split for testingTraining and tuning
Recipe
We’re going to practice using the {tidymodels} framework for this modeling exercise. There’s a lot of neat features we can make use of, first of which is a “recipe”. A recipe will define the inputs/outputs, as well as any preprocessing necessary for the model.
This is our main recipe:
preg_recipe <- recipe(species ~ ., peng_train) |>
step_dummy(island, sex)The recipe now knows what our modeling formula is and is prepped for use of the training data. Let’s also generate a recipe featuring the step_smote():
preg_recipe_02 <- recipe(species ~ ., peng_train) |>
step_dummy(island, sex) |>
step_smote(species, over_ratio = 0.855)Next, we’ll set up some cross validation.
Cross validation tuning
Cross validation is important to ensure the results of a single model aren’t an outlier. We’d hate to get a singular amazing result by chance that didn’t align with the true range of performance. For the neural net, we want to try different tuning values; for “nnet” specifically, we’re going to tune hidden_units (the nodes in the hidden layer), penalty (amount of regularization), and epochs (or number of cycles the model will run during fitting).
We’ll generate a tuning grid for these hyperparameters (a table with the various combinations we can use for each model):
peng_tune_grid <- grid_regular(
hidden_units(),
penalty(),
epochs(),
levels = 4
)
head(peng_tune_grid)# A tibble: 6 × 3
hidden_units penalty epochs
<int> <dbl> <int>
1 1 0.0000000001 10
2 4 0.0000000001 10
3 7 0.0000000001 10
4 10 0.0000000001 10
5 1 0.000000215 10
6 4 0.000000215 10
Let’s now create our cross validation folds. We set v = 5 to have 5 folds. This means we’ll run a model for each combination of hyperparameters 5 times. The average (or mean) performance metric will be our result:
peng_folds <- rsample::vfold_cv(peng_train, v = 5)
peng_folds# 5-fold cross-validation
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [206/52]> Fold1
2 <split [206/52]> Fold2
3 <split [206/52]> Fold3
4 <split [207/51]> Fold4
5 <split [207/51]> Fold5
Model setup
Let’s begin the modeling phase by setting up the model definition. We specify this model will be a classification neural network using nnet and tuning hidden_units, penalty, and epochs. This is nicely explained for us in the “translation”.
nn_mod01 <- mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()
) |>
set_engine("nnet") |>
set_mode("classification")
nn_mod01 |> translate()Single Layer Neural Network Model Specification (classification)
Main Arguments:
hidden_units = tune()
penalty = tune()
epochs = tune()
Computational engine: nnet
Model fit template:
nnet::nnet(formula = missing_arg(), data = missing_arg(), size = tune(),
decay = tune(), maxit = tune(), trace = FALSE, linout = FALSE)
We’ll reuse this same definition for an imbalanced and balanced cross-validated training set. A “workflow” will tie these pieces together.
Model training
Workflow configuration
We next define a workflow that will take all these pieces and run them cohesively. We have a 1) recipe, 2) cross validation folds, 3) model, and 4) a hyperparameter tuning grid. Let’s integrate them.
peng_wflw <-
workflow() |>
add_model(nn_mod01) |>
add_recipe(preg_recipe)Next, we’ll tune the model like so (we’ll setup parallel processing to make the tuning process go quicker):
doParallel::registerDoParallel(cores = 3)
set.seed(2015)
peng_tune <-
peng_wflw |>
tune_grid(
resamples = peng_folds,
grid = peng_tune_grid,
metrics = metric_set(roc_auc, f_meas)
)The model will calculate both ROC/AUC and F-measure values which will help in determining the “best model” based on hyperparameters used.
Tuning evaluation
It’s time to see how the model did. We “collect metrics” and then arrange them in order
results_tune <- collect_metrics(peng_tune) |> arrange(desc(mean))
head(results_tune)# A tibble: 6 × 9
hidden_units penalty epochs .metric .estimator mean n std_err .config
<int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 10 0.000464 1000 roc_auc hand_till 1 5 0 Preproce…
2 4 1 340 roc_auc hand_till 1.00 5 0.000358 Preproce…
3 7 1 340 roc_auc hand_till 1.00 5 0.000358 Preproce…
4 10 1 340 roc_auc hand_till 1.00 5 0.000358 Preproce…
5 4 1 670 roc_auc hand_till 1.00 5 0.000358 Preproce…
6 7 1 670 roc_auc hand_till 1.00 5 0.000358 Preproce…
It looks like the top models had both an AUC and F-measure near 1. That indicates the model is practically perfect at classifying unseen penguin species based on what it has learned from the relationship between species and the independent variables.
If we look at the bottom of the data, we see instances where the performance measures are near 0.5 (or no better than random/majority class assignment).
tail(results_tune)# A tibble: 6 × 9
hidden_units penalty epochs .metric .estimator mean n std_err .config
<int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 7 0.000000215 1000 roc_auc hand_till 0.5 5 0 Prepro…
2 10 0.000000215 1000 roc_auc hand_till 0.5 5 0 Prepro…
3 1 0.000464 1000 roc_auc hand_till 0.5 5 0 Prepro…
4 7 1 10 roc_auc hand_till 0.499 5 0.0416 Prepro…
5 1 1 340 roc_auc hand_till 0.456 5 0.0161 Prepro…
6 10 1 10 roc_auc hand_till 0.359 5 0.0375 Prepro…
The hyperparameters for these poor models make sense: very few hidden layers, virtually no penalty, and very few iterations. Let’s see if we can’t visualze the intersection of these results:
ggplot(
results_tune,
aes(hidden_units, penalty, color = .metric, size = mean)
) +
geom_jitter(alpha = 0.6)ggplot(
results_tune,
aes(epochs, penalty, color = .metric, size = mean)
) +
geom_jitter(alpha = 0.6)ggplot(
results_tune,
aes(epochs, hidden_units, color = .metric, size = mean)
) +
geom_jitter(alpha = 0.6)These aren’t perfect visuals but they do get to the main takeaway: there’s no single, magic combination of hyperparameters. Any with moderate to high values across each will do just fine.
If we take the averages of the top models, we get the following:
results_tune |>
filter(mean == max(mean)) |>
summarise(
hidden_units_avg = median(hidden_units),
penalty_avg = mean(penalty),
epochs_avg = median(epochs)
)# A tibble: 1 × 3
hidden_units_avg penalty_avg epochs_avg
<int> <dbl> <int>
1 10 0.000464 1000
This combination of hyperparameter values is probably just as good as any we found among the best models.
Balanced data set
We can do the same thing with the balanced data. In theory, we should get similar to even better results. We need a new workflow using the same model but with the second recipe:
peng_wflw_02 <-
workflow() |>
add_model(nn_mod01) |>
add_recipe(preg_recipe_02)Now we tune the model’s hyperparameters:
peng_tune_02 <-
peng_wflw_02 |>
tune_grid(
resamples = peng_folds,
grid = peng_tune_grid,
metrics = metric_set(roc_auc, f_meas)
)Balanced tuning evaluation
results_tune02 <- collect_metrics(peng_tune_02) |> arrange(desc(mean))
head(results_tune02)# A tibble: 6 × 9
hidden_units penalty epochs .metric .estimator mean n std_err .config
<int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 7 1 340 roc_auc hand_till 1.00 5 0.000358 Preproces…
2 10 1 340 roc_auc hand_till 1.00 5 0.000358 Preproces…
3 4 1 670 roc_auc hand_till 1.00 5 0.000358 Preproces…
4 7 1 670 roc_auc hand_till 1.00 5 0.000358 Preproces…
5 10 1 670 roc_auc hand_till 1.00 5 0.000358 Preproces…
6 4 1 1000 roc_auc hand_till 1.00 5 0.000358 Preproces…
As it turns out, a neural network model on the balanced data is resulting just as performant as the imbalanced data. This is some confirmation of the model’s robustness.
Final prediction
Model confirmation
Having thoroughly explored and tested models up to this point, let’s confirm what we’d like to proceed with.
#peng_wflw <- update_model(peng_wflw, nn_mod01)
final_fit <- peng_wflw |>
finalize_workflow(tibble(
hidden_units = 7,
penalty = 1.0,
epochs = 670
)) |>
last_fit(split = split_obj)final_fit |> collect_metrics() |> head(2)# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 accuracy multiclass 1 Preprocessor1_Model1
2 roc_auc hand_till 1 Preprocessor1_Model1
Our finalized model on the last fit indicates a very performant model. Near perfect classification of unknown species.