Neural Network | Palmer Penguin Data

Leveraging neural network classification in R to predict penguin species
Author

Adam Bushman

Published

November 25, 2024

Assignment Questions

Name

What is your name? Include all team members if submitting as a group.

Adam Bushman [u6049169]; no group members.

Perspective

From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)

I am a researcher working to classify penguin species observations as part of a research project on the heels of the Palmer Archipelago expedition from which the source data was collected.

Question

What is your question?

Using the source data from the Palmer Archipelago, can our research group develop an accurate neural network model for classifying species of penguin?

This is important to our research group in its efforts to tailor protection efforts by species.

Data set

Describe your dataset(s) including URL (if available)

Data sourced from Posit via their {modeldata} package, accessed November 25th, 2024. The data was originally sourced from the {palmerpenguins} package. Described as “A data set from Gorman, Williams, and Fraser (2014) containing measurements from different types of penguins.” Full citation below:

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

Predictor(s) and target(s)

What is (are) your independent variable(s) and dependent variable(s)? Include variable type (binary, categorical, numeric).

The dependent variable (target) for this use case is species, a categorical feature. The remaining variables are suitable as independent variables (predictors):

  • Numeric: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
  • Categorical: island and sex

For a complete description of each feature, navigate to the data dictionary.

Model reasonability

How are your variables suitable for your analysis method?

The variables chosen above span the full dataset (all features are used). The analysis method is a neural network classification model. These variables and the analysis method are a natural pairing thanks to:

  • Multiclass classification
    • There are likely to be some complex relationships to explain differences in species
  • Imbalanced data
    • While we can and did test our results with balanced data, a neural network is relatively insensitive to this issue

Conclusions

What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

Model effectiveness

The model was highly effective at classifying unseen penguin observations, achieving near-perfect accuracy and AUC values.

We performed extensive hyperparameter tuning and cross-validation testing. We found multiple combinations yielded similar results.

Assumptions

What are your assumptions and limitations? Did you include any robustness checks?

Assumptions made throughout the analysis

  • The first assumption made was that a neural network is the “proper” choice; there are likely simpler models that could be just as robust while providing other benefits. However, we’re looking to practice.
  • The next assumption was that imputation of missing values was feasible. Because observations of the same species are alike and these measurements can’t actually be non-existent, it felt reasonable to impute via kNN.

Robustness checks

  • We performed extensive resampling across a large hyperparameter tuning grid and multiple cross-validation folds.
  • We balanced the data using SMOTE to see if results improved/changed. They did not, therefore confidence in results was increased.
  • Though not completed in this workbook, another way to check robustness would have been classification via a random forest model (a sketch follows below). Assuming we got highly performant AUC and F-measure values, we would be extra confident in the results.
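
For reference, a minimal sketch of what that random forest check might look like with {tidymodels}, reusing the recipe (preg_recipe) and split (split_obj) defined in the workflow below; this is an illustration only, not something run in this workbook:

rf_mod <- rand_forest(trees = 500) |>    # A reasonable default number of trees
    set_engine("ranger") |>              # Assumes the {ranger} package is installed
    set_mode("classification")

workflow() |>
    add_model(rf_mod) |>
    add_recipe(preg_recipe) |>
    last_fit(split = split_obj, metrics = metric_set(roc_auc, f_meas)) |>
    collect_metrics()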

Assignment Workflow

Analysis Prep

Loading packages

library('tidyverse')        # Wrapper for many convenient libraries
library('tidymodels')       # Wrapper for modeling libraries
library('modeldata')        # Contains data for the assignment
library('skimr')            # Quickly summarise data
library('gt')               # Render "great tables"

library('nnet')             # Loading neural network library
library('VIM')              # Nearest neighbors imputation
library('themis')           # Recipe steps for balancing data (e.g., SMOTE)
library('caret')            # Miscellaneous modeling functions

Loading the data

We’ll start off by referencing the “Palmer Penguin” data for the assignment from the {modeldata} package.

peng_raw <- modeldata::penguins        # Data for the assignment

With it loaded into the session, let’s get a sense for what we’re working with.

Data set inspection

Right away, I like to get acquainted with the data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc.

Posit, the company behind {modeldata}, did not include a data dictionary; however, the source package, {palmerpenguins}, does feature a data dictionary. It is included below:
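
The gt code below references peng_data_dict, a small tibble assembled by hand. A sketch of how it might be built, transcribed from the {palmerpenguins} documentation:

peng_data_dict <- tibble::tribble(
    ~variable,           ~datatype,   ~description,
    "species",           "<factor>",  "Penguin species (Adelie, Chinstrap, Gentoo)",
    "island",            "<factor>",  "Island in Palmer Archipelago, Antarctica (Biscoe, Dream, or Torgersen)",
    "bill_length_mm",    "<numeric>", "Bill length (millimeters)",
    "bill_depth_mm",     "<numeric>", "Bill depth (millimeters)",
    "flipper_length_mm", "<integer>", "Flipper length (millimeters)",
    "body_mass_g",       "<integer>", "Body mass (grams)",
    "sex",               "<factor>",  "Penguin sex (female, male)"
)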

gt(peng_data_dict) %>%                                  # Create a "great tables" (gt) object
    cols_label(                                         # Rename some columns
        variable = "Variable Name", 
        datatype = "Data Type", 
        description = "Variable Description"
    ) %>%
    tab_options(
        column_labels.background.color = "#d9230f",     # Apply red header background
        column_labels.font.weight = "bold",             # Bold headers
        row.striping.background_color = '#FFFFFF'  # Remove visible row striping
    )
Variable Name       Data Type   Variable Description
species             <factor>    Penguin species (Adelie, Chinstrap, Gentoo)
island              <factor>    Island in Palmer Archipelago, Antarctica (Biscoe, Dream, or Torgersen)
bill_length_mm      <numeric>   Bill length (millimeters)
bill_depth_mm       <numeric>   Bill depth (millimeters)
flipper_length_mm   <integer>   Flipper length (millimeters)
body_mass_g         <integer>   Body mass (grams)
sex                 <factor>    Penguin sex (female, male)

Using the {skimr} package, we can get a comprehensive summary of the data.

skim(peng_raw)
Data summary
Name peng_raw
Number of rows 344
Number of columns 7
_______________________
Column type frequency:
factor 3
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂


Initial observations include:

  • We have 344 rows and 7 columns; while this is a fairly small dataset, our neural network should fit pretty fast
  • We have a few missing values, namely in sex (11 missing observations) and each of our numerics (2 missing observations apiece); this means we’ll have to do some imputation
  • None of the numeric distributions appear heavily skewed
  • All our variables appear to be in the right data type
  • There’s not a significant imbalance in classes
    • sex is about even
    • species and island each show a roughly 45-35-20 split (verified with the quick check below); one robustness check we can make later is to balance species in the training data
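
As a quick check on that species split (supplementary; not part of the original output):

peng_raw |>
    count(species) |>                        # Tally observations per species
    mutate(prop = round(n / sum(n), 2))      # Share of the total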

Preprocessing

Data cleaning

We’ve already mentioned the need to impute some values, specifically with sex, bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

Given the narrow scope of this data and some finite categories like species and island, we should be able to impute using nearest neighbors instead of defaulting to the “mean” (in the case of the numerics) or “majority class” (in the case of sex). We’ll use the {VIM} package and its kNN() function:

peng_imputed <- kNN(
    peng_raw, 
    variable = c("sex", "bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"), 
    k = 5
)

What we’ve said here is we’re going to impute missing values for the 5 variables using the 5 nearest neighbors to each missing value. Let’s double check peng_imputed has no missing values:

sapply(peng_imputed, function(x) sum(is.na(x))) 
              species                island        bill_length_mm 
                    0                     0                     0 
        bill_depth_mm     flipper_length_mm           body_mass_g 
                    0                     0                     0 
                  sex               sex_imp    bill_length_mm_imp 
                    0                     0                     0 
    bill_depth_mm_imp flipper_length_mm_imp       body_mass_g_imp 
                    0                     0                     0 

Perfect! Now, let’s peek at the rows that were imputed, one flag column at a time:

get_imputed <- function(x) {
    peng_imputed |> 
        filter({{ x }}) |>                       # Keep rows flagged as imputed for this variable
        select(-c(sex_imp:body_mass_g_imp))      # Drop the imputation flag columns for display
}
print(get_imputed(sex_imp))
   species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1   Adelie Torgersen           41.8          19.4               195        4300
2   Adelie Torgersen           34.1          18.1               193        3475
3   Adelie Torgersen           42.0          20.2               190        4250
4   Adelie Torgersen           37.8          17.1               186        3300
5   Adelie Torgersen           37.8          17.3               180        3700
6   Adelie     Dream           37.5          18.9               179        2975
7   Gentoo    Biscoe           44.5          14.3               216        4100
8   Gentoo    Biscoe           46.2          14.4               214        4650
9   Gentoo    Biscoe           47.3          13.8               216        4725
10  Gentoo    Biscoe           44.5          15.7               217        4875
11  Gentoo    Biscoe           52.1          16.0               222        5550
      sex
1    male
2  female
3    male
4  female
5  female
6  female
7  female
8  female
9  female
10   male
11   male

Looks like the imputation predicted 4 male and 7 female. These missing values all came from two species and mostly two islands. A nearest neighbors approach was a good idea here since the rate of missing values wasn’t consistent across groups and distributions.

print(get_imputed(bill_length_mm_imp))
  species    island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
1  Adelie Torgersen           41.8          19.4               195        4300
2  Gentoo    Biscoe           52.1          16.0               222        5550
   sex
1 male
2 male

We imputed two missing values for bill_length_mm. These observations were from completely different species and islands. The overall mean we might otherwise have imputed is 43.9, which likely would have washed out some of the relationship. The conditional means by species, at 38.8 and 47.5, confirm our imputed spread is about right, though it’s unknown whether those would have been better values. We could test a model’s sensitivity to this imputation later.
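
Those conditional means can be reproduced with a quick grouped summary (a verification step, not part of the original pipeline):

peng_raw |>
    filter(!is.na(bill_length_mm)) |>                       # Use only observed values
    group_by(species) |>
    summarise(mean_bill_length = mean(bill_length_mm))      # Conditional mean per species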

We could do the same for each of the remaining instances for which we imputed data. But this is satisfactory enough. Now, we’ll just remove the flags and save the variables:

peng_clean <- peng_imputed |> select(-c(sex_imp:body_mass_g_imp))

Balancing

We don’t have “significant” imbalance in our target class, but a 45-35-20 split isn’t ideal either. It’s possible a neural network won’t be sensitive to the imbalance. We’ll test this as part of our robustness checks by generating another data set balanced with the Synthetic Minority Oversampling Technique (SMOTE).

We’ll save this for later when we generate a “recipe”. We’ll use the {themis} package via step_smote().

Model resources

Splitting for training/testing sets

As is customary, we must split our data into training and testing sets. We’ll put aside the testing set and work with training until we’re comfortable with our model definition. The {rsample} package has a lot of helpful functions for this workflow. In just a couple lines we get training and testing sets.

split_obj <- initial_split(peng_clean, prop = 0.75)              # Split object

peng_train <- training(split_obj)                                # Split for training
peng_test <- testing(split_obj)                                  # Split for testing

Training and tuning

Recipe

We’re going to practice using the {tidymodels} framework for this modeling exercise. There are a lot of neat features we can make use of, the first of which is a “recipe”. A recipe defines the inputs/outputs, as well as any preprocessing necessary for the model.

This is our main recipe:

preg_recipe <- recipe(species ~ ., peng_train) |>
    step_dummy(island, sex)

The recipe now knows our modeling formula and is set up to use the training data.
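
If we want to verify the dummy coding, we can prep and bake the recipe (a quick sanity check; output omitted here):

preg_recipe |>
    prep() |>                    # Estimate the preprocessing steps on the training data
    bake(new_data = NULL) |>     # Apply them back to the training data
    glimpse()

Let’s also generate a recipe featuring the step_smote():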

preg_recipe_02 <- recipe(species ~ ., peng_train) |>
    step_dummy(island, sex) |>
    step_smote(species, over_ratio = 0.855)

Next, we’ll set up some cross validation.

Cross validation tuning

Cross validation is important to ensure the results of a single model aren’t an outlier. We’d hate to get a singular amazing result by chance that didn’t align with the true range of performance. For the neural net, we want to try different tuning values; for “nnet” specifically, we’re going to tune hidden_units (the number of nodes in the hidden layer), penalty (the amount of regularization), and epochs (the number of training iterations).

We’ll generate a tuning grid for these hyperparameters (a table with the various combinations we can use for each model):

peng_tune_grid <- grid_regular(
    hidden_units(), 
    penalty(), 
    epochs(), 
    levels = 4
)

head(peng_tune_grid)
# A tibble: 6 × 3
  hidden_units      penalty epochs
         <int>        <dbl>  <int>
1            1 0.0000000001     10
2            4 0.0000000001     10
3            7 0.0000000001     10
4           10 0.0000000001     10
5            1 0.000000215      10
6            4 0.000000215      10

Let’s now create our cross validation folds. We set v = 5 to have 5 folds. This means we’ll run a model for each combination of hyperparameters 5 times. The average (or mean) performance metric will be our result:

peng_folds <- rsample::vfold_cv(peng_train, v = 5)
peng_folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [206/52]> Fold1
2 <split [206/52]> Fold2
3 <split [206/52]> Fold3
4 <split [207/51]> Fold4
5 <split [207/51]> Fold5

Model setup

Let’s begin the modeling phase by setting up the model definition. We specify this model will be a classification neural network using nnet and tuning hidden_units, penalty, and epochs. This is nicely explained for us in the “translation”.

nn_mod01 <- mlp(
    hidden_units = tune(), 
    penalty = tune(), 
    epochs = tune()
) |>
    set_engine("nnet") |>
    set_mode("classification")

nn_mod01 |> translate()
Single Layer Neural Network Model Specification (classification)

Main Arguments:
  hidden_units = tune()
  penalty = tune()
  epochs = tune()

Computational engine: nnet 

Model fit template:
nnet::nnet(formula = missing_arg(), data = missing_arg(), size = tune(), 
    decay = tune(), maxit = tune(), trace = FALSE, linout = FALSE)

We’ll reuse this same definition for both the imbalanced and balanced cross-validated training sets. A “workflow” will tie these pieces together.

Model training

Workflow configuration

We next define a workflow that will take all these pieces and run them cohesively. We have a 1) recipe, 2) cross validation folds, 3) model, and 4) a hyperparameter tuning grid. Let’s integrate them.

peng_wflw <-
    workflow() |>
    add_model(nn_mod01) |>
    add_recipe(preg_recipe)

Next, we’ll tune the model like so (we’ll set up parallel processing to make tuning go quicker):

doParallel::registerDoParallel(cores = 3)

set.seed(2015)
peng_tune <-
    peng_wflw |>
    tune_grid(
        resamples = peng_folds, 
        grid = peng_tune_grid, 
        metrics = metric_set(roc_auc, f_meas)
    )

Tuning will calculate both ROC AUC and F-measure values, which will help in determining the “best” model based on the hyperparameters used.

Tuning evaluation

It’s time to see how the model did. We “collect metrics” and then arrange them in descending order of mean performance:

results_tune <- collect_metrics(peng_tune) |> arrange(desc(mean))
head(results_tune)
# A tibble: 6 × 9
  hidden_units  penalty epochs .metric .estimator  mean     n  std_err .config  
         <int>    <dbl>  <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>    
1           10 0.000464   1000 roc_auc hand_till   1        5 0        Preproce…
2            4 1           340 roc_auc hand_till   1.00     5 0.000358 Preproce…
3            7 1           340 roc_auc hand_till   1.00     5 0.000358 Preproce…
4           10 1           340 roc_auc hand_till   1.00     5 0.000358 Preproce…
5            4 1           670 roc_auc hand_till   1.00     5 0.000358 Preproce…
6            7 1           670 roc_auc hand_till   1.00     5 0.000358 Preproce…
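
As a shortcut, {tune}’s show_best() surfaces these same top rows directly (output omitted here):

show_best(peng_tune, metric = "roc_auc", n = 5)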

It looks like the top models had both an AUC and F-measure near 1. That indicates the model is practically perfect at classifying unseen penguin species based on what it has learned from the relationship between species and the independent variables.

If we look at the bottom of the data, we see instances where the performance measures are near 0.5 (or no better than random/majority class assignment).

tail(results_tune)
# A tibble: 6 × 9
  hidden_units     penalty epochs .metric .estimator  mean     n std_err .config
         <int>       <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
1            7 0.000000215   1000 roc_auc hand_till  0.5       5  0      Prepro…
2           10 0.000000215   1000 roc_auc hand_till  0.5       5  0      Prepro…
3            1 0.000464      1000 roc_auc hand_till  0.5       5  0      Prepro…
4            7 1               10 roc_auc hand_till  0.499     5  0.0416 Prepro…
5            1 1              340 roc_auc hand_till  0.456     5  0.0161 Prepro…
6           10 1               10 roc_auc hand_till  0.359     5  0.0375 Prepro…

The hyperparameters for these poor models make sense: few hidden units, virtually no regularization, or very few epochs. Let’s see if we can’t visualize the intersection of these results:

ggplot(
    results_tune, 
    aes(hidden_units, penalty, color = .metric, size = mean)
) +
    geom_jitter(alpha = 0.6)

ggplot(
    results_tune, 
    aes(epochs, penalty, color = .metric, size = mean)
) +
    geom_jitter(alpha = 0.6)

ggplot(
    results_tune, 
    aes(epochs, hidden_units, color = .metric, size = mean)
) +
    geom_jitter(alpha = 0.6)

These aren’t perfect visuals, but they do get to the main takeaway: there’s no single, magic combination of hyperparameters. Any with moderate-to-high values across each will do just fine.

If we take the averages of the top models, we get the following:

results_tune |>
    filter(mean == max(mean)) |>
    summarise(
        hidden_units_avg = median(hidden_units), 
        penalty_avg = mean(penalty), 
        epochs_avg = median(epochs)
    )
# A tibble: 1 × 3
  hidden_units_avg penalty_avg epochs_avg
             <int>       <dbl>      <int>
1               10    0.000464       1000

This combination of hyperparameter values is probably just as good as any we found among the best models.
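
Alternatively, {tune}’s select_best() would pull the single top combination directly, which could feed straight into finalize_workflow() later; a sketch of that route (not the approach used here):

best_params <- select_best(peng_tune, metric = "roc_auc")    # Top row by ROC AUC
best_params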

Balanced data set

We can do the same thing with the balanced data. In theory, we should get similar or even better results. We need a new workflow using the same model but with the second recipe:

peng_wflw_02 <-
    workflow() |>
    add_model(nn_mod01) |>
    add_recipe(preg_recipe_02)

Now we tune the model’s hyperparameters:

peng_tune_02 <-
    peng_wflw_02 |>
    tune_grid(
        resamples = peng_folds, 
        grid = peng_tune_grid, 
        metrics = metric_set(roc_auc, f_meas)
    )

Balanced tuning evaluation

results_tune02 <- collect_metrics(peng_tune_02) |> arrange(desc(mean))

head(results_tune02)
# A tibble: 6 × 9
  hidden_units penalty epochs .metric .estimator  mean     n  std_err .config   
         <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>     
1            7       1    340 roc_auc hand_till   1.00     5 0.000358 Preproces…
2           10       1    340 roc_auc hand_till   1.00     5 0.000358 Preproces…
3            4       1    670 roc_auc hand_till   1.00     5 0.000358 Preproces…
4            7       1    670 roc_auc hand_till   1.00     5 0.000358 Preproces…
5           10       1    670 roc_auc hand_till   1.00     5 0.000358 Preproces…
6            4       1   1000 roc_auc hand_till   1.00     5 0.000358 Preproces…

As it turns out, a neural network model on the balanced data performs just as well as on the imbalanced data. This is some confirmation of the model’s robustness.

Final prediction

Model confirmation

Having thoroughly explored and tested models up to this point, let’s confirm what we’d like to proceed with.

final_fit <- peng_wflw |> 
    finalize_workflow(tibble(
        hidden_units = 7, 
        penalty = 1.0, 
        epochs = 670
    )) |> 
    last_fit(split = split_obj)
final_fit |> collect_metrics() |> head(2)
# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy multiclass         1 Preprocessor1_Model1
2 roc_auc  hand_till          1 Preprocessor1_Model1

Our finalized model on the last fit is very performant, with near-perfect classification of unseen species.
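
To make that concrete, a confusion matrix on the held-out test predictions would show the per-species breakdown (a supplementary check; output omitted here):

final_fit |>
    collect_predictions() |>                             # Test-set predictions from last_fit()
    conf_mat(truth = species, estimate = .pred_class)    # Cross-tab of actual vs. predicted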