Redefining NBA positions and classifying incoming NBA prospects

Leveraging multiple machine learning methods in R to derive new NBA positions based on player roles/styles, and using those groupings to classify incoming NBA prospects.
Author

Adam Bushman

Published

December 15, 2024

Prompts

Prompt #1

Which machine learning methods did you implement?

I chose to implement the following machine learning methods in my final project:

  1. Principal Component Analysis (PCA)
  2. KMeans Clustering
  3. Hierarchical Clustering
  4. Elastic Net Logistic Regression
  5. Neural Network
  6. Boosted Trees

Prompt #2

Discuss the key contribution of each method to your analysis. If a method didn’t contribute, discuss why it didn’t. A sentence or two for each method is plenty.

  1. Principal Component Analysis (PCA): Used for the purpose of more performant clustering. Reduced dimensionality of source data from 195 features to 57 principal components while retaining 95% of overall variance.

  2. KMeans Clustering: Used for the purpose of determining the proper number of clusters. Used the “within cluster sum of squares” measure to evaluate a range of cluster numbers, finding the optimal number via plotting for the elbow point. This was the method used for cluster assignments.

  3. Hierarchical Clustering: Used for the purpose of validating the proper number of clusters. Used the “within cluster sum of squared error” measure to evaluate a range of cluster numbers, finding the optimal number via the elbow point.

  4. Elastic Net Logistic Regression: Used for the purpose of exploring top features that explain the unique characteristics of each cluster. Through these insights, I derived meaningful cluster labels. Tuned lambda and alpha (mixture) hyperparameters via cross-validation and minimized the performance metric Root Mean Squared Error.

  5. Neural Network: Used for the purpose of predicting cluster labels for test data. Tuned penalty, hidden_units, and epochs hyperparameters for maximizing performance metric AUC on training data. Chosen as final model for prediction.

  6. Boosted Trees: Used for the purpose of validating performance results of Neural Network model. Tuned mtry, trees, tree_depth, and learn_rate hyperparameters for maximizing performance metric AUC on training data.
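The elbow procedure described in (2) and (3) can be sketched as follows; this is a minimal illustration on synthetic two-cluster data (not the project’s data set), assuming only base R’s `kmeans()`:

```r
# Minimal elbow-method sketch on synthetic two-cluster data (illustrative
# only; the project applied this idea to the PCA-reduced NBA data).
set.seed(42)
toy <- rbind(
    matrix(rnorm(100, mean = 0), ncol = 2),
    matrix(rnorm(100, mean = 5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
    kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# plot(1:8, wss, type = "b")  # the bend ("elbow") suggests the cluster count
```

The bend where additional clusters stop yielding large WSS reductions is the candidate cluster count.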

Prompt #3

Did all methods support your conclusions or did some provide conflicting results? If they provided conflicting results, how did you reconcile the differences?

  • Clustering:
    • KMeans and Hierarchical clustering were both employed to validate results.
    • Both gave similar results, but KMeans proved more consistent.
    • Hierarchical clustering via mclust produced a far more jagged, harder-to-read elbow; in some instances, large portions of the curve showed only marginal reductions in “WSSE” for each additional cluster.
    • KMeans produced a smooth, far easier-to-spot elbow point. With hierarchical clustering confirming that KMeans was reasonable, I chose that approach for assigning clusters.
  • Final Prediction Models:
    • Neural Network and XGBoost were both very performant, producing AUC values exceeding 0.81.
    • Neural Network fit across the cross-validated tuning grid MUCH faster than XGBoost, which was surprising.
    • Did not observe any further conflicting results.
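The cross-check between the two clustering approaches can be sketched like so; a minimal illustration on synthetic, well-separated data using base R’s `kmeans()`, `hclust()`, and `cutree()` (the project itself worked through tidyclust on the PCA scores):

```r
# Sketch of cross-checking k-means against hierarchical clustering
# (synthetic data; illustrative only, not the project's data set).
set.seed(7)
toy <- rbind(
    matrix(rnorm(60, mean = 0), ncol = 2),
    matrix(rnorm(60, mean = 6), ncol = 2)
)

km <- kmeans(toy, centers = 2, nstart = 10)
hc <- cutree(hclust(dist(toy), method = "ward.D2"), k = 2)

# Cluster labels are arbitrary, so measure agreement up to relabeling
agreement <- max(mean(km$cluster == hc), mean(km$cluster != hc))
```

High agreement between the two assignments is what justified trusting the KMeans clusters.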

Assignment Workflow

Data Prep

Load Libraries & Data Set

We’ll use the following libraries and their dependencies:

library("tidyverse")            # Used for general data manipulation and visualization
library("tidymodels")           # Used for its modeling framework
library("tidyclust")            # Used for clustering approaches
library("dataPreparation")      # Used for scaling, where needed
library("kernlab")              # Used for weighted kernel k-means
library("FactoMineR")           # Used for PCA
library("VIM")                  # Used for training data imputation
setwd("full-projects/nba-player-position-roles/R")

Let’s import the source data:

nba_stats <- read.csv("../nba-player-data.csv")

The source data is comprised of ~200 features relating to the top 500 NBA players by minutes played over the past 5 seasons. The measures encompass offensive and defensive activity, including shooting, passing, rebounding, etc. We can review the measures below:

glimpse(nba_stats)
Rows: 500
Columns: 201
$ assists_assists                             <int> 987, 1448, 1334, 1676, 110…
$ assists_assist_points                       <int> 2404, 3675, 3345, 4160, 27…
$ assists_two_pt_assists                      <int> 557, 669, 657, 868, 586, 1…
$ assists_three_pt_assists                    <int> 430, 779, 677, 808, 519, 1…
$ assists_at_rim_assists                      <int> 339, 405, 449, 530, 348, 6…
$ assists_short_mid_range_assists             <int> 160, 193, 169, 234, 106, 2…
$ assists_long_mid_range_assists              <int> 58, 71, 39, 104, 132, 174,…
$ assists_corner3assists                      <int> 176, 264, 233, 228, 173, 2…
$ assists_arc3assists                         <int> 254, 515, 444, 580, 346, 7…
$ fouls_fouls                                 <int> 571, 658, 666, 689, 747, 1…
$ fouls_shooting_fouls                        <int> 286, 231, 268, 363, 469, 4…
$ fouls_loose_ball_fouls                      <int> 23, 33, 23, 26, 45, 141, 4…
$ fouls_loose_ball_fouls_drawn                <int> 25, 35, 14, 29, 80, 152, 4…
$ fouls_offensive_fouls                       <int> 17, 80, 34, 27, 71, 155, 6…
$ fouls_charge_fouls                          <int> 5, 51, 41, 23, 15, 45, 56,…
$ fouls_offensive_fouls_drawn                 <int> 85, 25, 13, 59, 13, 31, 29…
$ fouls_charge_fouls_drawn                    <int> 2, 0, 1, 22, 4, 7, 5, 10, …
$ fouls_clear_path_fouls                      <int> 0, 2, 2, 1, 0, 0, 0, 0, 0,…
$ fouls_transition_take_fouls                 <int> 1, 5, 4, 6, 3, 1, 3, 2, 1,…
$ fouls_transition_take_fouls_drawn           <int> 2, 4, 3, 1, 0, 2, 2, 8, 2,…
$ fouls_defensive_3_seconds_violations        <int> 0, 3, 10, 5, 2, 9, 12, 24,…
$ fouls_fouls_drawn                           <int> 829, 1471, 1084, 1681, 687…
$ free_fta                                    <int> 979, 2103, 1581, 2293, 625…
$ free_technical_free_throw_trips             <int> 31, 122, 44, 84, 22, 0, 6,…
$ free_two_pt_shooting_fouls_drawn            <int> 422, 781, 750, 1109, 273, …
$ free_x2pt_and_1_free_throw_trips            <int> 119, 172, 211, 289, 85, 20…
$ free_three_pt_shooting_fouls_drawn          <int> 23, 84, 15, 16, 0, 2, 2, 1…
$ free_x3pt_and_1_free_throw_trips            <int> 5, 12, 4, 9, 0, 0, 1, 2, 0…
$ free_non_shooting_fouls_drawn               <int> 84, 187, 107, 122, 71, 147…
$ free_shooting_fouls_drawn_pct               <dbl> 0.09357218, 0.12273441, 0.…
$ free_two_pt_shooting_fouls_drawn_pct        <dbl> 0.14506703, 0.18303258, 0.…
$ free_three_pt_shooting_fouls_drawn_pct      <dbl> 0.011982571, 0.028832117, …
$ misc_plus_minus                             <int> 918, 2130, 553, 35, -653, …
$ misc_on_off_rtg                             <dbl> 117.4043, 120.2431, 115.39…
$ misc_on_def_rtg                             <dbl> 113.4244, 111.2182, 113.10…
$ misc_first_chance_points                    <int> 5207, 7913, 7005, 7063, 54…
$ misc_three_pt_off_rebounded_pct             <dbl> 0.2243415, 0.2360437, 0.21…
$ misc_at_rim_off_rebounded_pct               <dbl> 0.3426573, 0.3629191, 0.33…
$ misc_short_mid_range_off_rebounded_pct      <dbl> 0.2957983, 0.3379501, 0.31…
$ misc_long_mid_range_off_rebounded_pct       <dbl> 0.1742424, 0.2364865, 0.21…
$ misc_blocks                                 <int> 202, 186, 196, 131, 255, 1…
$ misc_blocked2s                              <int> 192, 165, 180, 129, 249, 1…
$ misc_blocked3s                              <int> 10, 21, 16, 2, 6, 3, 18, 0…
$ misc_blocked_at_rim                         <int> 113, 76, 97, 73, 164, 92, …
$ misc_blocked_short_mid_range                <int> 72, 76, 74, 54, 83, 59, 48…
$ misc_blocked_long_mid_range                 <int> 7, 13, 9, 2, 2, 3, 1, 6, 1…
$ misc_blocked_corner3                        <int> 3, 3, 6, 1, 1, 2, 6, 0, 1,…
$ misc_blocked_arc3                           <int> 7, 18, 10, 1, 5, 1, 12, 0,…
$ misc_recovered_blocks                       <int> 120, 106, 107, 73, 157, 10…
$ misc_blocks_recovered_pct                   <dbl> 0.5940594, 0.5698925, 0.54…
$ misc_steals                                 <int> 369, 331, 445, 326, 265, 2…
$ misc_lost_ball_steals                       <int> 85, 115, 178, 101, 67, 78,…
$ misc_bad_pass_steals                        <int> 284, 216, 267, 225, 198, 2…
$ misc_defensive_goaltends                    <int> 3, 9, 6, 1, 4, 9, 5, 0, 1,…
$ profile_name                                <chr> "Mikal Bridges", "Jayson T…
$ profile_team_abbreviation                   <chr> "NYK", "BOS", "MIN", "SAC"…
$ profile_games_played                        <int> 342, 311, 325, 310, 326, 3…
$ profile_minutes                             <int> 11898, 11237, 11229, 11182…
$ profile_height_in                           <int> 78, 80, 76, 78, 82, 82, 80…
$ profile_weight_lbs                          <int> 209, 210, 225, 220, 260, 2…
$ rebounds_rebounds                           <int> 1475, 2535, 1691, 1413, 35…
$ rebounds_def_rebounds                       <int> 1156, 2243, 1457, 1230, 28…
$ rebounds_ft_def_rebounds                    <int> 53, 143, 32, 69, 221, 218,…
$ rebounds_def_ft_rebound_pct                 <dbl> 0.09397163, 0.29183673, 0.…
$ rebounds_def_two_pt_rebounds                <int> 508, 1005, 638, 569, 1432,…
$ rebounds_def_two_pt_rebound_pct             <dbl> 0.08090460, 0.17065716, 0.…
$ rebounds_def_three_pt_rebounds              <int> 595, 1095, 787, 592, 1210,…
$ rebounds_def_three_pt_rebound_pct           <dbl> 0.11639280, 0.21361686, 0.…
$ rebounds_def_fg_rebound_pct                 <dbl> 0.09683083, 0.19064911, 0.…
$ rebounds_off_rebounds                       <int> 319, 292, 234, 183, 706, 9…
$ rebounds_ft_off_rebounds                    <int> 5, 3, 8, 5, 6, 11, 8, 6, 9…
$ rebounds_off_ft_rebound_pct                 <dbl> 0.010845987, 0.006726457, …
$ rebounds_off_two_pt_rebounds                <int> 137, 147, 120, 126, 501, 6…
$ rebounds_off_two_pt_rebound_pct             <dbl> 0.02258490, 0.03072100, 0.…
$ rebounds_off_three_pt_rebounds              <int> 177, 142, 106, 52, 199, 31…
$ rebounds_off_three_pt_rebound_pct           <dbl> 0.03506339, 0.02479050, 0.…
$ rebounds_off_fg_rebound_pct                 <dbl> 0.02825265, 0.02748977, 0.…
$ rebounds_def_at_rim_rebound_pct             <dbl> 0.06641545, 0.13491635, 0.…
$ rebounds_def_short_mid_range_rebound_pct    <dbl> 0.07489598, 0.17822142, 0.…
$ rebounds_def_long_mid_range_rebound_pct     <dbl> 0.11871069, 0.20608899, 0.…
$ rebounds_def_arc3rebound_pct                <dbl> 0.11399351, 0.21528977, 0.…
$ rebounds_def_corner3rebound_pct             <dbl> 0.12511333, 0.20728291, 0.…
$ rebounds_off_at_rim_rebound_pct             <dbl> 0.02734148, 0.03848467, 0.…
$ rebounds_off_short_mid_range_rebound_pct    <dbl> 0.01847334, 0.03116279, 0.…
$ rebounds_off_long_mid_range_rebound_pct     <dbl> 0.025033829, 0.016460905, …
$ rebounds_off_arc3rebound_pct                <dbl> 0.03591009, 0.02408283, 0.…
$ rebounds_off_corner3rebound_pct             <dbl> 0.032857143, 0.027237354, …
$ rebounds_self_o_reb                         <int> 32, 103, 74, 62, 110, 154,…
$ rebounds_self_o_reb_pct                     <dbl> 0.01536246, 0.03385930, 0.…
$ scoring_off_poss                            <int> 24063, 22788, 23186, 22708…
$ scoring_points                              <int> 5792, 8598, 7527, 7597, 62…
$ scoring_fg2m                                <int> 1447, 1948, 1798, 2500, 20…
$ scoring_fg2a                                <int> 2606, 3658, 3548, 4818, 37…
$ scoring_fg2pct                              <dbl> 0.5552571, 0.5325314, 0.50…
$ scoring_fg3m                                <int> 690, 974, 892, 202, 548, 1…
$ scoring_fg3a                                <int> 1819, 2673, 2473, 626, 154…
$ scoring_fg3pct                              <dbl> 0.3793293, 0.3643846, 0.36…
$ scoring_non_heave_fg3pct                    <dbl> 0.3815061, 0.3647940, 0.36…
$ scoring_ft_points                           <int> 828, 1780, 1255, 1991, 513…
$ scoring_pts_assisted2s                      <int> 1816, 1624, 1280, 1262, 29…
$ scoring_pts_unassisted2s                    <int> 1078, 2272, 2316, 3738, 12…
$ scoring_pts_assisted3s                      <int> 1932, 1575, 1482, 525, 163…
$ scoring_pts_unassisted3s                    <int> 138, 1347, 1194, 81, 6, 18…
$ scoring_assisted2s_pct                      <dbl> 0.6275052, 0.4168378, 0.35…
$ scoring_non_putbacks_assisted2s_pct         <dbl> 0.6570188, 0.4344569, 0.36…
$ scoring_assisted3s_pct                      <dbl> 0.9333333, 0.5390144, 0.55…
$ scoring_fg3a_pct                            <dbl> 0.411073446, 0.422208182, …
$ scoring_shot_quality_avg                    <dbl> 0.5310678, 0.5169488, 0.53…
$ scoring_efg_pct                             <dbl> 0.5609040, 0.5384615, 0.52…
$ scoring_ts_pct                              <dbl> 0.5967909, 0.5903598, 0.56…
$ scoring_pts_putbacks                        <int> 130, 158, 78, 50, 442, 524…
$ scoring_fg2a_blocked                        <int> 156, 272, 239, 289, 201, 2…
$ scoring_fg2a_pct_blocked                    <dbl> 0.05986186, 0.07435757, 0.…
$ scoring_fg3a_blocked                        <int> 10, 11, 7, 6, 2, 3, 2, 2, …
$ scoring_fg3a_pct_blocked                    <dbl> 0.005497526, 0.004115226, …
$ scoring_usage                               <dbl> 19.56094, 31.35144, 29.148…
$ second_second_chance_off_poss               <int> 2532, 2479, 2497, 2241, 20…
$ second_second_chance_points                 <int> 585, 685, 522, 534, 805, 1…
$ second_second_chance_points_pct             <dbl> 0.10100138, 0.07966969, 0.…
$ second_second_chance_fg2m                   <int> 145, 172, 106, 178, 327, 3…
$ second_second_chance_fg2a                   <int> 252, 318, 238, 324, 583, 6…
$ second_second_chance_fg2pct                 <dbl> 0.5753968, 0.5408805, 0.44…
$ second_second_chance_fg3m                   <int> 79, 78, 79, 16, 23, 13, 37…
$ second_second_chance_fg3a                   <int> 181, 209, 185, 47, 77, 40,…
$ second_second_chance_fg3pct                 <dbl> 0.4364641, 0.3732057, 0.42…
$ second_second_chance_ft_points              <int> 58, 107, 73, 130, 82, 213,…
$ second_second_chance_efg_pct                <dbl> 0.6085450, 0.5483871, 0.53…
$ second_second_chance_ts_pct                 <dbl> 0.6334056, 0.5862069, 0.56…
$ second_second_chance_shot_quality_avg       <dbl> 0.5340895, 0.5291019, 0.53…
$ second_second_chance_at_rim_fgm             <int> 78, 127, 82, 59, 259, 344,…
$ second_second_chance_at_rim_fga             <int> 110, 200, 135, 87, 426, 53…
$ second_second_chance_at_rim_frequency       <dbl> 0.2540416, 0.3795066, 0.31…
$ second_second_chance_at_rim_accuracy        <dbl> 0.7090909, 0.6350000, 0.60…
$ second_second_chance_at_rim_pct_assisted    <dbl> 0.32051282, 0.25984252, 0.…
$ second_second_chance_corner3fgm             <int> 51, 17, 17, 3, 1, 2, 9, 0,…
$ second_second_chance_corner3fga             <int> 102, 44, 45, 12, 7, 4, 21,…
$ second_second_chance_corner3frequency       <dbl> 0.235565820, 0.083491461, …
$ second_second_chance_corner3accuracy        <dbl> 0.5000000, 0.3863636, 0.37…
$ second_second_chance_corner3pct_assisted    <dbl> 0.9607843, 0.8235294, 1.00…
$ second_second_chance_arc3fgm                <int> 28, 61, 62, 13, 22, 11, 28…
$ second_second_chance_arc3fga                <int> 79, 165, 140, 35, 70, 36, …
$ second_second_chance_arc3frequency          <dbl> 0.18244804, 0.31309298, 0.…
$ second_second_chance_arc3accuracy           <dbl> 0.3544304, 0.3696970, 0.44…
$ second_second_chance_arc3pct_assisted       <dbl> 0.9642857, 0.7704918, 0.69…
$ second_second_chance_turnovers              <int> 34, 43, 44, 26, 39, 73, 40…
$ shot_shot_quality_avg                       <dbl> 0.5310678, 0.5169488, 0.53…
$ shot_at_rim_fg3a_frequency                  <dbl> 0.6352542, 0.6907282, 0.71…
$ shot_avg2pt_shot_distance                   <dbl> 7.803492, 7.222635, 6.7562…
$ shot_avg3pt_shot_distance                   <dbl> 24.83178, 26.09386, 25.710…
$ shot_at_rim_fgm                             <int> 704, 1189, 1178, 715, 865,…
$ shot_at_rim_fga                             <int> 992, 1700, 1848, 1089, 128…
$ shot_at_rim_frequency                       <dbl> 0.2241808, 0.2685200, 0.30…
$ shot_at_rim_accuracy                        <dbl> 0.7096774, 0.6994118, 0.63…
$ shot_unblocked_at_rim_accuracy              <dbl> 0.7652174, 0.7612036, 0.68…
$ shot_at_rim_pct_assisted                    <dbl> 0.7301136, 0.4566863, 0.39…
$ shot_at_rim_pct_blocked                     <dbl> 0.07258064, 0.08117647, 0.…
$ shot_short_mid_range_fgm                    <int> 572, 459, 381, 851, 876, 4…
$ shot_short_mid_range_fga                    <int> 1175, 1191, 1019, 1672, 17…
$ shot_short_mid_range_frequency              <dbl> 0.26553672, 0.18812194, 0.…
$ shot_short_mid_range_accuracy               <dbl> 0.4868085, 0.3853904, 0.37…
$ shot_unblocked_short_mid_range_accuracy     <dbl> 0.5190563, 0.4305816, 0.41…
$ shot_short_mid_range_pct_assisted           <dbl> 0.54195804, 0.33551198, 0.…
$ shot_short_mid_range_pct_blocked            <dbl> 0.06212766, 0.10495382, 0.…
$ shot_long_mid_range_fgm                     <int> 171, 300, 239, 934, 315, 1…
$ shot_long_mid_range_fga                     <int> 439, 767, 681, 2057, 733, …
$ shot_long_mid_range_frequency               <dbl> 0.099209040, 0.121149897, …
$ shot_long_mid_range_accuracy                <dbl> 0.3895216, 0.3911343, 0.35…
$ shot_unblocked_long_mid_range_accuracy      <dbl> 0.3995327, 0.3957784, 0.35…
$ shot_long_mid_range_pct_assisted            <dbl> 0.4912281, 0.3833333, 0.26…
$ shot_long_mid_range_pct_blocked             <dbl> 0.025056948, 0.011734029, …
$ shot_corner3fgm                             <int> 361, 110, 104, 74, 40, 15,…
$ shot_corner3fga                             <int> 850, 277, 295, 221, 107, 4…
$ shot_corner3frequency                       <dbl> 0.192090395, 0.043752962, …
$ shot_corner3accuracy                        <dbl> 0.4247059, 0.3971119, 0.35…
$ shot_unblocked_corner3accuracy              <dbl> 0.4267139, 0.3971119, 0.35…
$ shot_corner3pct_assisted                    <dbl> 0.9806094, 0.7909091, 0.86…
$ shot_corner3pct_blocked                     <dbl> 0.004705882, 0.000000000, …
$ shot_arc3fgm                                <int> 329, 864, 788, 128, 508, 1…
$ shot_arc3fga                                <int> 969, 2396, 2178, 405, 1440…
$ shot_arc3frequency                          <dbl> 0.218983051, 0.378455220, …
$ shot_arc3accuracy                           <dbl> 0.3395253, 0.3606010, 0.36…
$ shot_unblocked_arc3accuracy                 <dbl> 0.3416407, 0.3622642, 0.36…
$ shot_arc3pct_assisted                       <dbl> 0.8814590, 0.5069444, 0.51…
$ shot_arc3pct_blocked                        <dbl> 0.006191950, 0.004590985, …
$ shot_non_heave_arc3fgm                      <int> 328, 864, 788, 128, 508, 1…
$ shot_non_heave_arc3fga                      <int> 955, 2393, 2161, 395, 1438…
$ shot_non_heave_arc3accuracy                 <dbl> 0.3434555, 0.3610531, 0.36…
$ shot_heave_attempts                         <int> 13, 3, 17, 10, 2, 1, 5, 23…
$ shot_heave_makes                            <int> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ turnovers_turnovers                         <int> 451, 852, 922, 622, 565, 9…
$ turnovers_live_ball_turnovers               <int> 297, 464, 540, 404, 320, 5…
$ turnovers_dead_ball_turnovers               <int> 154, 388, 382, 218, 245, 4…
$ turnovers_live_ball_turnover_pct            <dbl> 0.6585366, 0.5446009, 0.58…
$ turnovers_lost_ball_turnovers               <int> 102, 203, 253, 156, 148, 2…
$ turnovers_lost_ball_out_of_bounds_turnovers <int> 38, 59, 68, 48, 21, 74, 41…
$ turnovers_bad_pass_turnovers                <int> 195, 261, 287, 248, 172, 2…
$ turnovers_bad_pass_out_of_bounds_turnovers  <int> 56, 124, 141, 63, 75, 107,…
$ turnovers_travels                           <int> 18, 44, 63, 40, 34, 72, 47…
$ turnovers_x3second_violations               <int> 0, 2, 2, 0, 9, 15, 0, 6, 1…
$ turnovers_step_out_of_bounds_turnovers      <int> 15, 12, 14, 10, 7, 5, 9, 3…
$ turnovers_offensive_goaltends               <int> 0, 1, 2, 0, 3, 1, 1, 4, 1,…

Data Quality

Let’s see if we have any missing or empty values we need to worry about:

# Check for missing or empty values
na_summary <- sapply(nba_stats, function(col) {
    # is.na() catches missing values; col == "" catches empty strings
    sum(is.na(col) | col == "", na.rm = TRUE)
})

any(na_summary > 0)
[1] FALSE

No, all columns and rows feature complete observations.

Scale Variables

We’re going to scale the variables in order to properly apply the PCA algorithm. First, we’ll separate the numeric and ID features from the data.

nba_numeric <- nba_stats |> select(
    where(is.numeric) &
        -c(profile_height_in, profile_weight_lbs)
)

nba_ids <- nba_stats |> select(
    where(is.character) |
        c(profile_games_played, profile_minutes, profile_height_in, profile_weight_lbs)
)

In order to avoid sensitivity to volume (i.e. one player playing more than another), we need to create volume-adjusted measures based on minutes (i.e. points per “36 minutes”, an industry standard). We’ll do this for all integer columns (so as to preserve the integrity of “rate” statistics, like “three-point percentage”). We’ll then drop the games and minutes features:

nba_adjusted <- nba_numeric |>
    mutate(across(
        where(is.integer) & -c(profile_games_played, profile_minutes),
        ~ . * 36.0 / profile_minutes
    )) |>
    select(-c(profile_games_played, profile_minutes))
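As a quick sanity check of the adjustment formula, consider a hypothetical player (numbers not taken from the data set) with 1,100 points in 1,800 minutes:

```r
# Hypothetical per-36 example: 1100 points in 1800 minutes
points_per36 <- 1100 * 36 / 1800
points_per36
# 22
```

So two players with identical scoring rates but different playing time end up with the same per-36 value.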

And finally, we scale the numeric variables:

nba_scaled <- scale(nba_adjusted)

Principal Component Analysis

Using the scaled variables, let’s perform PCA. We’ll keep both versions of the data set (with and without PCA) at our disposal for clustering.

nba_pca <- princomp(nba_scaled)
summary(nba_pca)
Importance of components:
                         Comp.1    Comp.2    Comp.3     Comp.4    Comp.5
Standard deviation     7.907895 5.1462278 3.4041457 2.66083766 2.4306877
Proportion of Variance 0.321334 0.1360858 0.0595458 0.03638075 0.0303594
Cumulative Proportion  0.321334 0.4574198 0.5169656 0.55334632 0.5837057
                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
Standard deviation     2.38120145 2.28542661 1.99469866 1.88094633 1.75536205
Proportion of Variance 0.02913581 0.02683919 0.02044511 0.01817974 0.01583318
Cumulative Proportion  0.61284153 0.63968072 0.66012583 0.67830557 0.69413876
                         Comp.11    Comp.12    Comp.13    Comp.14     Comp.15
Standard deviation     1.6568553 1.53539884 1.47993093 1.43593006 1.389579004
Proportion of Variance 0.0141060 0.01211371 0.01125428 0.01059501 0.009922048
Cumulative Proportion  0.7082448 0.72035847 0.73161275 0.74220776 0.752129812
                         Comp.16     Comp.17     Comp.18     Comp.19
Standard deviation     1.3641266 1.331451758 1.304909173 1.255396634
Proportion of Variance 0.0095619 0.009109315 0.008749745 0.008098354
Cumulative Proportion  0.7616917 0.770801027 0.779550773 0.787649127
                           Comp.20     Comp.21    Comp.22     Comp.23
Standard deviation     1.215643107 1.202024017 1.17196870 1.159173138
Proportion of Variance 0.007593588 0.007424396 0.00705776 0.006904488
Cumulative Proportion  0.795242715 0.802667111 0.80972487 0.816629358
                           Comp.24     Comp.25    Comp.26     Comp.27
Standard deviation     1.124481456 1.109380008 1.09655417 1.066552118
Proportion of Variance 0.006497398 0.006324053 0.00617867 0.005845195
Cumulative Proportion  0.823126756 0.829450809 0.83562948 0.841474675
                           Comp.28     Comp.29     Comp.30     Comp.31
Standard deviation     1.049545355 1.027979984 1.008532128 1.002988524
Proportion of Variance 0.005660272 0.005430054 0.005226541 0.005169241
Cumulative Proportion  0.847134946 0.852565001 0.857791541 0.862960782
                          Comp.32     Comp.33     Comp.34     Comp.35
Standard deviation     0.97724998 0.960136207 0.945637156 0.934723089
Proportion of Variance 0.00490734 0.004736969 0.004594983 0.004489529
Cumulative Proportion  0.86786812 0.872605092 0.877200074 0.881689604
                           Comp.36     Comp.37     Comp.38     Comp.39
Standard deviation     0.916978377 0.910562982 0.898229189 0.878372129
Proportion of Variance 0.004320689 0.004260444 0.004145808 0.003964532
Cumulative Proportion  0.886010293 0.890270737 0.894416544 0.898381077
                           Comp.40     Comp.41     Comp.42     Comp.43
Standard deviation     0.857086920 0.839611729 0.834929882 0.827178550
Proportion of Variance 0.003774719 0.003622362 0.003582077 0.003515875
Cumulative Proportion  0.902155795 0.905778157 0.909360234 0.912876108
                           Comp.44     Comp.45     Comp.46     Comp.47
Standard deviation     0.812988721 0.798641549 0.781463388 0.767881925
Proportion of Variance 0.003396283 0.003277469 0.003137994 0.003029868
Cumulative Proportion  0.916272391 0.919549861 0.922687855 0.925717723
                          Comp.48     Comp.49     Comp.50     Comp.51
Standard deviation     0.76373612 0.734865089 0.720837237 0.708591892
Proportion of Variance 0.00299724 0.002774918 0.002669988 0.002580045
Cumulative Proportion  0.92871496 0.931489880 0.934159868 0.936739913
                           Comp.52    Comp.53     Comp.54    Comp.55
Standard deviation     0.699893597 0.69560943 0.682230518 0.66152373
Proportion of Variance 0.002517091 0.00248637 0.002391647 0.00224867
Cumulative Proportion  0.939257004 0.94174337 0.944135021 0.94638369
                           Comp.56     Comp.57     Comp.58     Comp.59
Standard deviation     0.656140029 0.639355610 0.630872837 0.621482067
Proportion of Variance 0.002212218 0.002100486 0.002045119 0.001984687
Cumulative Proportion  0.948595909 0.950696395 0.952741514 0.954726201
                           Comp.60     Comp.61     Comp.62     Comp.63
Standard deviation     0.606892195 0.598237879 0.589569555 0.579751076
Proportion of Variance 0.001892596 0.001839004 0.001786097 0.001727102
Cumulative Proportion  0.956618797 0.958457801 0.960243897 0.961970999
                           Comp.64     Comp.65     Comp.66     Comp.67
Standard deviation     0.562726007 0.552033414 0.547205173 0.542456886
Proportion of Variance 0.001627155 0.001565906 0.001538634 0.001512047
Cumulative Proportion  0.963598154 0.965164060 0.966702693 0.968214740
                           Comp.68     Comp.69     Comp.70     Comp.71
Standard deviation     0.530315310 0.522451609 0.519397356 0.508823656
Proportion of Variance 0.001445118 0.001402578 0.001386227 0.001330361
Cumulative Proportion  0.969659858 0.971062436 0.972448663 0.973779023
                           Comp.72     Comp.73     Comp.74     Comp.75
Standard deviation     0.489730309 0.478104943 0.467514197 0.462894960
Proportion of Variance 0.001232392 0.001174577 0.001123116 0.001101032
Cumulative Proportion  0.975011415 0.976185992 0.977309107 0.978410139
                           Comp.76    Comp.77     Comp.78      Comp.79
Standard deviation     0.459996993 0.45336006 0.444074268 0.4281947615
Proportion of Variance 0.001087289 0.00105614 0.001013319 0.0009421446
Cumulative Proportion  0.979497427 0.98055357 0.981566886 0.9825090303
                            Comp.80      Comp.81      Comp.82      Comp.83
Standard deviation     0.4187124287 0.4039841057 0.3930316736 0.3881281819
Proportion of Variance 0.0009008792 0.0008386165 0.0007937614 0.0007740789
Cumulative Proportion  0.9834099095 0.9842485260 0.9850422874 0.9858163662
                            Comp.84      Comp.85      Comp.86     Comp.87
Standard deviation     0.3813032905 0.3714911500 0.3638091857 0.350640222
Proportion of Variance 0.0007470952 0.0007091397 0.0006801147 0.000631769
Cumulative Proportion  0.9865634615 0.9872726011 0.9879527158 0.988584485
                            Comp.88      Comp.89      Comp.90      Comp.91
Standard deviation     0.3483221088 0.3452766491 0.3381688860 0.3255143029
Proportion of Variance 0.0006234433 0.0006125891 0.0005876275 0.0005444713
Cumulative Proportion  0.9892079281 0.9898205172 0.9904081447 0.9909526160
                            Comp.92     Comp.93     Comp.94      Comp.95
Standard deviation     0.3169023497 0.312833968 0.310982739 0.2999009790
Proportion of Variance 0.0005160429 0.000502878 0.000496944 0.0004621581
Cumulative Proportion  0.9914686589 0.991971537 0.992468481 0.9929306390
                            Comp.96      Comp.97      Comp.98      Comp.99
Standard deviation     0.2969837314 0.2886732854 0.2839062227 0.2786547326
Proportion of Variance 0.0004532107 0.0004282014 0.0004141758 0.0003989952
Cumulative Proportion  0.9933838497 0.9938120511 0.9942262268 0.9946252221
                           Comp.100     Comp.101     Comp.102     Comp.103
Standard deviation     0.2670630443 0.2626119636 0.2506656178 0.2387134134
Proportion of Variance 0.0003664903 0.0003543756 0.0003228675 0.0002928117
Cumulative Proportion  0.9949917123 0.9953460880 0.9956689555 0.9959617673
                           Comp.104     Comp.105     Comp.106     Comp.107
Standard deviation     0.2378688367 0.2229413349 0.2168524945 0.2121617325
Proportion of Variance 0.0002907435 0.0002553971 0.0002416371 0.0002312964
Cumulative Proportion  0.9962525107 0.9965079079 0.9967495450 0.9969808414
                           Comp.108     Comp.109    Comp.110     Comp.111
Standard deviation     0.2066857499 0.1962149302 0.191377167 0.1867972530
Proportion of Variance 0.0002195108 0.0001978331 0.000188198 0.0001792982
Cumulative Proportion  0.9972003523 0.9973981853 0.997586383 0.9977656815
                           Comp.112     Comp.113     Comp.114     Comp.115
Standard deviation     0.1795094417 0.1750229081 0.1718757618 0.1656258088
Proportion of Variance 0.0001655806 0.0001574072 0.0001517973 0.0001409584
Cumulative Proportion  0.9979312621 0.9980886694 0.9982404667 0.9983814250
                           Comp.116     Comp.117     Comp.118     Comp.119
Standard deviation     0.1605428578 0.1578360974 0.1494406610 0.1433299991
Proportion of Variance 0.0001324393 0.0001280111 0.0001147552 0.0001055623
Cumulative Proportion  0.9985138643 0.9986418754 0.9987566306 0.9988621930
                           Comp.120     Comp.121     Comp.122     Comp.123
Standard deviation     1.329839e-01 0.1305931473 1.276026e-01 1.256506e-01
Proportion of Variance 9.087267e-05 0.0000876346 8.366689e-05 8.112669e-05
Cumulative Proportion  9.989531e-01 0.9990407002 9.991244e-01 9.992055e-01
                           Comp.124     Comp.125     Comp.126     Comp.127
Standard deviation     1.199905e-01 1.117375e-01 1.081000e-01 1.039045e-01
Proportion of Variance 7.398242e-05 6.415533e-05 6.004632e-05 5.547577e-05
Cumulative Proportion  9.992795e-01 9.993436e-01 9.994037e-01 9.994592e-01
                           Comp.128     Comp.129     Comp.130     Comp.131
Standard deviation     1.003603e-01 9.682056e-02 9.374808e-02 8.973321e-02
Proportion of Variance 5.175581e-05 4.816927e-05 4.516059e-05 4.137531e-05
Cumulative Proportion  9.995109e-01 9.995591e-01 9.996042e-01 9.996456e-01
                           Comp.132     Comp.133     Comp.134     Comp.135
Standard deviation     8.790406e-02 0.0844869330 7.616143e-02 7.421226e-02
Proportion of Variance 3.970569e-05 0.0000366787 2.980609e-05 2.829998e-05
Cumulative Proportion  9.996853e-01 0.9997219990 9.997518e-01 9.997801e-01
                           Comp.136     Comp.137     Comp.138     Comp.139
Standard deviation     6.815107e-02 6.730069e-02 6.484186e-02 6.177823e-02
Proportion of Variance 2.386603e-05 2.327415e-05 2.160458e-05 1.961127e-05
Cumulative Proportion  9.998040e-01 9.998272e-01 9.998488e-01 9.998685e-01
                           Comp.140     Comp.141     Comp.142     Comp.143
Standard deviation     5.866739e-02 5.572205e-02 5.361931e-02 5.199413e-02
Proportion of Variance 1.768595e-05 1.595472e-05 1.477329e-05 1.389132e-05
Cumulative Proportion  9.998861e-01 9.999021e-01 9.999169e-01 9.999308e-01
                           Comp.144     Comp.145     Comp.146     Comp.147
Standard deviation     4.881348e-02 4.206109e-02 3.979695e-02 3.752638e-02
Proportion of Variance 1.224375e-05 9.090668e-06 8.138312e-06 7.236161e-06
Cumulative Proportion  9.999430e-01 9.999521e-01 9.999602e-01 9.999675e-01
                           Comp.148     Comp.149     Comp.150     Comp.151
Standard deviation     3.598320e-02 3.098396e-02 3.024330e-02 2.483474e-02
Proportion of Variance 6.653260e-06 4.932972e-06 4.699951e-06 3.169233e-06
Cumulative Proportion  9.999741e-01 9.999791e-01 9.999838e-01 9.999869e-01
                           Comp.152     Comp.153     Comp.154     Comp.155
Standard deviation     2.028335e-02 1.869018e-02 1.830842e-02 1.732935e-02
Proportion of Variance 2.114045e-06 1.794988e-06 1.722410e-06 1.543119e-06
Cumulative Proportion  9.999890e-01 9.999908e-01 9.999926e-01 9.999941e-01
                           Comp.156     Comp.157     Comp.158     Comp.159
Standard deviation     1.607269e-02 1.482187e-02 1.359563e-02 1.154321e-02
Proportion of Variance 1.327431e-06 1.128862e-06 9.498035e-07 6.846808e-07
Cumulative Proportion  9.999954e-01 9.999966e-01 9.999975e-01 9.999982e-01
                           Comp.160     Comp.161     Comp.162     Comp.163
Standard deviation     9.457766e-03 7.895078e-03 7.487837e-03 7.187040e-03
Proportion of Variance 4.596339e-07 3.202932e-07 2.881029e-07 2.654208e-07
Cumulative Proportion  9.999987e-01 9.999990e-01 9.999993e-01 9.999995e-01
                           Comp.164     Comp.165     Comp.166     Comp.167
Standard deviation     6.987417e-03 6.537667e-03 1.141043e-07 4.703629e-08
Proportion of Variance 2.508812e-07 2.196243e-07 6.690194e-17 1.136844e-17
Cumulative Proportion  9.999998e-01 1.000000e+00 1.000000e+00 1.000000e+00
                           Comp.168     Comp.169     Comp.170     Comp.171
Standard deviation     2.919608e-08 2.340045e-08 1.517024e-08 1.481497e-08
Proportion of Variance 4.380101e-18 2.813736e-18 1.182551e-18 1.127812e-18
Cumulative Proportion  1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
                           Comp.172     Comp.173 Comp.174 Comp.175 Comp.176
Standard deviation     1.461163e-08 1.262019e-08        0        0        0
Proportion of Variance 1.097064e-18 8.184013e-19        0        0        0
Cumulative Proportion  1.000000e+00 1.000000e+00        1        1        1
                       Comp.177 Comp.178 Comp.179 Comp.180 Comp.181 Comp.182
Standard deviation            0        0        0        0        0        0
Proportion of Variance        0        0        0        0        0        0
Cumulative Proportion         1        1        1        1        1        1
                       Comp.183 Comp.184 Comp.185 Comp.186 Comp.187 Comp.188
Standard deviation            0        0        0        0        0        0
Proportion of Variance        0        0        0        0        0        0
Cumulative Proportion         1        1        1        1        1        1
                       Comp.189 Comp.190 Comp.191 Comp.192 Comp.193 Comp.194
Standard deviation            0        0        0        0        0        0
Proportion of Variance        0        0        0        0        0        0
Cumulative Proportion         1        1        1        1        1        1
                       Comp.195
Standard deviation            0
Proportion of Variance        0
Cumulative Proportion         1

We have 195 numeric source variables, so we get 195 principal components. The first 15 components explain 75% of the variance, the first 40 explain 90%, and the first 57 explain 95%. That means just over a quarter of our original feature count explains nearly all of the total variance.

PCA did a good job at 1) reducing dimensionality and 2) eliminating any collinearity of features.
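The 95% cutoff can also be found programmatically rather than by scanning the summary output. A minimal sketch on toy data (not the article's dataset), assuming a `princomp()` fit like the one above:

```r
# Toy sketch: number of principal components needed to retain 95% of variance
set.seed(1)
toy <- as.data.frame(matrix(rnorm(200 * 10), ncol = 10))
pca <- princomp(toy)

# Cumulative proportion of variance, same as summary()'s bottom row
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep  <- which(cum_var >= 0.95)[1]  # first component crossing 95%
```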

Let’s save our results:

nba_pca_data <- as.data.frame(nba_pca$scores)

Let’s visualize the first two principal components:

ggplot(
    nba_pca_data,
    aes(Comp.1, Comp.2)
) +
    geom_point()

Clustering

Let’s try to cluster these observations.

Historically, basketball has used 5 positions. In modern times, this has been reduced to approximately 3. Let’s use 3 as our minimum and 20 as our maximum. Let’s try a few different clustering techniques.

clustering_grid <- data.frame(
    clusters = 3:20
)

Against each of these, we can run different clustering algorithms and compute the “within-cluster sum of squares” (WSS). This will help us determine the proper number of clusters derived from the data.
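As a quick refresher, this metric is just the sum of squared distances from each point to its cluster’s centroid. A self-contained sketch on toy data (not the article’s), verifying the by-hand calculation against `kmeans()`’s `tot.withinss`:

```r
# Toy sketch: total within-cluster sum of squares computed by hand,
# then compared with kmeans()'s tot.withinss
set.seed(3)
toy <- matrix(rnorm(60 * 2), ncol = 2)
km  <- kmeans(toy, centers = 4, nstart = 10)

wss_manual <- sum(sapply(1:4, function(j) {
    pts <- toy[km$cluster == j, , drop = FALSE]
    sum(sweep(pts, 2, colMeans(pts))^2)  # squared deviations from centroid
}))

all.equal(wss_manual, km$tot.withinss)
```

This works because, at convergence, each k-means center is the mean of the points assigned to it.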

Partition Clustering

Let’s set up a function to run kmeans for each cluster count in the above grid and generate the respective performance metric:

cluster_kmeans <- function(k, data) {
    # Fit k-means with k centers; report the total within-cluster sum of squares
    fit <- kmeans(data, k)
    vals <- glance(fit)
    return(vals$tot.withinss)
}
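One caveat worth noting: `kmeans()` starts from random centers, so a single run per `k` can make the elbow curve jagged. The `nstart` argument (an option not used in the function above) keeps the best of several random starts. A toy sketch:

```r
# Toy sketch: nstart runs kmeans from multiple random initializations and
# keeps the solution with the lowest total within-cluster SS
set.seed(42)
toy <- matrix(rnorm(300 * 5), ncol = 5)

wss_single <- kmeans(toy, centers = 8, nstart = 1)$tot.withinss
wss_multi  <- kmeans(toy, centers = 8, nstart = 25)$tot.withinss
```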

Hierarchical Clustering

Let’s do the same thing but with hierarchical clustering, via tidyclust’s hier_clust():

cluster_hclust <- function(k, data) {
    # Run the algorithm
    model <- hier_clust(num_clusters = k, linkage_method = "complete")
    fit <- model |> fit(~., data = as.data.frame(data))

    wss <- fit |>
        sse_within() |>
        select(wss) |>
        unlist() |>
        sum()
    return(wss)
}
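For comparison, the same quantity can be computed without tidyclust, using base R’s `hclust()` and `cutree()`; a toy sketch under that assumption:

```r
# Toy sketch: within-cluster sum of squares for an hclust tree cut at k = 5
set.seed(7)
toy <- matrix(rnorm(100 * 4), ncol = 4)

hc       <- hclust(dist(toy), method = "complete")
clusters <- cutree(hc, k = 5)

wss <- sum(sapply(split(as.data.frame(toy), clusters), function(grp) {
    sum(scale(grp, scale = FALSE)^2)  # squared deviations from group means
}))
```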

Let’s now generate our clusters!

Cluster Results

Let’s map over the number of clusters and execute the respective algorithm.

clustering_grid_01 <-
    clustering_grid |>
    mutate(
        kmeans = map(clusters, ~ cluster_kmeans(.x, nba_scaled)),
        hclust = map(clusters, ~ cluster_hclust(.x, nba_scaled))
    )

We can now plot these and find the “elbow”, or the point of diminishing returns from increasing the number of clusters.

clustering_grid_01 |>
    pivot_longer(cols = -clusters) |>
    unnest(value) |>
    ggplot(aes(factor(clusters), as.numeric(value))) +
    geom_line(aes(color = name), group = 1) +
    facet_wrap(~name, ncol = 1, scales = "free")

hclust gives the impression that around 10 is the right number of clusters, though the elbow is difficult to identify. kmeans suggests 9 or 10.

Let’s see if we get different results using just the first 57 principal components.

clustering_grid_02 <-
    clustering_grid |>
    mutate(
        kmeans = map(clusters, ~ cluster_kmeans(.x, nba_pca_data[, 1:57])),
        hclust = map(clusters, ~ cluster_hclust(.x, nba_pca_data[, 1:57]))
    )
clustering_grid_02 |>
    pivot_longer(cols = -clusters) |>
    unnest(value) |>
    ggplot(aes(factor(clusters), as.numeric(value))) +
    geom_line(aes(color = name), group = 1) +
    facet_wrap(~name, ncol = 1, scales = "free")

The same algorithms run on the first 57 principal components indicate somewhere around 9 to 10. Let’s proceed with the PCA results and assume 10 clusters. The kmeans curve is also a little smoother, so let’s default to that algorithm.

set.seed(2015)
fit <- kmeans(nba_pca_data[, 1:57], 10)

nba_ids$cluster <- factor(fit$cluster)
nba_adjusted_full <- nba_adjusted |> mutate(cluster = factor(fit$cluster))
nba_scaled_full <- as.data.frame(nba_scaled) |> mutate(cluster = factor(fit$cluster))

nba_ids |>
    count(cluster) |>
    mutate(prop = n / sum(n))
   cluster  n  prop
1        1 37 0.074
2        2 64 0.128
3        3 50 0.100
4        4 38 0.076
5        5 52 0.104
6        6 58 0.116
7        7 99 0.198
8        8 59 0.118
9        9 28 0.056
10      10 15 0.030

The initial results seem fairly reasonable. Understandably, some clusters (or as we would interpret, “positions”/“roles”/“styles”) have more players than others given the nature of the game.

Let’s evaluate some specific players and get a sense for the results.

Cluster Evaluation

The first example deals with 4 players typically thought of as “centers”. Their physical profiles are somewhat similar but there are significant differences in style and role. We should probably see three different cluster assignments.

nba_ids |> filter(
    profile_name %in% c(
        "Victor Wembanyama",
        "Nikola Jokic",
        "Clint Capela",
        "Rudy Gobert"
    )
)
       profile_name profile_team_abbreviation profile_games_played
1      Nikola Jokic                       DEN                  313
2       Rudy Gobert                       MIN                  306
3      Clint Capela                       ATL                  300
4 Victor Wembanyama                       SAS                   90
  profile_minutes profile_height_in profile_weight_lbs cluster
1           10738                83                284      10
2            9837                85                258       9
3            8121                82                256       9
4            2717                87                235      10

We see Capela and Gobert share a cluster assignment, which makes sense, but Jokic and Wembanyama are also assigned to the same cluster. Given their offensive games, this could make sense, as most of their differentiators are on the defensive end.

Let’s try another. These three all have similar physical profiles, roles, and play styles. Let’s see how they are clustered.

nba_ids |> filter(
    profile_name %in% c(
        "Jimmy Butler",
        "Jayson Tatum",
        "Jaylen Brown"
    )
)
  profile_name profile_team_abbreviation profile_games_played profile_minutes
1 Jayson Tatum                       BOS                  311           11237
2 Jaylen Brown                       BOS                  280            9653
3 Jimmy Butler                       MIA                  250            8402
  profile_height_in profile_weight_lbs cluster
1                80                210       4
2                78                223       4
3                79                230       4

We see they all fall into the same cluster! This gives some assurance that the clustering is capturing some of the inherent patterns.

What about players who’ve historically been labeled “point guards”? Each of these plays so differently that we should see completely different cluster assignments.

nba_ids |> filter(
    profile_name %in% c(
        "Collin Sexton",
        "Bruce Brown",
        "Stephen Curry",
        "Jose Alvarado"
    )
)
   profile_name profile_team_abbreviation profile_games_played profile_minutes
1 Stephen Curry                       GSW                  275            9275
2   Bruce Brown                       TOR                  284            7370
3 Collin Sexton                       UTA                  220            6293
4 Jose Alvarado                       NOP                  182            3454
  profile_height_in profile_weight_lbs cluster
1                74                185       4
2                76                202       2
3                75                190       4
4                73                179       6

All different except for Curry and Sexton. We’ll have to dig into that cluster more closely to understand why. So far it’s tracking pretty close to what a contextual lens might suggest.

Let’s look at some of the most physically dissimilar players that share the same cluster.

getMinMax <- function(data, cluster = 1) {
    # Subset to the cluster, then average the z-scores of height and
    # weight into a single "size" score
    data_f <- data[data$cluster == cluster, ]
    data_f$val <- (scale(data_f$profile_height_in) + scale(data_f$profile_weight_lbs)) / 2
    data_f <- data_f |> arrange(desc(val))

    # Return the physically smallest and largest players in the cluster
    return(c(
        data_f$profile_name[nrow(data_f)],
        data_f$profile_name[1]
    ))
}
for (c in sort(unique(nba_ids$cluster))) {
    players <- getMinMax(nba_ids, c)
    print(paste(
        c, "- Min:", players[1],
        "| Max:", players[2]
    ))
}
[1] "1 - Min: Scotty Pippen Jr. | Max: Ben Simmons"
[1] "2 - Min: Bruce Brown | Max: Brook Lopez"
[1] "3 - Min: Seth Curry | Max: Danilo Gallinari"
[1] "4 - Min: Trae Young | Max: Paolo Banchero"
[1] "5 - Min: Ausar Thompson | Max: Robin Lopez"
[1] "6 - Min: D.J. Augustin | Max: Joe Ingles"
[1] "7 - Min: Johnny Davis | Max: Maxi Kleber"
[1] "8 - Min: Isaiah Joe | Max: Mike Muscala"
[1] "9 - Min: Isaiah Jackson | Max: JaVale McGee"
[1] "10 - Min: Domantas Sabonis | Max: Jusuf Nurkic"

Generally speaking, these make a lot of sense. The next step would be to analyze each cluster and come up with unique labels for them that describe the new position/role/style.

Cluster Naming

Clusters are labeled arbitrarily. There’s nothing intuitive about labels 1, 2, etc. We need to give these clusters meaning by assigning each a descriptive label.

We’ll do that two ways:

  1. We’ll create a penalized logistic regression model for each individual cluster and select the highest absolute value of the coefficients. In this way, we can understand some of the predictors that define the cluster.

  2. We’ll leverage AI to use what it knows about the players in each cluster to give label suggestions.

We’ll pool these perspectives together to generate our own label. We’ll save the label in the following table:

cluster_labels <- tibble(
    cluster = factor(1:10),
    label = NA_character_,
    abbrev = NA_character_
)

Guidelines

We want to shy away from traditional language: “guard”, “forward”, “center”. Even labels like “backcourt” and “frontcourt” can potentially pigeonhole a group of players in unhelpful ways. We may also want to avoid modern terms like “wing” and “big”.

This puts more emphasis on style of play and role than physical profile or position.

Top Features Model

Here’s the function that will ingest our dataset and classify against a binary target (1 = cluster of interest, 0 = all other clusters).

We’ll perform cross-validation, take the best model, fit it on the entire data set, and take the top coefficients.

get_elasnet_top_features <- function(data) {
    # Configure recipe
    mod_rec <- recipe(target ~ ., data)

    # Setup cross-validation folds
    mod_cv <- rsample::vfold_cv(data, v = 5)

    # Configure tuning grid
    mod_tune_grid <- grid_regular(
        penalty(),
        mixture(),
        levels = 4
    )

    # Setup model definition
    mod_def <- logistic_reg(
        mixture = tune(),
        penalty = tune()
    ) |>
        set_engine("glmnet")

    # Configure workflow
    mod_wflw <-
        workflow() |>
        add_model(mod_def) |>
        add_recipe(mod_rec)

    # Run cross-validated tuning
    set.seed(814)
    mod_tune <-
        mod_wflw |>
        tune_grid(
            resamples = mod_cv,
            grid = mod_tune_grid,
            metrics = metric_set(roc_auc)
        )

    # Select & fit best model
    best_mod <- mod_tune |> select_best(metric = "roc_auc")
    final_wflw <- mod_wflw |> finalize_workflow(best_mod)
    final_fit <- fit(final_wflw, data = data)

    # Capture the top predictors by absolute value of coefficient
    tidy(final_fit) |>
        arrange(desc(abs(estimate))) |>
        slice_head(n = 10) |>
        select(-penalty) |>
        print()
}

We’ll call this for each cluster.

We’ll set up some parallelization for this:

set.seed(729)
# Define parallelization
cores_target <- ceiling(parallel::detectCores() * 0.75)
doParallel::registerDoParallel(cores = cores_target)
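Once the tuning runs finish, it’s good hygiene to hand execution back to a sequential backend; `registerDoSEQ()` from the foreach package (which doParallel builds on) does this. A small sketch:

```r
# After parallel work completes, revert foreach to sequential execution
foreach::registerDoSEQ()
```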

Artificial Intelligence

We’re going to let AI suggest some labels. All it will see is 1) our prompt and 2) the player names pertaining to the cluster. This should lead to a less biased approach.

This is the prompt we’ll use (along with the list of names) against OpenAI’s ChatGPT 4o model:

Below are a list of recent NBA player names. Assume these players belong in a collective group based on their play style and role. Generate 5 unique suggestions for a group label that is short and sweet but descriptive of the group. Restrict evaluation to style and role; avoid analysis rooted in reputation, playing time, etc.

Cluster #1

Player names:

nba_ids[nba_ids$cluster == "1", ]$profile_name
 [1] "Russell Westbrook"   "Draymond Green"      "Kyle Anderson"      
 [4] "Cole Anthony"        "Josh Giddey"         "T.J. McConnell"     
 [7] "Tre Jones"           "Killian Hayes"       "Talen Horton-Tucker"
[10] "Jalen Suggs"         "Jaden Ivey"          "Ben Simmons"        
[13] "Theo Maledon"        "Dennis Smith Jr."    "Ricky Rubio"        
[16] "Markelle Fultz"      "Ish Smith"           "R.J. Hampton"       
[19] "Kris Dunn"           "Derrick Rose"        "Dalano Banton"      
[22] "Tomas Satoransky"    "Scoot Henderson"     "Elfrid Payton"      
[25] "Jordan Goodwin"      "Josh Christopher"    "Saben Lee"          
[28] "Trent Forrest"       "Blake Wesley"        "Keon Johnson"       
[31] "Rajon Rondo"         "Vasilije Micic"      "Brad Wanamaker"     
[34] "Daishen Nix"         "Jared Butler"        "Scotty Pippen Jr."  
[37] "Brandon Goodwin"    

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "1", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                        estimate
   <chr>                                          <dbl>
 1 (Intercept)                                   -6.99 
 2 fouls_transition_take_fouls_drawn              0.385
 3 misc_on_off_rtg                               -0.344
 4 shot_short_mid_range_frequency                 0.333
 5 turnovers_lost_ball_out_of_bounds_turnovers    0.301
 6 second_second_chance_turnovers                 0.300
 7 misc_plus_minus                               -0.273
 8 fouls_clear_path_fouls                         0.271
 9 misc_blocked_corner3                           0.265
10 free_technical_free_throw_trips               -0.263

These top features are interesting. There’s clear evidence of a) athleticism and skill in the open court, b) some tendency toward mistakes, and c) below-average overall impact.

ChatGPT generated the following label suggestions:

  • Playmaking Hustlers
  • Versatile Initiators
  • Dynamic Facilitators
  • Crafty Drivers
  • Hybrid Creators

I’m somewhat drawn to words like “initiator” and “hustler”. I don’t see any evidence of “facilitator” or “creator”. There’s a good mix of physical profile and style. Let’s go with “Versatile Anchor”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "1", 2] <- "Versatile Anchor"
cluster_labels[cluster_labels$cluster == "1", 3] <- "VA"

Cluster #2

Player names:

nba_ids[nba_ids$cluster == "2", ]$profile_name
 [1] "Nikola Vucevic"        "Pascal Siakam"         "Tobias Harris"        
 [4] "Josh Hart"             "Keldon Johnson"        "Kyle Kuzma"           
 [7] "Aaron Gordon"          "John Collins"          "Deni Avdija"          
[10] "Lauri Markkanen"       "Scottie Barnes"        "Bobby Portis"         
[13] "Brook Lopez"           "Miles Bridges"         "Bruce Brown"          
[16] "Myles Turner"          "Michael Porter Jr."    "Jaren Jackson Jr."    
[19] "Rui Hachimura"         "Kelly Olynyk"          "Naz Reid"             
[22] "KJ Martin"             "Christian Wood"        "Jae'Sean Tate"        
[25] "Jabari Smith Jr."      "Jalen Williams"        "Jonathan Kuminga"     
[28] "Larry Nance Jr."       "Moritz Wagner"         "Trey Lyles"           
[31] "Kevin Love"            "Bennedict Mathurin"    "Darius Bazley"        
[34] "Santi Aldama"          "Jeremy Sochan"         "Jalen Johnson"        
[37] "Zach Collins"          "Hamidou Diallo"        "Aleksej Pokusevski"   
[40] "Keita Bates-Diop"      "Chimezie Metu"         "Trendon Watford"      
[43] "JaMychal Green"        "Dario Saric"           "Isaiah Roby"          
[46] "Blake Griffin"         "Chet Holmgren"         "Josh Jackson"         
[49] "Otto Porter Jr."       "Bol Bol"               "Justise Winslow"      
[52] "Serge Ibaka"           "Anthony Gill"          "Jaylin Williams"      
[55] "Nemanja Bjelica"       "LaMarcus Aldridge"     "Sandro Mamukelashvili"
[58] "Paul Millsap"          "Eric Paschall"         "Duop Reath"           
[61] "David Nwaba"           "Gorgui Dieng"          "Frank Kaminsky"       
[64] "Eugene Omoruyi"       

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "2", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                  estimate
   <chr>                                    <dbl>
 1 (Intercept)                             -4.51 
 2 second_second_chance_corner3frequency   -0.703
 3 turnovers_x3second_violations           -0.669
 4 second_second_chance_at_rim_frequency    0.624
 5 second_second_chance_arc3frequency      -0.561
 6 free_technical_free_throw_trips         -0.555
 7 shot_long_mid_range_pct_blocked          0.443
 8 shot_at_rim_accuracy                     0.414
 9 fouls_charge_fouls_drawn                 0.409
10 scoring_fg2a_blocked                     0.384

What first catches my eye among these features are second_second_chance_at_rim_frequency and shot_at_rim_accuracy. These are players oriented near the basket. Next, what catches my eye are the second chance 3-point variations with negative coefficients: lower frequency on second-chance threes, but not necessarily lower on 3s overall.

ChatGPT generated the following label suggestions:

  • Versatile Wings
  • Dynamic Bigs
  • Stretch Forwards
  • Two-Way Frontcourt
  • Hybrid Playmakers

Comparing these options to the player list, I’m drawn to the “Hybrid Playmakers” label, but it’s not very descriptive. We again have a good mix of physical profile and style, so let’s settle on “Versatile Finisher”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "2", 2] <- "Versatile Finisher"
cluster_labels[cluster_labels$cluster == "2", 3] <- "VF"

Cluster #3

Player names:

nba_ids[nba_ids$cluster == "3", ]$profile_name
 [1] "Mikal Bridges"        "RJ Barrett"           "Terry Rozier"        
 [4] "Gary Trent Jr."       "Saddiq Bey"           "Luguentz Dort"       
 [7] "Andrew Wiggins"       "Tyler Herro"          "Franz Wagner"        
[10] "Jalen Green"          "Jerami Grant"         "Dillon Brooks"       
[13] "Norman Powell"        "Bojan Bogdanovic"     "Desmond Bane"        
[16] "Bogdan Bogdanovic"    "Devin Vassell"        "Eric Gordon"         
[19] "De'Andre Hunter"      "Josh Richardson"      "Alec Burks"          
[22] "Klay Thompson"        "Seth Curry"           "Gordon Hayward"      
[25] "Marcus Morris Sr."    "Lonnie Walker IV"     "Will Barton"         
[28] "Evan Fournier"        "Terrence Ross"        "Danilo Gallinari"    
[31] "Carmelo Anthony"      "Malaki Branham"       "Jaylen Nowell"       
[34] "Jordan Nwora"         "Shaedon Sharpe"       "Rudy Gay"            
[37] "Chris Duarte"         "Furkan Korkmaz"       "Brandon Miller"      
[40] "Kendrick Nunn"        "Terence Davis"        "Jaden Hardy"         
[43] "Brandon Boston Jr."   "Dwayne Bacon"         "Jeremy Lamb"         
[46] "AJ Griffin"           "Jordan Hawkins"       "Duane Washington Jr."
[49] "Denzel Valentine"     "GG Jackson II"       

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "3", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                 estimate
   <chr>                                   <dbl>
 1 (Intercept)                            -6.45 
 2 misc_at_rim_off_rebounded_pct           0.560
 3 assists_long_mid_range_assists         -0.507
 4 fouls_charge_fouls_drawn               -0.475
 5 second_second_chance_at_rim_accuracy    0.445
 6 shot_corner3frequency                  -0.436
 7 shot_short_mid_range_pct_assisted      -0.428
 8 assists_arc3assists                    -0.415
 9 shot_heave_attempts                     0.410
10 assists_three_pt_assists               -0.392

What catches my eye: lower assist-related features than other groups, but more rebounding and some hustle indicators.

ChatGPT generated the following label suggestions:

  • Scoring Wings
  • Perimeter Playmakers
  • Versatile Shooters
  • Dynamic Swingmen
  • Offensive Engines

I see all of these words in this group to some extent. The combination that seems most interesting is “versatile” plus “engine”, so we’ll go with “Versatile Engine”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "3", 2] <- "Versatile Engine"
cluster_labels[cluster_labels$cluster == "3", 3] <- "VE"

Cluster #4

Player names:

nba_ids[nba_ids$cluster == "4", ]$profile_name
 [1] "Jayson Tatum"            "Anthony Edwards"        
 [3] "DeMar DeRozan"           "Dejounte Murray"        
 [5] "Luka Doncic"             "De'Aaron Fox"           
 [7] "Trae Young"              "Jalen Brunson"          
 [9] "Devin Booker"            "Jaylen Brown"           
[11] "James Harden"            "Darius Garland"         
[13] "Stephen Curry"           "Donovan Mitchell"       
[15] "Shai Gilgeous-Alexander" "Damian Lillard"         
[17] "LeBron James"            "Zach LaVine"            
[19] "Jordan Poole"            "Chris Paul"             
[21] "Jimmy Butler"            "Brandon Ingram"         
[23] "Kevin Durant"            "Kyrie Irving"           
[25] "Jordan Clarkson"         "Paul George"            
[27] "Bradley Beal"            "Khris Middleton"        
[29] "Ja Morant"               "LaMelo Ball"            
[31] "Jamal Murray"            "Collin Sexton"          
[33] "Kawhi Leonard"           "Paolo Banchero"         
[35] "Cade Cunningham"         "Kevin Porter Jr."       
[37] "Cam Thomas"              "John Wall"              

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "4", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                            estimate
   <chr>                              <dbl>
 1 (Intercept)                      -3.42  
 2 scoring_pts_unassisted2s          0.0598
 3 scoring_pts_unassisted3s          0.0577
 4 free_technical_free_throw_trips   0.0566
 5 misc_first_chance_points          0.0537
 6 scoring_points                    0.0517
 7 shot_long_mid_range_fgm           0.0509
 8 scoring_usage                     0.0507
 9 shot_long_mid_range_fga           0.0492
10 scoring_ft_points                 0.0471

This group is pretty clear. Scoring, creating, playmaking.

ChatGPT generated the following label suggestions:

  • Elite Creators
  • Dynamic Scorers
  • Playmaking Stars
  • Offensive Leaders
  • All-Around Playmakers

To borrow a word from a previous cluster, I like the word “engine”. These are the players that engage the team’s “drivetrain”, so to speak. Let’s go with “Perimeter Engine”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "4", 2] <- "Perimeter Engine"
cluster_labels[cluster_labels$cluster == "4", 3] <- "PE"

Cluster #5

Player names:

nba_ids[nba_ids$cluster == "5", ]$profile_name
 [1] "Jarrett Allen"          "Deandre Ayton"          "Evan Mobley"           
 [4] "Isaiah Stewart"         "Wendell Carter Jr."     "Nic Claxton"           
 [7] "Kevon Looney"           "Chris Boucher"          "Isaiah Hartenstein"    
[10] "Jarred Vanderbilt"      "Onyeka Okongwu"         "Precious Achiuwa"      
[13] "Dwight Powell"          "Drew Eubanks"           "Marvin Bagley III"     
[16] "Brandon Clarke"         "Mo Bamba"               "Xavier Tillman Sr."    
[19] "Jaxson Hayes"           "Daniel Theis"           "Montrezl Harrell"      
[22] "Richaun Holmes"         "Thaddeus Young"         "Jalen Smith"           
[25] "Nick Richards"          "Goga Bitadze"           "Paul Reed"             
[28] "James Wiseman"          "Tari Eason"             "Taj Gibson"            
[31] "Khem Birch"             "Luke Kornet"            "Jabari Walker"         
[34] "Jock Landale"           "Zeke Nnaji"             "Thomas Bryant"         
[37] "Alex Len"               "Robin Lopez"            "Damian Jones"          
[40] "Nerlens Noel"           "Enes Freedom"           "Amen Thompson"         
[43] "Wenyen Gabriel"         "Dewayne Dedmon"         "Derrick Favors"        
[46] "Ausar Thompson"         "Jonathan Isaac"         "Thanasis Antetokounmpo"
[49] "Omer Yurtseven"         "Tony Bradley"           "Usman Garuba"          
[52] "Terry Taylor"          

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "5", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                       estimate
   <chr>                                         <dbl>
 1 (Intercept)                                  -7.40 
 2 shot_corner3pct_assisted                      0.497
 3 shot_arc3pct_assisted                         0.440
 4 shot_heave_attempts                           0.426
 5 scoring_assisted3s_pct                        0.392
 6 rebounds_self_o_reb_pct                      -0.319
 7 turnovers_bad_pass_out_of_bounds_turnovers   -0.310
 8 second_second_chance_fg3pct                   0.310
 9 turnovers_travels                             0.260
10 second_second_chance_arc3frequency           -0.254

What’s interesting about this group’s features, compared to the list of players, is how much of their scoring is assisted and their propensity for 3-point field goals.

ChatGPT generated the following label suggestions:

  • Rim Protectors
  • Paint Enforcers
  • Dynamic Bigs
  • Post Specialists
  • Interior Anchors

“Dynamic Bigs” is the only example I like, but we’re trying to shy away from physical profiles and stick with style/role. Let’s target “Interior Connector”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "5", 2] <- "Interior Connector"
cluster_labels[cluster_labels$cluster == "5", 3] <- "IC"

Cluster #6

Player names:

nba_ids[nba_ids$cluster == "6", ]$profile_name
 [1] "Fred VanVleet"            "Tyrese Haliburton"       
 [3] "Jrue Holiday"             "Coby White"              
 [5] "Dennis Schroder"          "CJ McCollum"             
 [7] "Tyrese Maxey"             "Derrick White"           
 [9] "D'Angelo Russell"         "Mike Conley"             
[11] "Caris LeVert"             "Kyle Lowry"              
[13] "Reggie Jackson"           "Immanuel Quickley"       
[15] "Spencer Dinwiddie"        "Tyus Jones"              
[17] "Anfernee Simons"          "Malik Monk"              
[19] "Marcus Smart"             "Austin Reaves"           
[21] "Malcolm Brogdon"          "De'Anthony Melton"       
[23] "Nickeil Alexander-Walker" "Monte Morris"            
[25] "Devonte' Graham"          "Delon Wright"            
[27] "Payton Pritchard"         "Davion Mitchell"         
[29] "Joe Ingles"               "Shake Milton"            
[31] "Cameron Payne"            "Gabe Vincent"            
[33] "Cory Joseph"              "Aaron Holiday"           
[35] "Andrew Nembhard"          "Tre Mann"                
[37] "Eric Bledsoe"             "Jose Alvarado"           
[39] "Jordan McLaughlin"        "Raul Neto"               
[41] "Lonzo Ball"               "Bones Hyland"            
[43] "Malachi Flynn"            "Ty Jerome"               
[45] "George Hill"              "Keyonte George"          
[47] "Goran Dragic"             "Facundo Campazzo"        
[49] "Brandin Podziemski"       "Kemba Walker"            
[51] "Victor Oladipo"           "Lou Williams"            
[53] "D.J. Augustin"            "Kira Lewis Jr."          
[55] "Frank Ntilikina"          "Marcus Sasser"           
[57] "Trey Burke"               "Skylar Mays"             

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "6", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                     estimate
   <chr>                                       <dbl>
 1 (Intercept)                                -6.23 
 2 fouls_charge_fouls_drawn                    0.496
 3 shot_long_mid_range_pct_assisted           -0.421
 4 fouls_transition_take_fouls_drawn          -0.387
 5 free_non_shooting_fouls_drawn              -0.381
 6 free_three_pt_shooting_fouls_drawn         -0.372
 7 shot_short_mid_range_pct_assisted          -0.371
 8 scoring_assisted2s_pct                     -0.354
 9 scoring_non_putbacks_assisted2s_pct        -0.339
10 second_second_chance_corner3pct_assisted    0.334

These features are interesting. What sticks out? a) their scoring isn’t really assisted, and b) there’s a unique approach to defense where they draw a lot of charges.
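As a refresher on what `get_elasnet_top_features()` is doing under the hood (the actual helper was defined earlier in the analysis), here’s a condensed, hypothetical sketch of the idea using `glmnet` directly on synthetic data:

```r
library(glmnet)

set.seed(814)
# Synthetic stand-in: 200 observations of 100 standardized features
X <- matrix(rnorm(200 * 100), nrow = 200,
            dimnames = list(NULL, paste0("feat_", 1:100)))
# Binary target driven by the first two features
y <- rbinom(200, 1, plogis(X[, 1] - X[, 2]))

# Elastic net logistic regression: alpha = 0.5 mixes lasso and ridge;
# cv.glmnet tunes lambda via cross-validation
fit <- cv.glmnet(X, y, family = "binomial", alpha = 0.5)

# Keep the non-zero coefficients and rank by absolute magnitude --
# these are the "top features" distinguishing the target
coefs <- coef(fit, s = "lambda.min")
nonzero <- coefs[coefs[, 1] != 0, 1]
head(sort(abs(nonzero), decreasing = TRUE), 10)
```

The real helper follows the same concept, but tunes both `alpha` and `lambda` within the tidymodels framework as described earlier.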

ChatGPT generated the following label suggestions:

  • Floor Generals
  • Playmaking Guards
  • Perimeter Orchestrators
  • Dynamic Ball Handlers
  • Backcourt Catalysts

Again, we’re trying to shy away from traditional terminology. I like “orchestrator” since it alludes to more responsibility than just “facilitator”; however, it’s mostly applied to traditional guard positions. Let’s go with “Perimeter Anchor” instead.

# Save cluster label
cluster_labels[cluster_labels$cluster == "6", 2] <- "Perimeter Anchor"
cluster_labels[cluster_labels$cluster == "6", 3] <- "PA"

Cluster #7

Player names:

nba_ids[nba_ids$cluster == "7", ]$profile_name
 [1] "Harrison Barnes"        "Royce O'Neale"          "Dorian Finney-Smith"   
 [4] "P.J. Washington"        "Jaden McDaniels"        "Isaac Okoro"           
 [7] "OG Anunoby"             "Kelly Oubre Jr."        "Terance Mann"          
[10] "Grant Williams"         "Ayo Dosunmu"            "Nicolas Batum"         
[13] "Al Horford"             "Herbert Jones"          "Caleb Martin"          
[16] "Patrick Williams"       "Alex Caruso"            "Jeff Green"            
[19] "Taurean Prince"         "P.J. Tucker"            "Matisse Thybulle"      
[22] "Patrick Beverley"       "Derrick Jones Jr."      "Torrey Craig"          
[25] "Robert Covington"       "Aaron Nesmith"          "Obi Toppin"            
[28] "Josh Green"             "Naji Marshall"          "Kenrich Williams"      
[31] "Maxi Kleber"            "John Konchar"           "Dean Wade"             
[34] "Troy Brown Jr."         "Jalen McDaniels"        "Cody Martin"           
[37] "Josh Okogie"            "Aaron Wiggins"          "Cam Reddish"           
[40] "Chuma Okeke"            "Oshae Brissett"         "Christian Braun"       
[43] "Ochai Agbaji"           "Ziaire Williams"        "Javonte Green"         
[46] "Lamar Stevens"          "Dyson Daniels"          "Nassir Little"         
[49] "Garrett Temple"         "Haywood Highsmith"      "Danuel House Jr."      
[52] "Moses Moody"            "Juan Toscano-Anderson"  "Jeremiah Robinson-Earl"
[55] "David Roddy"            "Gary Payton II"         "Stanley Johnson"       
[58] "Yuta Watanabe"          "Jaime Jaquez Jr."       "James Johnson"         
[61] "Toumani Camara"         "Bilal Coulibaly"        "Peyton Watson"         
[64] "Kevin Knox II"          "Markieff Morris"        "Andre Iguodala"        
[67] "JT Thor"                "DeAndre' Bembry"        "Juancho Hernangomez"   
[70] "Max Christie"           "Romeo Langford"         "Sterling Brown"        
[73] "Kent Bazemore"          "Jake LaRavia"           "Anthony Black"         
[76] "Bryce McGowens"         "Kessler Edwards"        "Solomon Hill"          
[79] "Rodney Hood"            "Maurice Harkless"       "Anthony Lamb"          
[82] "Kris Murray"            "Edmond Sumner"          "Vince Williams Jr."    
[85] "PJ Dozier"              "Vit Krejci"             "Vlatko Cancar"         
[88] "CJ Elleby"              "Semi Ojeleye"           "Nikola Jovic"          
[91] "MarJon Beauchamp"       "Ish Wainright"          "Trevor Ariza"          
[94] "Jalen Wilson"           "Dalen Terry"            "Dante Exum"            
[97] "Johnny Davis"           "Ousmane Dieng"          "Joshua Primo"          

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "7", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                  estimate
   <chr>                                    <dbl>
 1 (Intercept)                             -4.12 
 2 free_two_pt_shooting_fouls_drawn_pct     0.413
 3 second_second_chance_shot_quality_avg    0.378
 4 scoring_efg_pct                         -0.359
 5 second_second_chance_arc3pct_assisted    0.353
 6 scoring_ts_pct                          -0.334
 7 shot_long_mid_range_pct_blocked         -0.330
 8 shot_avg2pt_shot_distance               -0.314
 9 second_second_chance_at_rim_frequency    0.282
10 second_second_chance_arc3frequency      -0.262

What sticks out is that these aren’t efficient scorers. But there is some shot creation and getting to the free-throw line that’s interesting.

ChatGPT generated the following label suggestions:

  • Two-Way Wings
  • Defensive Specialists
  • Versatile Role Players
  • Perimeter Stoppers
  • Glue Guys

There are certainly some good defenders in this list but that’s not the primary thing here. “Two-Way” could be good. “Glue” and “connector” are intriguing words. I think the point is they do a bit of everything, but specialize in non-scoring events. Let’s go with “Versatile Connector”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "7", 2] <- "Versatile Connector"
cluster_labels[cluster_labels$cluster == "7", 3] <- "VC"

Cluster #8

Player names:

nba_ids[nba_ids$cluster == "8", ]$profile_name
 [1] "Buddy Hield"              "Kentavious Caldwell-Pope"
 [3] "Kevin Huerter"            "Malik Beasley"           
 [5] "Tim Hardaway Jr."         "Grayson Allen"           
 [7] "Donte DiVincenzo"         "Duncan Robinson"         
 [9] "Georges Niang"            "Cameron Johnson"         
[11] "Pat Connaughton"          "Reggie Bullock Jr."      
[13] "Max Strus"                "Corey Kispert"           
[15] "Keegan Murray"            "Justin Holiday"          
[17] "Cedi Osman"               "Luke Kennard"            
[19] "Gary Harris"              "Trey Murphy III"         
[21] "Doug McDermott"           "Patty Mills"             
[23] "Jae Crowder"              "Garrison Mathews"        
[25] "Amir Coffey"              "Jevon Carter"            
[27] "Quentin Grimes"           "Landry Shamet"           
[29] "Isaiah Joe"               "Joe Harris"              
[31] "Damion Lee"               "Sam Hauser"              
[33] "Danny Green"              "Davis Bertans"           
[35] "Wesley Matthews"          "Austin Rivers"           
[37] "Svi Mykhailiuk"           "Miles McBride"           
[39] "Bryn Forbes"              "Simone Fontecchio"       
[41] "Mike Muscala"             "Julian Champagnie"       
[43] "Cason Wallace"            "Ben McLemore"            
[45] "Isaiah Livers"            "Avery Bradley"           
[47] "Frank Jackson"            "Gradey Dick"             
[49] "Sam Merrill"              "Wayne Ellington"         
[51] "Tony Snell"               "Timothe Luwawu-Cabarrot" 
[53] "Caleb Houstan"            "Lindy Waters III"        
[55] "Keon Ellis"               "Rodney McGruder"         
[57] "Armoni Brooks"            "AJ Green"                
[59] "Ben Sheppard"            

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "8", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                  estimate
   <chr>                                    <dbl>
 1 (Intercept)                            -3.00  
 2 shot_corner3fgm                         0.0538
 3 scoring_pts_assisted3s                  0.0534
 4 second_second_chance_corner3frequency   0.0480
 5 scoring_fg3a_pct                        0.0463
 6 shot_corner3fga                         0.0451
 7 scoring_assisted2s_pct                  0.0443
 8 shot_at_rim_pct_assisted                0.0425
 9 second_second_chance_corner3fgm         0.0419
10 second_second_chance_fg3m               0.0415

What clearly sticks out is the orientation toward perimeter activity.

ChatGPT generated the following label suggestions:

  • Catch-and-Shoot Crew
  • Perimeter Marksmen
  • Wing Snipers
  • Spot-Up Specialists
  • Floor Spacers

I think I like “Perimeter Finisher” best. It fits with some of the other language we’ve used too so it’s cohesive.

# Save cluster label
cluster_labels[cluster_labels$cluster == "8", 2] <- "Perimeter Finisher"
cluster_labels[cluster_labels$cluster == "8", 3] <- "PF"

Cluster #9

Player names:

nba_ids[nba_ids$cluster == "9", ]$profile_name
 [1] "Rudy Gobert"          "Ivica Zubac"          "Clint Capela"        
 [4] "Jakob Poeltl"         "Mason Plumlee"        "Daniel Gafford"      
 [7] "Andre Drummond"       "Mitchell Robinson"    "Steven Adams"        
[10] "Jalen Duren"          "Robert Williams III"  "Walker Kessler"      
[13] "Bismack Biyombo"      "DeAndre Jordan"       "Tristan Thompson"    
[16] "JaVale McGee"         "Isaiah Jackson"       "Jericho Sims"        
[19] "Dwight Howard"        "Willy Hernangomez"    "Day'Ron Sharpe"      
[22] "Cody Zeller"          "Moses Brown"          "Dereck Lively II"    
[25] "Hassan Whiteside"     "Bruno Fernando"       "Trayce Jackson-Davis"
[28] "Mark Williams"       

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "9", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                                  estimate
   <chr>                                    <dbl>
 1 (Intercept)                            -10.5  
 2 shot_corner3pct_assisted                -0.240
 3 rebounds_self_o_reb_pct                  0.210
 4 second_second_chance_fg3pct             -0.190
 5 shot_avg3pt_shot_distance                0.177
 6 turnovers_x3second_violations            0.166
 7 second_second_chance_arc3pct_assisted   -0.160
 8 rebounds_self_o_reb                      0.158
 9 shot_unblocked_corner3accuracy          -0.158
10 shot_corner3accuracy                    -0.157

In the opposite vein to above, this group sticks out for their presence on the interior.

ChatGPT generated the following label suggestions:

  • Rim Protectors
  • Paint Guardians
  • Defensive Anchors
  • Rebounding Specialists
  • Post Defenders

Let’s go with “Interior Anchor”. This helps describe the role on both ends of the floor. If we were just focused on “offense”, we would classify as “Interior Finisher”.

# Save cluster label
cluster_labels[cluster_labels$cluster == "9", 2] <- "Interior Anchor"
cluster_labels[cluster_labels$cluster == "9", 3] <- "IA"

Cluster #10

Player names:

nba_ids[nba_ids$cluster == "10", ]$profile_name
 [1] "Domantas Sabonis"      "Nikola Jokic"          "Julius Randle"        
 [4] "Bam Adebayo"           "Giannis Antetokounmpo" "Jonas Valanciunas"    
 [7] "Anthony Davis"         "Karl-Anthony Towns"    "Joel Embiid"          
[10] "Kristaps Porzingis"    "Alperen Sengun"        "Jusuf Nurkic"         
[13] "Zion Williamson"       "Victor Wembanyama"     "DeMarcus Cousins"     

Let’s pull the top features explaining this cluster assignment:

target_cluster_df <- nba_scaled_full |>
    mutate(target = factor(ifelse(cluster == "10", 1, 0))) |>
    select(-cluster)

get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
   term                            estimate
   <chr>                              <dbl>
 1 (Intercept)                       -20.7 
 2 fouls_fouls_drawn                   4.26
 3 shot_arc3pct_assisted               2.54
 4 rebounds_def_at_rim_rebound_pct     2.34
 5 misc_at_rim_off_rebounded_pct       2.30
 6 misc_blocked_long_mid_range         1.86
 7 assists_arc3assists                 1.43
 8 turnovers_dead_ball_turnovers       1.13
 9 misc_blocked_corner3               -1.07
10 fouls_charge_fouls                  1.05

What sticks out in this group is 3P shooting with a bunch of rebounding and being in the center of the action.

ChatGPT generated the following label suggestions:

  • Skilled Bigs
  • Versatile Frontcourt
  • Playmaking Centers
  • Dominant Big Men
  • All-Around Bigs

Let’s go with “Interior Engine” for the moment. It describes this idea of engaging the team’s “drivetrain” from inside-out.

# Save cluster label
cluster_labels[cluster_labels$cluster == "10", 2] <- "Interior Engine"
cluster_labels[cluster_labels$cluster == "10", 3] <- "IE"

Now we can stop the parallelization:

doParallel::stopImplicitCluster()

New Position Labels

Summary

Here are our final clusters, with labels and abbreviations. We settled on distinguishing orientation of play and style with “interior”, “perimeter”, and “versatile” descriptors. Obviously this isn’t exclusive; a “Perimeter Engine” would certainly score and operate on the interior as well, but the label describes tendencies: “out-in” vs “in-out”.

Next we settled on 4 roles or styles: “connector”, “anchor”, “finisher”, and “engine”. Again, these aren’t exclusive but indicate where players lean in their overall style and assumed roles.

cluster_labels |>
    arrange(label)
# A tibble: 10 × 3
   cluster label               abbrev
   <fct>   <chr>               <chr> 
 1 9       Interior Anchor     IA    
 2 5       Interior Connector  IC    
 3 10      Interior Engine     IE    
 4 6       Perimeter Anchor    PA    
 5 4       Perimeter Engine    PE    
 6 8       Perimeter Finisher  PF    
 7 1       Versatile Anchor    VA    
 8 7       Versatile Connector VC    
 9 3       Versatile Engine    VE    
10 2       Versatile Finisher  VF    
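The scheme amounts to a 3 × 4 grid of orientation by role. A quick sanity check (a standalone snippet, with the assigned labels typed out inline) shows which combinations went unclaimed:

```r
library(tidyr)
library(dplyr)

# All possible orientation x role combinations
label_grid <- expand_grid(
    orientation = c("Interior", "Perimeter", "Versatile"),
    role = c("Connector", "Anchor", "Finisher", "Engine")
) |>
    mutate(label = paste(orientation, role))

# The 10 cluster labels assigned above
used_labels <- c(
    "Interior Anchor", "Interior Connector", "Interior Engine",
    "Perimeter Anchor", "Perimeter Engine", "Perimeter Finisher",
    "Versatile Anchor", "Versatile Connector", "Versatile Engine",
    "Versatile Finisher"
)

# 12 combinations are possible; two went unused
setdiff(label_grid$label, used_labels)
# [1] "Interior Finisher"   "Perimeter Connector"
```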

Let’s go through some exercises where we intersect these labels with the original data. Now that these labels have meaning, we can benchmark more easily against domain knowledge.

nba_adjusted_full <- nba_adjusted_full |>
    inner_join(cluster_labels) |>
    bind_cols(nba_ids)

Player Distributions

How many players across the NBA fall into these groups? We’d expect the “Engines” to have the fewest members and the “Versatile” groups to be the largest overall. Let’s see if that matches up with our preconceptions.

nba_adjusted_full |>
    group_by(label, abbrev) |>
    count() |>
    ungroup() |>
    mutate(perc = n / sum(n))
# A tibble: 10 × 4
   label               abbrev     n  perc
   <chr>               <chr>  <int> <dbl>
 1 Interior Anchor     IA        28 0.056
 2 Interior Connector  IC        52 0.104
 3 Interior Engine     IE        15 0.03 
 4 Perimeter Anchor    PA        58 0.116
 5 Perimeter Engine    PE        38 0.076
 6 Perimeter Finisher  PF        59 0.118
 7 Versatile Anchor    VA        37 0.074
 8 Versatile Connector VC        99 0.198
 9 Versatile Engine    VE        50 0.1  
10 Versatile Finisher  VF        64 0.128

Our theories held pretty well. “Engines” are the smallest of the Interior and Perimeter groups (and second smallest of the Versatile group). The “Versatile” group is the largest of them all.

League Distributions

We have labels for the “current” teams. Let’s see how many of each of these belong to each team.

NOTE: the data gathered represents only the top 500 players by minutes over the last 4.5 seasons; therefore, some teams will see fewer players than others, and some players don’t actively belong to a team

nba_adjusted_full |> 
    group_by(profile_team_abbreviation, abbrev) |>
    count() |>
    ungroup() |>
    arrange(abbrev) |>
    pivot_wider(
        names_from = abbrev, 
        values_from = n, 
        values_fill = 0
    ) |>
    gt::gt()
profile_team_abbreviation IA IC IE PA PE PF VA VC VE VF
ATL 1 1 0 1 1 3 1 4 4 2
BKN 1 2 0 2 1 2 3 4 1 3
CHA 1 2 0 2 1 1 2 3 2 2
CLE 1 3 0 3 2 3 3 3 1 0
DAL 2 1 0 4 2 1 0 6 2 0
DEN 1 1 2 0 1 1 1 3 0 3
DET 1 3 0 2 1 6 2 2 1 2
GSW 1 2 0 2 1 2 2 4 1 3
HOU 1 3 1 2 0 1 1 1 2 4
IND 2 1 0 3 0 1 1 3 0 3
LAC 1 1 0 2 4 1 1 7 1 0
LAL 1 2 1 5 1 2 0 5 1 2
MIN 1 0 1 3 1 3 1 2 0 1
NOP 2 1 1 2 2 2 1 3 3 0
NYK 2 1 1 1 1 1 0 1 2 1
OKC 1 2 0 0 1 3 0 4 2 3
PHI 1 2 1 3 1 1 0 4 2 2
PHX 1 1 1 2 3 2 2 4 2 1
POR 1 2 0 1 0 1 2 5 2 3
SAC 1 1 1 2 2 5 0 2 2 2
TOR 2 2 0 2 0 1 0 4 3 4
UTA 2 2 0 2 2 3 0 0 2 4
BOS 0 3 1 3 2 1 0 2 0 1
CHI 0 3 0 2 1 1 2 4 1 1
MEM 0 2 0 1 1 1 3 5 2 2
MIA 0 1 1 1 1 1 1 3 4 1
MIL 0 2 1 2 2 2 0 4 2 3
ORL 0 3 0 1 1 3 3 2 2 1
WAS 0 2 1 1 1 2 3 2 1 4
SAS 0 0 1 1 1 2 2 3 2 6

It’s pretty interesting to see different approaches.

  • Interior
    • ORL (Orlando Magic): no Engines or Anchors, but 3 Interior Connectors (Wendell Carter Jr., Goga Bitadze, Jonathan Isaac)
    • SAS (San Antonio Spurs): no Anchors or Connectors and only 1 Engine (Victor Wembanyama)
  • Perimeter
    • CLE: 8 players total, 3 of which are Perimeter Finishers (Georges Niang, Max Strus, Sam Merrill)
    • DEN: only 2 players (Finisher: Justin Holiday, Engine: Jamal Murray) and no Perimeter Anchors
  • Versatile
    • BOS (Boston Celtics): only 3 players and no Engines or Anchors
    • SAS (San Antonio Spurs): a total of 13, 6 of which are Versatile Finishers

Team Distributions

Let’s take a team like Boston and see which of our new position labels are getting minutes. Let’s work up a quick function for that:

peek_team_dist <- function(team) {
    nba_adjusted_full |>
        filter(profile_team_abbreviation == team) |>
        mutate(mp_gm = profile_minutes / profile_games_played) |>
        arrange(desc(mp_gm)) |>
        select(profile_name, label, abbrev, mp_gm) |>
        gt::gt()
}
peek_team_dist("BOS")
profile_name label abbrev mp_gm
Jayson Tatum Perimeter Engine PE 36.13183
Jaylen Brown Perimeter Engine PE 34.47500
Jrue Holiday Perimeter Anchor PA 32.56537
Kristaps Porzingis Interior Engine IE 30.58824
Derrick White Perimeter Anchor PA 30.30796
Al Horford Versatile Connector VC 28.64198
Enes Freedom Interior Connector IC 20.27103
Blake Griffin Versatile Finisher VF 18.97902
Payton Pritchard Perimeter Anchor PA 18.60481
Oshae Brissett Versatile Connector VC 18.25000
Sam Hauser Perimeter Finisher PF 17.71220
Xavier Tillman Sr. Interior Connector IC 16.85950
Luke Kornet Interior Connector IC 12.98469

Really interesting. Boston is led by their Engines, then Anchors, and then some Connectors.

Let’s try another team, say the Portland Trail Blazers:

peek_team_dist("POR")
profile_name label abbrev mp_gm
Jerami Grant Versatile Engine VE 33.82917
Deandre Ayton Interior Connector IC 30.67669
Anfernee Simons Perimeter Anchor PA 28.69600
Scoot Henderson Versatile Anchor VA 27.98734
Deni Avdija Versatile Finisher VF 26.22508
Toumani Camara Versatile Connector VC 26.15957
Shaedon Sharpe Versatile Engine VE 25.98438
Robert Williams III Interior Anchor IA 23.99379
Matisse Thybulle Versatile Connector VC 21.23970
Justise Winslow Versatile Finisher VF 19.98058
Kris Murray Versatile Connector VC 19.45783
Ben McLemore Perimeter Finisher PF 18.71795
Jabari Walker Interior Connector IC 17.17007
Duop Reath Versatile Finisher VF 16.01235
Bryce McGowens Versatile Connector VC 15.76415
CJ Elleby Versatile Connector VC 15.53409
Dalano Banton Versatile Anchor VA 14.02924

Here we see some differences, with Connectors featured a little more toward the top and a lot of Versatile-type players as opposed to Perimeter- or Interior-focused ones.

Next Steps

There are many avenues to take this analysis. We could analyze impact of each position on winning, understand career earnings through the lens of these new positions, and much more.

Where we are going to take the analysis is in the direction of understanding the style and role of incoming NBA prospects. One of the toughest parts of scouting the next wave of talent is judging how their game translates to the professional level.

How do the first round talents in the upcoming 2025 NBA draft project across our new positions?

This next phase requires:

  • Pre-NBA measures for as many players from our cluster assigned data set as possible
    • This will be our “training” set; we’ll intersect pre-NBA stats with the derived positions so far
    • NOTE: some players lack sufficient pre-NBA data due to coming directly from high school or playing internationally. These data sets are spotty and hard to access. For our purposes, we’ll concentrate on players who played in the NCAA (collegiate basketball league in the United States) prior to being drafted into the NBA. This ensures we have consistent data for modeling the relationship of pre-NBA performance and eventual NBA positions. It helps that roughly 80% of our original player list will be featured in the training set.
  • Active collegiate player measures
    • We’ll also want a “testing” set that features currently active collegiate players who have yet to play in the NBA. We’ll use the first round projections from No Ceilings
    • For reasons explained previously, we won’t collect measures for any international prospects ranked by No Ceilings (these are just 4 of the 30 first round prospects projected by No Ceilings)

Projecting Incoming Prospects

Data Prep

We’ll first load our “training” data (collegiate data for active NBA players) and intersect with the new positions:

mbb_data_raw <- read.csv('../bballrefstats/college-players.csv')
mbb_data_pos <- mbb_data_raw |>
    inner_join(
        nba_adjusted_full |> select(profile_name, cluster_label = label, cluster_abbrev = abbrev), 
        by = join_by(player_name == profile_name)
    )

We’ve got some NAs in our data:

# Check for missing or empty values
mbb_na <- sapply(mbb_data_pos, function(col) {
    sum(is.na(col) | col == "", na.rm = TRUE)
})

names(which(mbb_na > 0))
 [1] "col_3pp"  "col_obpm" "col_dbpm" "col_bpm"  "col_per"  "col_orbp"
 [7] "col_drbp" "col_trbp" "col_astp" "col_stlp" "col_blkp" "col_usgp"

We do have some missing values. Why is that? There are two primary reasons:

  1. Calculation limitation: three-point percentage requires 3P attempts. Where there are none, there is no possible value.
  2. Era limitations: not all measures have been available throughout basketball history, due to tracking technology evolving over time.

How do we resolve?

Well for #1, we’ll impute as zero. Since we account for volume via another measure (3PA), this shouldn’t be an issue.

mbb_fillna_1 <- mbb_data_pos |>
    mutate(col_3pp = ifelse(is.na(col_3pp), 0, col_3pp))

For the situation in #2, all players would have a value of at least zero. While not “captured” at the point in time, had the tracking technology existed, the values would be represented. Therefore, we impute to estimate what those values would likely have been. Let’s do some kNN imputation!

mbb_imputed <- kNN(
    mbb_fillna_1, 
    variable = setdiff(names(which(mbb_na > 0)), "col_3pp"), 
    k = 5
) |>
    select(!contains("_imp"))

And let’s confirm we have fully complete data:

# Check for missing or empty values
mbb_na_2 <- sapply(mbb_imputed, function(col) {
    sum(is.na(col) | col == "", na.rm = TRUE)
})

any(mbb_na_2 > 0)
[1] FALSE

And now, our training data is fully cleaned up! Let’s bring in our testing data, or the “incoming prospects”.

mbb_prospects <- read.csv('../bballrefstats/incoming-prospects.csv')
# Check for missing or empty values
any(sapply(mbb_prospects, function(col) {
    sum(is.na(col) | col == "", na.rm = TRUE)
}) > 0)
[1] FALSE

Perfect! And now, let’s just check to confirm we have similar columns:

setdiff(names(mbb_prospects), names(mbb_imputed))
[1] "draft_projection"

The only column in our testing data that isn’t in training is draft_projection, which is a context-only field. Let’s prep the models!

Modeling Prep

With our prepped data, we’re ready to start defining the models and the cross-validation infrastructure we’ll need. We’ll use two models: a neural network and a boosted trees approach (XGBoost).

We can evaluate feature importance with both, though in different ways. They are also fundamentally different approaches, the former being a feed-forward neural network while the latter is tree-based. In this way, we can validate results across model families.
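To make that contrast concrete: boosted trees expose a native, gain-based importance measure, while a neural network would need a model-agnostic approach such as permutation importance (e.g., via the vip package). A toy illustration of the former on a built-in data set (not our NBA data):

```r
library(xgboost)

# Toy binary target: is a car's mpg above the median?
X <- as.matrix(mtcars[, c("wt", "hp", "disp", "qsec")])
y <- as.numeric(mtcars$mpg > median(mtcars$mpg))

# xgb.DMatrix + xgb.train is the stable, version-agnostic interface
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "binary:logistic"),
                 data = dtrain, nrounds = 25)

# Gain-based feature importance, native to tree ensembles
xgb.importance(model = bst)
```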

Model Definition

mod_nn <- mlp(
    hidden_units = tune(), 
    penalty = tune(), 
    epochs = tune()
) |>
    set_engine("nnet") |>
    set_mode("classification")

mod_xg <- boost_tree(
    mtry = tune(), 
    trees = tune(), 
    tree_depth = tune(), 
    learn_rate = tune()
) |>
    set_engine("xgboost") |>
    set_mode("classification")

Cross Validation

With cross-validation, we’ll be able to confirm that the performance results we’re getting from each model aren’t due to chance. We’ll set up a 5-fold cross-validation.

mod_cv <- rsample::vfold_cv(mbb_imputed, v = 5)

We aren’t splitting the labeled collegiate data into separate training and testing sets; v-fold cross-validation sets up analysis/assessment splits for us, so the tuning results still reflect performance on data each model wasn’t trained on. The incoming prospects will serve as our true test set.
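To illustrate what `vfold_cv()` gives us (on a built-in data set rather than ours): each fold pairs an analysis set of roughly 4/5 of the rows with an assessment set of the held-out 1/5, and the assessment sets partition the data.

```r
library(rsample)

set.seed(814)
folds <- vfold_cv(mtcars, v = 5)

# Rows held out for assessment in each of the 5 folds
sizes <- sapply(folds$splits, function(s) nrow(assessment(s)))
sizes
sum(sizes) == nrow(mtcars)  # TRUE: every row is assessed exactly once
```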

Recipe

Our recipe is fairly straightforward, but we will put some extra preprocessing steps in there. In short, we want to predict cluster_label using all of the collegiate measures.

mod_recipe <- recipe(cluster_label ~ ., mbb_imputed) |>
    update_role(player_name, cluster_abbrev, new_role = "id") |>
    step_dummy(school_conf) |>
    step_normalize(all_numeric_predictors()) |>
    step_pca(all_numeric_predictors())
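One note on `step_pca()`: by default it keeps `num_comp = 5` components. If we instead wanted to mirror the earlier PCA work and retain a target share of variance, recipes supports a `threshold` argument. A quick, self-contained illustration on a built-in data set:

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris) |>
    step_normalize(all_numeric_predictors()) |>
    # keep however many components are needed to explain 95% of variance
    step_pca(all_numeric_predictors(), threshold = 0.95)

rec |> prep() |> bake(new_data = NULL) |> names()
```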

Hyperparameters

mod_nn_grid <- grid_regular(
    hidden_units(), 
    penalty(), 
    epochs(), 
    levels = 4
)

mod_xg_grid <- grid_regular(
    trees(), 
    tree_depth(), 
    learn_rate(), 
    mtry(c(1, ceiling(sqrt(30)))), 
    levels = 4
)

Fitting the Models

Parallelization:

# Define parallelization
cores_target <- ceiling(parallel::detectCores() * 0.75)
doParallel::registerDoParallel(cores = cores_target)

set.seed(814)

Boosted Trees

# Configure workflow
mod_wflw_xg <-
    workflow() |>
    add_model(mod_xg) |>
    add_recipe(mod_recipe)

# Run cross-validated tuning
set.seed(814)
mod_tune_xg <-
    mod_wflw_xg |>
    tune_grid(
        resamples = mod_cv,
        grid = mod_xg_grid,
        metrics = metric_set(roc_auc)
    )

Neural Network

Now let’s tune the hyperparameters of the neural network using cross-validation. Just as before, we’ll set up a workflow with the model and recipe, then tune against the neural network grid we set up previously.

# Configure workflow
mod_wflw_nn <-
    workflow() |>
    add_model(mod_nn) |>
    add_recipe(mod_recipe)

# Run cross-validated tuning
mod_tune_nn <-
    mod_wflw_nn |>
    tune_grid(
        resamples = mod_cv,
        grid = mod_nn_grid,
        metrics = metric_set(roc_auc)
    )

Comparing the Models

With both of those tuned, let’s compare the top 5 configurations for each model:

mod_tune_xg |> collect_metrics() |> slice_max(mean, n = 5)
# A tibble: 5 × 10
   mtry trees tree_depth learn_rate .metric .estimator  mean     n std_err
  <int> <int>      <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1     4   667          1        0.1 roc_auc hand_till  0.815     5 0.00691
2     2   667          1        0.1 roc_auc hand_till  0.814     5 0.00731
3     1   667          1        0.1 roc_auc hand_till  0.813     5 0.00779
4     6   667          1        0.1 roc_auc hand_till  0.813     5 0.00716
5     6  1333          1        0.1 roc_auc hand_till  0.804     5 0.00817
# ℹ 1 more variable: .config <chr>

The XGBoost model is producing AUC values of right around 0.81. We tuned 4 hyperparameters, and the top configurations share some common themes: trees of 667, tree_depth of 1, and learn_rate of 0.1, with mtry varying. That’s a pretty performant model. Now for the neural network:

mod_tune_nn |> collect_metrics() |> slice_max(mean, n = 5)
# A tibble: 5 × 9
  hidden_units penalty epochs .metric .estimator  mean     n std_err .config    
         <int>   <dbl>  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>      
1           10       1    670 roc_auc hand_till  0.854     5 0.00621 Preprocess…
2           10       1   1000 roc_auc hand_till  0.853     5 0.00641 Preprocess…
3            7       1   1000 roc_auc hand_till  0.852     5 0.00741 Preprocess…
4            7       1    340 roc_auc hand_till  0.852     5 0.00468 Preprocess…
5           10       1    340 roc_auc hand_till  0.851     5 0.00720 Preprocess…

The Neural Network model is producing AUC values of around 0.85. We tuned 3 hyperparameters; the only real common theme among the top configurations is a penalty of 1. We’re also seeing somewhat lower std_err than we had with XGBoost. Additionally, the time to fit the model was quite a bit lower with the Neural Network.

By exploring both of these methods with cross-validation and hyperparameter tuning, we’ve thoroughly validated our results. The Neural Network is the stronger performer, so let’s proceed with its best configuration for predicting on the new prospects. Let’s make the final fit with these hyperparameters and all of the training data:

# Select & fit best model
best_mod_nn <- mod_tune_nn |> select_best(metric = "roc_auc")
final_wflw_nn <- mod_wflw_nn |> finalize_workflow(best_mod_nn)
final_fit_nn <- fit(final_wflw_nn, data = mbb_imputed)

Predict for Prospects

Let’s now generate predictions for the incoming prospects:

mbb_predictions <- predict(
    final_fit_nn, 
    # drop the context-only field; add a placeholder cluster_abbrev so the
    # recipe's "id" roles are satisfied
    mbb_prospects |> select(-draft_projection) |> mutate(cluster_abbrev = "X")
)
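The prediction above returns hard class labels. parsnip also supports `type = "prob"`, which would attach per-class probabilities and give a sense of the confidence behind each prospect’s projection. A minimal, self-contained sketch of the pattern (fit on a built-in data set, not our actual workflow):

```r
library(parsnip)

set.seed(814)
# Small classification fit, mirroring our nnet engine
toy_fit <- mlp(hidden_units = 4, epochs = 200) |>
    set_engine("nnet") |>
    set_mode("classification") |>
    fit(Species ~ ., data = iris)

# One column of probabilities per class; each row sums to 1
predict(toy_fit, new_data = iris[c(1, 51, 101), ], type = "prob")
```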

And here we have our predictions w/ prospects:

classified_prospects <- 
    mbb_prospects |>
    mutate(cluster_label = mbb_predictions$.pred_class) |>
    select(draft_projection, player_name, cluster_label)

classified_prospects
   draft_projection          player_name       cluster_label
1                 1         Cooper Flagg Versatile Connector
2                 2         Dylan Harper    Perimeter Engine
3                 3           Ace Bailey Versatile Connector
4                 4           Egor Demin    Perimeter Anchor
5                 5          Tre Johnson  Perimeter Finisher
6                 6         VJ Edgecombe    Versatile Anchor
7                 7  Kasparas Jakucionis    Perimeter Anchor
8                 8       Khaman Maluach  Interior Connector
9                 9        Liam McNeeley  Perimeter Finisher
10               10           Asa Newell  Versatile Finisher
11               11         Kon Knueppel  Perimeter Finisher
12               13 Collin Murray-Boyles  Interior Connector
13               15          Derik Queen  Interior Connector
14               18         Boogie Fland    Perimeter Anchor
15               19       Nique Clifford Versatile Connector
16               20         Alex Karaban  Perimeter Finisher
17               21           Will Riley  Perimeter Finisher
18               22       Labaron Philon    Versatile Anchor
19               23        Hunter Sallis  Perimeter Finisher
20               24             KJ Lewis Versatile Connector
21               25         Drake Powell Versatile Connector
22               26          Ian Jackson  Perimeter Finisher
23               27      Kanon Catchings Versatile Connector
24               28        Carter Bryant Versatile Connector
25               29         Jalil Bethea  Perimeter Finisher
26               30     Mackenzie Mgbako Versatile Connector

Let’s take a peek at the distribution:

classified_prospects |>
    count(cluster_label) |>
    arrange(desc(n))
        cluster_label n
1  Perimeter Finisher 8
2 Versatile Connector 8
3  Interior Connector 3
4    Perimeter Anchor 3
5    Versatile Anchor 2
6    Perimeter Engine 1
7  Versatile Finisher 1

This first round features a lot of Perimeter Finishers and Versatile Connectors (8 prospects each). This would be extremely helpful information as teams prioritize prospects during the scouting season and move through the draft process: scheduling workouts, conducting interviews, and ultimately selecting a prospect.