library("tidyverse") # Used for general data manipulation and visualization
library("tidymodels") # Used for its modeling framework
library("tidyclust") # Used for clustering approaches
library("dataPreparation") # Used for scaling, where needed
library("kernlab") # Weighted kernel k-means
library("FactoMineR") # Used for PCA
library('VIM') # Used for training data imputation
Redefining NBA positions and classifying incoming NBA prospects
Prompts
Prompt #1
Which machine learning methods did you implement?
I chose to implement the following machine learning methods in my final project:
- Principal Component Analysis (PCA)
- KMeans Clustering
- Hierarchical Clustering
- Elastic Net Logistic Regression
- Neural Network
- Boosted Trees
Prompt #2
Discuss the key contribution of each method to your analysis. If a method didn’t contribute, discuss why it didn’t. A sentence or two for each method is plenty.
Principal Component Analysis (PCA): Used for the purpose of more performant clustering. Reduced dimensionality of source data from 195 features to 57 principal components while retaining 95% of overall variance.
KMeans Clustering: Used for the purpose of determining the proper number of clusters. Used the “within cluster sum of squares” measure to evaluate a range of cluster numbers, finding the optimal number by plotting the results and locating the elbow point. This was the method used for cluster assignments.
Hierarchical Clustering: Used for the purpose of validating the proper number of clusters. Used the “within cluster sum of squared error” measure to evaluate a range of cluster numbers, finding the optimal number via elbow point.
Elastic Net Logistic Regression: Used for the purpose of exploring top features that explain the unique characteristics of each cluster. Through these insights, I derived meaningful cluster labels. Tuned lambda and alpha (mixture) hyperparameters via cross-validation and minimized the performance metric Root Mean Squared Error.
Neural Network: Used for the purpose of predicting cluster labels for test data. Tuned penalty, hidden_units, and epochs hyperparameters for maximizing performance metric AUC on training data. Chosen as final model for prediction.
Boosted Trees: Used for the purpose of validating performance results of Neural Network model. Tuned mtry, trees, tree_depth, and learn_rate hyperparameters for maximizing performance metric AUC on training data.
Prompt #3
Did all methods support your conclusions or did some provide conflicting results? If they provided conflicting results, how did you reconcile the differences?
- Clustering:
  - KMeans and Hierarchical clustering were both employed to validate results.
  - Both gave similar results, but KMeans proved more consistent.
  - Hierarchical clustering via mclust saw a far more jagged and harder to read elbow point; in some instances, there were large portions of the line with marginal reductions in “WSSE” for each additional cluster.
  - KMeans had a nice and smooth, far easier to spot elbow point. With hierarchical clustering confirming KMeans was reasonable, I chose to use this approach for assigning clusters.
- Final Prediction Models:
  - Neural Network and XGBoost were both very performant, producing AUC values exceeding 0.81.
  - Neural Network fit across the cross-validated tuning grid MUCH faster than XGBoost, which was surprising.
  - Did not observe any further conflicting results.
Assignment Workflow
Data Prep
Load Libraries & Data Set
We’ll use the libraries loaded at the top of this document and their dependencies, then set the working directory:
setwd("full-projects/nba-player-position-roles/R")
Let’s import the source data:
nba_stats <- read.csv("../nba-player-data.csv")
The source data is comprised of ~200 features relating to the top 500 NBA players in minutes played over the past 5 seasons. The measures encompass offensive and defensive activity, including shooting, passing, rebounding, etc. We can review the measures below:
glimpse(nba_stats)
Rows: 500
Columns: 201
$ assists_assists <int> 987, 1448, 1334, 1676, 110…
$ assists_assist_points <int> 2404, 3675, 3345, 4160, 27…
$ assists_two_pt_assists <int> 557, 669, 657, 868, 586, 1…
$ assists_three_pt_assists <int> 430, 779, 677, 808, 519, 1…
$ assists_at_rim_assists <int> 339, 405, 449, 530, 348, 6…
$ assists_short_mid_range_assists <int> 160, 193, 169, 234, 106, 2…
$ assists_long_mid_range_assists <int> 58, 71, 39, 104, 132, 174,…
$ assists_corner3assists <int> 176, 264, 233, 228, 173, 2…
$ assists_arc3assists <int> 254, 515, 444, 580, 346, 7…
$ fouls_fouls <int> 571, 658, 666, 689, 747, 1…
$ fouls_shooting_fouls <int> 286, 231, 268, 363, 469, 4…
$ fouls_loose_ball_fouls <int> 23, 33, 23, 26, 45, 141, 4…
$ fouls_loose_ball_fouls_drawn <int> 25, 35, 14, 29, 80, 152, 4…
$ fouls_offensive_fouls <int> 17, 80, 34, 27, 71, 155, 6…
$ fouls_charge_fouls <int> 5, 51, 41, 23, 15, 45, 56,…
$ fouls_offensive_fouls_drawn <int> 85, 25, 13, 59, 13, 31, 29…
$ fouls_charge_fouls_drawn <int> 2, 0, 1, 22, 4, 7, 5, 10, …
$ fouls_clear_path_fouls <int> 0, 2, 2, 1, 0, 0, 0, 0, 0,…
$ fouls_transition_take_fouls <int> 1, 5, 4, 6, 3, 1, 3, 2, 1,…
$ fouls_transition_take_fouls_drawn <int> 2, 4, 3, 1, 0, 2, 2, 8, 2,…
$ fouls_defensive_3_seconds_violations <int> 0, 3, 10, 5, 2, 9, 12, 24,…
$ fouls_fouls_drawn <int> 829, 1471, 1084, 1681, 687…
$ free_fta <int> 979, 2103, 1581, 2293, 625…
$ free_technical_free_throw_trips <int> 31, 122, 44, 84, 22, 0, 6,…
$ free_two_pt_shooting_fouls_drawn <int> 422, 781, 750, 1109, 273, …
$ free_x2pt_and_1_free_throw_trips <int> 119, 172, 211, 289, 85, 20…
$ free_three_pt_shooting_fouls_drawn <int> 23, 84, 15, 16, 0, 2, 2, 1…
$ free_x3pt_and_1_free_throw_trips <int> 5, 12, 4, 9, 0, 0, 1, 2, 0…
$ free_non_shooting_fouls_drawn <int> 84, 187, 107, 122, 71, 147…
$ free_shooting_fouls_drawn_pct <dbl> 0.09357218, 0.12273441, 0.…
$ free_two_pt_shooting_fouls_drawn_pct <dbl> 0.14506703, 0.18303258, 0.…
$ free_three_pt_shooting_fouls_drawn_pct <dbl> 0.011982571, 0.028832117, …
$ misc_plus_minus <int> 918, 2130, 553, 35, -653, …
$ misc_on_off_rtg <dbl> 117.4043, 120.2431, 115.39…
$ misc_on_def_rtg <dbl> 113.4244, 111.2182, 113.10…
$ misc_first_chance_points <int> 5207, 7913, 7005, 7063, 54…
$ misc_three_pt_off_rebounded_pct <dbl> 0.2243415, 0.2360437, 0.21…
$ misc_at_rim_off_rebounded_pct <dbl> 0.3426573, 0.3629191, 0.33…
$ misc_short_mid_range_off_rebounded_pct <dbl> 0.2957983, 0.3379501, 0.31…
$ misc_long_mid_range_off_rebounded_pct <dbl> 0.1742424, 0.2364865, 0.21…
$ misc_blocks <int> 202, 186, 196, 131, 255, 1…
$ misc_blocked2s <int> 192, 165, 180, 129, 249, 1…
$ misc_blocked3s <int> 10, 21, 16, 2, 6, 3, 18, 0…
$ misc_blocked_at_rim <int> 113, 76, 97, 73, 164, 92, …
$ misc_blocked_short_mid_range <int> 72, 76, 74, 54, 83, 59, 48…
$ misc_blocked_long_mid_range <int> 7, 13, 9, 2, 2, 3, 1, 6, 1…
$ misc_blocked_corner3 <int> 3, 3, 6, 1, 1, 2, 6, 0, 1,…
$ misc_blocked_arc3 <int> 7, 18, 10, 1, 5, 1, 12, 0,…
$ misc_recovered_blocks <int> 120, 106, 107, 73, 157, 10…
$ misc_blocks_recovered_pct <dbl> 0.5940594, 0.5698925, 0.54…
$ misc_steals <int> 369, 331, 445, 326, 265, 2…
$ misc_lost_ball_steals <int> 85, 115, 178, 101, 67, 78,…
$ misc_bad_pass_steals <int> 284, 216, 267, 225, 198, 2…
$ misc_defensive_goaltends <int> 3, 9, 6, 1, 4, 9, 5, 0, 1,…
$ profile_name <chr> "Mikal Bridges", "Jayson T…
$ profile_team_abbreviation <chr> "NYK", "BOS", "MIN", "SAC"…
$ profile_games_played <int> 342, 311, 325, 310, 326, 3…
$ profile_minutes <int> 11898, 11237, 11229, 11182…
$ profile_height_in <int> 78, 80, 76, 78, 82, 82, 80…
$ profile_weight_lbs <int> 209, 210, 225, 220, 260, 2…
$ rebounds_rebounds <int> 1475, 2535, 1691, 1413, 35…
$ rebounds_def_rebounds <int> 1156, 2243, 1457, 1230, 28…
$ rebounds_ft_def_rebounds <int> 53, 143, 32, 69, 221, 218,…
$ rebounds_def_ft_rebound_pct <dbl> 0.09397163, 0.29183673, 0.…
$ rebounds_def_two_pt_rebounds <int> 508, 1005, 638, 569, 1432,…
$ rebounds_def_two_pt_rebound_pct <dbl> 0.08090460, 0.17065716, 0.…
$ rebounds_def_three_pt_rebounds <int> 595, 1095, 787, 592, 1210,…
$ rebounds_def_three_pt_rebound_pct <dbl> 0.11639280, 0.21361686, 0.…
$ rebounds_def_fg_rebound_pct <dbl> 0.09683083, 0.19064911, 0.…
$ rebounds_off_rebounds <int> 319, 292, 234, 183, 706, 9…
$ rebounds_ft_off_rebounds <int> 5, 3, 8, 5, 6, 11, 8, 6, 9…
$ rebounds_off_ft_rebound_pct <dbl> 0.010845987, 0.006726457, …
$ rebounds_off_two_pt_rebounds <int> 137, 147, 120, 126, 501, 6…
$ rebounds_off_two_pt_rebound_pct <dbl> 0.02258490, 0.03072100, 0.…
$ rebounds_off_three_pt_rebounds <int> 177, 142, 106, 52, 199, 31…
$ rebounds_off_three_pt_rebound_pct <dbl> 0.03506339, 0.02479050, 0.…
$ rebounds_off_fg_rebound_pct <dbl> 0.02825265, 0.02748977, 0.…
$ rebounds_def_at_rim_rebound_pct <dbl> 0.06641545, 0.13491635, 0.…
$ rebounds_def_short_mid_range_rebound_pct <dbl> 0.07489598, 0.17822142, 0.…
$ rebounds_def_long_mid_range_rebound_pct <dbl> 0.11871069, 0.20608899, 0.…
$ rebounds_def_arc3rebound_pct <dbl> 0.11399351, 0.21528977, 0.…
$ rebounds_def_corner3rebound_pct <dbl> 0.12511333, 0.20728291, 0.…
$ rebounds_off_at_rim_rebound_pct <dbl> 0.02734148, 0.03848467, 0.…
$ rebounds_off_short_mid_range_rebound_pct <dbl> 0.01847334, 0.03116279, 0.…
$ rebounds_off_long_mid_range_rebound_pct <dbl> 0.025033829, 0.016460905, …
$ rebounds_off_arc3rebound_pct <dbl> 0.03591009, 0.02408283, 0.…
$ rebounds_off_corner3rebound_pct <dbl> 0.032857143, 0.027237354, …
$ rebounds_self_o_reb <int> 32, 103, 74, 62, 110, 154,…
$ rebounds_self_o_reb_pct <dbl> 0.01536246, 0.03385930, 0.…
$ scoring_off_poss <int> 24063, 22788, 23186, 22708…
$ scoring_points <int> 5792, 8598, 7527, 7597, 62…
$ scoring_fg2m <int> 1447, 1948, 1798, 2500, 20…
$ scoring_fg2a <int> 2606, 3658, 3548, 4818, 37…
$ scoring_fg2pct <dbl> 0.5552571, 0.5325314, 0.50…
$ scoring_fg3m <int> 690, 974, 892, 202, 548, 1…
$ scoring_fg3a <int> 1819, 2673, 2473, 626, 154…
$ scoring_fg3pct <dbl> 0.3793293, 0.3643846, 0.36…
$ scoring_non_heave_fg3pct <dbl> 0.3815061, 0.3647940, 0.36…
$ scoring_ft_points <int> 828, 1780, 1255, 1991, 513…
$ scoring_pts_assisted2s <int> 1816, 1624, 1280, 1262, 29…
$ scoring_pts_unassisted2s <int> 1078, 2272, 2316, 3738, 12…
$ scoring_pts_assisted3s <int> 1932, 1575, 1482, 525, 163…
$ scoring_pts_unassisted3s <int> 138, 1347, 1194, 81, 6, 18…
$ scoring_assisted2s_pct <dbl> 0.6275052, 0.4168378, 0.35…
$ scoring_non_putbacks_assisted2s_pct <dbl> 0.6570188, 0.4344569, 0.36…
$ scoring_assisted3s_pct <dbl> 0.9333333, 0.5390144, 0.55…
$ scoring_fg3a_pct <dbl> 0.411073446, 0.422208182, …
$ scoring_shot_quality_avg <dbl> 0.5310678, 0.5169488, 0.53…
$ scoring_efg_pct <dbl> 0.5609040, 0.5384615, 0.52…
$ scoring_ts_pct <dbl> 0.5967909, 0.5903598, 0.56…
$ scoring_pts_putbacks <int> 130, 158, 78, 50, 442, 524…
$ scoring_fg2a_blocked <int> 156, 272, 239, 289, 201, 2…
$ scoring_fg2a_pct_blocked <dbl> 0.05986186, 0.07435757, 0.…
$ scoring_fg3a_blocked <int> 10, 11, 7, 6, 2, 3, 2, 2, …
$ scoring_fg3a_pct_blocked <dbl> 0.005497526, 0.004115226, …
$ scoring_usage <dbl> 19.56094, 31.35144, 29.148…
$ second_second_chance_off_poss <int> 2532, 2479, 2497, 2241, 20…
$ second_second_chance_points <int> 585, 685, 522, 534, 805, 1…
$ second_second_chance_points_pct <dbl> 0.10100138, 0.07966969, 0.…
$ second_second_chance_fg2m <int> 145, 172, 106, 178, 327, 3…
$ second_second_chance_fg2a <int> 252, 318, 238, 324, 583, 6…
$ second_second_chance_fg2pct <dbl> 0.5753968, 0.5408805, 0.44…
$ second_second_chance_fg3m <int> 79, 78, 79, 16, 23, 13, 37…
$ second_second_chance_fg3a <int> 181, 209, 185, 47, 77, 40,…
$ second_second_chance_fg3pct <dbl> 0.4364641, 0.3732057, 0.42…
$ second_second_chance_ft_points <int> 58, 107, 73, 130, 82, 213,…
$ second_second_chance_efg_pct <dbl> 0.6085450, 0.5483871, 0.53…
$ second_second_chance_ts_pct <dbl> 0.6334056, 0.5862069, 0.56…
$ second_second_chance_shot_quality_avg <dbl> 0.5340895, 0.5291019, 0.53…
$ second_second_chance_at_rim_fgm <int> 78, 127, 82, 59, 259, 344,…
$ second_second_chance_at_rim_fga <int> 110, 200, 135, 87, 426, 53…
$ second_second_chance_at_rim_frequency <dbl> 0.2540416, 0.3795066, 0.31…
$ second_second_chance_at_rim_accuracy <dbl> 0.7090909, 0.6350000, 0.60…
$ second_second_chance_at_rim_pct_assisted <dbl> 0.32051282, 0.25984252, 0.…
$ second_second_chance_corner3fgm <int> 51, 17, 17, 3, 1, 2, 9, 0,…
$ second_second_chance_corner3fga <int> 102, 44, 45, 12, 7, 4, 21,…
$ second_second_chance_corner3frequency <dbl> 0.235565820, 0.083491461, …
$ second_second_chance_corner3accuracy <dbl> 0.5000000, 0.3863636, 0.37…
$ second_second_chance_corner3pct_assisted <dbl> 0.9607843, 0.8235294, 1.00…
$ second_second_chance_arc3fgm <int> 28, 61, 62, 13, 22, 11, 28…
$ second_second_chance_arc3fga <int> 79, 165, 140, 35, 70, 36, …
$ second_second_chance_arc3frequency <dbl> 0.18244804, 0.31309298, 0.…
$ second_second_chance_arc3accuracy <dbl> 0.3544304, 0.3696970, 0.44…
$ second_second_chance_arc3pct_assisted <dbl> 0.9642857, 0.7704918, 0.69…
$ second_second_chance_turnovers <int> 34, 43, 44, 26, 39, 73, 40…
$ shot_shot_quality_avg <dbl> 0.5310678, 0.5169488, 0.53…
$ shot_at_rim_fg3a_frequency <dbl> 0.6352542, 0.6907282, 0.71…
$ shot_avg2pt_shot_distance <dbl> 7.803492, 7.222635, 6.7562…
$ shot_avg3pt_shot_distance <dbl> 24.83178, 26.09386, 25.710…
$ shot_at_rim_fgm <int> 704, 1189, 1178, 715, 865,…
$ shot_at_rim_fga <int> 992, 1700, 1848, 1089, 128…
$ shot_at_rim_frequency <dbl> 0.2241808, 0.2685200, 0.30…
$ shot_at_rim_accuracy <dbl> 0.7096774, 0.6994118, 0.63…
$ shot_unblocked_at_rim_accuracy <dbl> 0.7652174, 0.7612036, 0.68…
$ shot_at_rim_pct_assisted <dbl> 0.7301136, 0.4566863, 0.39…
$ shot_at_rim_pct_blocked <dbl> 0.07258064, 0.08117647, 0.…
$ shot_short_mid_range_fgm <int> 572, 459, 381, 851, 876, 4…
$ shot_short_mid_range_fga <int> 1175, 1191, 1019, 1672, 17…
$ shot_short_mid_range_frequency <dbl> 0.26553672, 0.18812194, 0.…
$ shot_short_mid_range_accuracy <dbl> 0.4868085, 0.3853904, 0.37…
$ shot_unblocked_short_mid_range_accuracy <dbl> 0.5190563, 0.4305816, 0.41…
$ shot_short_mid_range_pct_assisted <dbl> 0.54195804, 0.33551198, 0.…
$ shot_short_mid_range_pct_blocked <dbl> 0.06212766, 0.10495382, 0.…
$ shot_long_mid_range_fgm <int> 171, 300, 239, 934, 315, 1…
$ shot_long_mid_range_fga <int> 439, 767, 681, 2057, 733, …
$ shot_long_mid_range_frequency <dbl> 0.099209040, 0.121149897, …
$ shot_long_mid_range_accuracy <dbl> 0.3895216, 0.3911343, 0.35…
$ shot_unblocked_long_mid_range_accuracy <dbl> 0.3995327, 0.3957784, 0.35…
$ shot_long_mid_range_pct_assisted <dbl> 0.4912281, 0.3833333, 0.26…
$ shot_long_mid_range_pct_blocked <dbl> 0.025056948, 0.011734029, …
$ shot_corner3fgm <int> 361, 110, 104, 74, 40, 15,…
$ shot_corner3fga <int> 850, 277, 295, 221, 107, 4…
$ shot_corner3frequency <dbl> 0.192090395, 0.043752962, …
$ shot_corner3accuracy <dbl> 0.4247059, 0.3971119, 0.35…
$ shot_unblocked_corner3accuracy <dbl> 0.4267139, 0.3971119, 0.35…
$ shot_corner3pct_assisted <dbl> 0.9806094, 0.7909091, 0.86…
$ shot_corner3pct_blocked <dbl> 0.004705882, 0.000000000, …
$ shot_arc3fgm <int> 329, 864, 788, 128, 508, 1…
$ shot_arc3fga <int> 969, 2396, 2178, 405, 1440…
$ shot_arc3frequency <dbl> 0.218983051, 0.378455220, …
$ shot_arc3accuracy <dbl> 0.3395253, 0.3606010, 0.36…
$ shot_unblocked_arc3accuracy <dbl> 0.3416407, 0.3622642, 0.36…
$ shot_arc3pct_assisted <dbl> 0.8814590, 0.5069444, 0.51…
$ shot_arc3pct_blocked <dbl> 0.006191950, 0.004590985, …
$ shot_non_heave_arc3fgm <int> 328, 864, 788, 128, 508, 1…
$ shot_non_heave_arc3fga <int> 955, 2393, 2161, 395, 1438…
$ shot_non_heave_arc3accuracy <dbl> 0.3434555, 0.3610531, 0.36…
$ shot_heave_attempts <int> 13, 3, 17, 10, 2, 1, 5, 23…
$ shot_heave_makes <int> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ turnovers_turnovers <int> 451, 852, 922, 622, 565, 9…
$ turnovers_live_ball_turnovers <int> 297, 464, 540, 404, 320, 5…
$ turnovers_dead_ball_turnovers <int> 154, 388, 382, 218, 245, 4…
$ turnovers_live_ball_turnover_pct <dbl> 0.6585366, 0.5446009, 0.58…
$ turnovers_lost_ball_turnovers <int> 102, 203, 253, 156, 148, 2…
$ turnovers_lost_ball_out_of_bounds_turnovers <int> 38, 59, 68, 48, 21, 74, 41…
$ turnovers_bad_pass_turnovers <int> 195, 261, 287, 248, 172, 2…
$ turnovers_bad_pass_out_of_bounds_turnovers <int> 56, 124, 141, 63, 75, 107,…
$ turnovers_travels <int> 18, 44, 63, 40, 34, 72, 47…
$ turnovers_x3second_violations <int> 0, 2, 2, 0, 9, 15, 0, 6, 1…
$ turnovers_step_out_of_bounds_turnovers <int> 15, 12, 14, 10, 7, 5, 9, 3…
$ turnovers_offensive_goaltends <int> 0, 1, 2, 0, 3, 1, 1, 4, 1,…
Data Quality
Let’s see if we have any missing or empty values we need to worry about:
# Check for missing or empty values
na_summary <- sapply(nba_stats, function(col) {
  sum(is.na(col) | is.null(col) | col == "", na.rm = TRUE)
})
any(na_summary > 0)
[1] FALSE
No, all columns and rows feature complete observations.
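To confirm that a check of this shape catches both NA entries and empty strings, here is a minimal sketch on a tiny made-up data frame. The count_missing helper and the toy table are assumptions for illustration only and are not part of the project code.

```r
# Hypothetical helper: count NA or empty-string entries in each column.
count_missing <- function(df) {
  vapply(
    df,
    function(col) sum(is.na(col) | as.character(col) == "", na.rm = TRUE),
    integer(1)
  )
}

# Made-up toy data with one empty string and two NA values.
toy <- data.frame(
  name = c("A", "", NA),
  points = c(10L, NA, 3L),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing(toy)
print(missing_counts)

# Same yes/no question as asked of nba_stats above.
any(missing_counts > 0)
```

On the real data, the equivalent call would be count_missing(nba_stats), which should agree with na_summary.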
Scale Variables
We’re going to implement scaling of the variables in order to properly apply the PCA algorithm. First, we’ll separate the numeric and ID features from the data.
nba_numeric <- nba_stats |> select(
  where(is.numeric) &
    -c(profile_height_in, profile_weight_lbs)
)
nba_ids <- nba_stats |> select(
  where(is.character) |
    c(profile_games_played, profile_minutes, profile_height_in, profile_weight_lbs)
)
In order to avoid sensitivity to volume (i.e. one player playing more than another), we need to create volume-adjusted measures based on minutes (i.e. points per “36 minutes”, an industry standard). We’ll do this for all integer columns (so as to preserve the integrity of “rate” statistics, like “three-point percentage”). We’ll then drop the games and minutes features:
nba_adjusted <- nba_numeric |>
  mutate(across(
    where(is.integer) & -c(profile_games_played, profile_minutes),
    ~ . * 36.0 / profile_minutes
  )) |>
  select(-c(profile_games_played, profile_minutes))
And finally, we scale the numeric variables:
nba_scaled <- scale(nba_adjusted)
Principal Component Analysis
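PCA is sensitive to the scale of each feature, so it’s worth seeing what the two preparation steps from the previous section do on a small table. This is a minimal base R sketch: the toy data frame reuses the first three scoring_points and profile_minutes values from the glimpse output, and the column name points_per36 is an assumption for illustration.

```r
# First three players' totals, copied from the glimpse output above.
toy <- data.frame(
  scoring_points = c(5792L, 8598L, 7527L),
  profile_minutes = c(11898L, 11237L, 11229L)
)

# Volume adjustment: convert a raw total into a per-36-minute rate.
toy$points_per36 <- toy$scoring_points * 36 / toy$profile_minutes

# Standardize: subtract the column mean and divide by the column
# standard deviation, which is what scale() does by default.
toy_scaled <- scale(toy["points_per36"])

print(toy)
print(toy_scaled)
```

After scaling, the column has mean 0 and standard deviation 1, so no single high-volume feature dominates the principal components.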
Using the scaled variables, let’s perform PCA. We’ll try to cluster with and without PCA. We’ll have both data set versions at our disposal for clustering.
nba_pca <- princomp(nba_scaled)
summary(nba_pca)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation 7.907895 5.1462278 3.4041457 2.66083766 2.4306877
Proportion of Variance 0.321334 0.1360858 0.0595458 0.03638075 0.0303594
Cumulative Proportion 0.321334 0.4574198 0.5169656 0.55334632 0.5837057
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation 2.38120145 2.28542661 1.99469866 1.88094633 1.75536205
Proportion of Variance 0.02913581 0.02683919 0.02044511 0.01817974 0.01583318
Cumulative Proportion 0.61284153 0.63968072 0.66012583 0.67830557 0.69413876
Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
Standard deviation 1.6568553 1.53539884 1.47993093 1.43593006 1.389579004
Proportion of Variance 0.0141060 0.01211371 0.01125428 0.01059501 0.009922048
Cumulative Proportion 0.7082448 0.72035847 0.73161275 0.74220776 0.752129812
Comp.16 Comp.17 Comp.18 Comp.19
Standard deviation 1.3641266 1.331451758 1.304909173 1.255396634
Proportion of Variance 0.0095619 0.009109315 0.008749745 0.008098354
Cumulative Proportion 0.7616917 0.770801027 0.779550773 0.787649127
Comp.20 Comp.21 Comp.22 Comp.23
Standard deviation 1.215643107 1.202024017 1.17196870 1.159173138
Proportion of Variance 0.007593588 0.007424396 0.00705776 0.006904488
Cumulative Proportion 0.795242715 0.802667111 0.80972487 0.816629358
Comp.24 Comp.25 Comp.26 Comp.27
Standard deviation 1.124481456 1.109380008 1.09655417 1.066552118
Proportion of Variance 0.006497398 0.006324053 0.00617867 0.005845195
Cumulative Proportion 0.823126756 0.829450809 0.83562948 0.841474675
Comp.28 Comp.29 Comp.30 Comp.31
Standard deviation 1.049545355 1.027979984 1.008532128 1.002988524
Proportion of Variance 0.005660272 0.005430054 0.005226541 0.005169241
Cumulative Proportion 0.847134946 0.852565001 0.857791541 0.862960782
Comp.32 Comp.33 Comp.34 Comp.35
Standard deviation 0.97724998 0.960136207 0.945637156 0.934723089
Proportion of Variance 0.00490734 0.004736969 0.004594983 0.004489529
Cumulative Proportion 0.86786812 0.872605092 0.877200074 0.881689604
Comp.36 Comp.37 Comp.38 Comp.39
Standard deviation 0.916978377 0.910562982 0.898229189 0.878372129
Proportion of Variance 0.004320689 0.004260444 0.004145808 0.003964532
Cumulative Proportion 0.886010293 0.890270737 0.894416544 0.898381077
Comp.40 Comp.41 Comp.42 Comp.43
Standard deviation 0.857086920 0.839611729 0.834929882 0.827178550
Proportion of Variance 0.003774719 0.003622362 0.003582077 0.003515875
Cumulative Proportion 0.902155795 0.905778157 0.909360234 0.912876108
Comp.44 Comp.45 Comp.46 Comp.47
Standard deviation 0.812988721 0.798641549 0.781463388 0.767881925
Proportion of Variance 0.003396283 0.003277469 0.003137994 0.003029868
Cumulative Proportion 0.916272391 0.919549861 0.922687855 0.925717723
Comp.48 Comp.49 Comp.50 Comp.51
Standard deviation 0.76373612 0.734865089 0.720837237 0.708591892
Proportion of Variance 0.00299724 0.002774918 0.002669988 0.002580045
Cumulative Proportion 0.92871496 0.931489880 0.934159868 0.936739913
Comp.52 Comp.53 Comp.54 Comp.55
Standard deviation 0.699893597 0.69560943 0.682230518 0.66152373
Proportion of Variance 0.002517091 0.00248637 0.002391647 0.00224867
Cumulative Proportion 0.939257004 0.94174337 0.944135021 0.94638369
Comp.56 Comp.57 Comp.58 Comp.59
Standard deviation 0.656140029 0.639355610 0.630872837 0.621482067
Proportion of Variance 0.002212218 0.002100486 0.002045119 0.001984687
Cumulative Proportion 0.948595909 0.950696395 0.952741514 0.954726201
Comp.60 Comp.61 Comp.62 Comp.63
Standard deviation 0.606892195 0.598237879 0.589569555 0.579751076
Proportion of Variance 0.001892596 0.001839004 0.001786097 0.001727102
Cumulative Proportion 0.956618797 0.958457801 0.960243897 0.961970999
Comp.64 Comp.65 Comp.66 Comp.67
Standard deviation 0.562726007 0.552033414 0.547205173 0.542456886
Proportion of Variance 0.001627155 0.001565906 0.001538634 0.001512047
Cumulative Proportion 0.963598154 0.965164060 0.966702693 0.968214740
Comp.68 Comp.69 Comp.70 Comp.71
Standard deviation 0.530315310 0.522451609 0.519397356 0.508823656
Proportion of Variance 0.001445118 0.001402578 0.001386227 0.001330361
Cumulative Proportion 0.969659858 0.971062436 0.972448663 0.973779023
Comp.72 Comp.73 Comp.74 Comp.75
Standard deviation 0.489730309 0.478104943 0.467514197 0.462894960
Proportion of Variance 0.001232392 0.001174577 0.001123116 0.001101032
Cumulative Proportion 0.975011415 0.976185992 0.977309107 0.978410139
Comp.76 Comp.77 Comp.78 Comp.79
Standard deviation 0.459996993 0.45336006 0.444074268 0.4281947615
Proportion of Variance 0.001087289 0.00105614 0.001013319 0.0009421446
Cumulative Proportion 0.979497427 0.98055357 0.981566886 0.9825090303
Comp.80 Comp.81 Comp.82 Comp.83
Standard deviation 0.4187124287 0.4039841057 0.3930316736 0.3881281819
Proportion of Variance 0.0009008792 0.0008386165 0.0007937614 0.0007740789
Cumulative Proportion 0.9834099095 0.9842485260 0.9850422874 0.9858163662
Comp.84 Comp.85 Comp.86 Comp.87
Standard deviation 0.3813032905 0.3714911500 0.3638091857 0.350640222
Proportion of Variance 0.0007470952 0.0007091397 0.0006801147 0.000631769
Cumulative Proportion 0.9865634615 0.9872726011 0.9879527158 0.988584485
Comp.88 Comp.89 Comp.90 Comp.91
Standard deviation 0.3483221088 0.3452766491 0.3381688860 0.3255143029
Proportion of Variance 0.0006234433 0.0006125891 0.0005876275 0.0005444713
Cumulative Proportion 0.9892079281 0.9898205172 0.9904081447 0.9909526160
Comp.92 Comp.93 Comp.94 Comp.95
Standard deviation 0.3169023497 0.312833968 0.310982739 0.2999009790
Proportion of Variance 0.0005160429 0.000502878 0.000496944 0.0004621581
Cumulative Proportion 0.9914686589 0.991971537 0.992468481 0.9929306390
Comp.96 Comp.97 Comp.98 Comp.99
Standard deviation 0.2969837314 0.2886732854 0.2839062227 0.2786547326
Proportion of Variance 0.0004532107 0.0004282014 0.0004141758 0.0003989952
Cumulative Proportion 0.9933838497 0.9938120511 0.9942262268 0.9946252221
Comp.100 Comp.101 Comp.102 Comp.103
Standard deviation 0.2670630443 0.2626119636 0.2506656178 0.2387134134
Proportion of Variance 0.0003664903 0.0003543756 0.0003228675 0.0002928117
Cumulative Proportion 0.9949917123 0.9953460880 0.9956689555 0.9959617673
Comp.104 Comp.105 Comp.106 Comp.107
Standard deviation 0.2378688367 0.2229413349 0.2168524945 0.2121617325
Proportion of Variance 0.0002907435 0.0002553971 0.0002416371 0.0002312964
Cumulative Proportion 0.9962525107 0.9965079079 0.9967495450 0.9969808414
Comp.108 Comp.109 Comp.110 Comp.111
Standard deviation 0.2066857499 0.1962149302 0.191377167 0.1867972530
Proportion of Variance 0.0002195108 0.0001978331 0.000188198 0.0001792982
Cumulative Proportion 0.9972003523 0.9973981853 0.997586383 0.9977656815
Comp.112 Comp.113 Comp.114 Comp.115
Standard deviation 0.1795094417 0.1750229081 0.1718757618 0.1656258088
Proportion of Variance 0.0001655806 0.0001574072 0.0001517973 0.0001409584
Cumulative Proportion 0.9979312621 0.9980886694 0.9982404667 0.9983814250
Comp.116 Comp.117 Comp.118 Comp.119
Standard deviation 0.1605428578 0.1578360974 0.1494406610 0.1433299991
Proportion of Variance 0.0001324393 0.0001280111 0.0001147552 0.0001055623
Cumulative Proportion 0.9985138643 0.9986418754 0.9987566306 0.9988621930
Comp.120 Comp.121 Comp.122 Comp.123
Standard deviation 1.329839e-01 0.1305931473 1.276026e-01 1.256506e-01
Proportion of Variance 9.087267e-05 0.0000876346 8.366689e-05 8.112669e-05
Cumulative Proportion 9.989531e-01 0.9990407002 9.991244e-01 9.992055e-01
Comp.124 Comp.125 Comp.126 Comp.127
Standard deviation 1.199905e-01 1.117375e-01 1.081000e-01 1.039045e-01
Proportion of Variance 7.398242e-05 6.415533e-05 6.004632e-05 5.547577e-05
Cumulative Proportion 9.992795e-01 9.993436e-01 9.994037e-01 9.994592e-01
Comp.128 Comp.129 Comp.130 Comp.131
Standard deviation 1.003603e-01 9.682056e-02 9.374808e-02 8.973321e-02
Proportion of Variance 5.175581e-05 4.816927e-05 4.516059e-05 4.137531e-05
Cumulative Proportion 9.995109e-01 9.995591e-01 9.996042e-01 9.996456e-01
Comp.132 Comp.133 Comp.134 Comp.135
Standard deviation 8.790406e-02 0.0844869330 7.616143e-02 7.421226e-02
Proportion of Variance 3.970569e-05 0.0000366787 2.980609e-05 2.829998e-05
Cumulative Proportion 9.996853e-01 0.9997219990 9.997518e-01 9.997801e-01
Comp.136 Comp.137 Comp.138 Comp.139
Standard deviation 6.815107e-02 6.730069e-02 6.484186e-02 6.177823e-02
Proportion of Variance 2.386603e-05 2.327415e-05 2.160458e-05 1.961127e-05
Cumulative Proportion 9.998040e-01 9.998272e-01 9.998488e-01 9.998685e-01
Comp.140 Comp.141 Comp.142 Comp.143
Standard deviation 5.866739e-02 5.572205e-02 5.361931e-02 5.199413e-02
Proportion of Variance 1.768595e-05 1.595472e-05 1.477329e-05 1.389132e-05
Cumulative Proportion 9.998861e-01 9.999021e-01 9.999169e-01 9.999308e-01
Comp.144 Comp.145 Comp.146 Comp.147
Standard deviation 4.881348e-02 4.206109e-02 3.979695e-02 3.752638e-02
Proportion of Variance 1.224375e-05 9.090668e-06 8.138312e-06 7.236161e-06
Cumulative Proportion 9.999430e-01 9.999521e-01 9.999602e-01 9.999675e-01
Comp.148 Comp.149 Comp.150 Comp.151
Standard deviation 3.598320e-02 3.098396e-02 3.024330e-02 2.483474e-02
Proportion of Variance 6.653260e-06 4.932972e-06 4.699951e-06 3.169233e-06
Cumulative Proportion 9.999741e-01 9.999791e-01 9.999838e-01 9.999869e-01
Comp.152 Comp.153 Comp.154 Comp.155
Standard deviation 2.028335e-02 1.869018e-02 1.830842e-02 1.732935e-02
Proportion of Variance 2.114045e-06 1.794988e-06 1.722410e-06 1.543119e-06
Cumulative Proportion 9.999890e-01 9.999908e-01 9.999926e-01 9.999941e-01
Comp.156 Comp.157 Comp.158 Comp.159
Standard deviation 1.607269e-02 1.482187e-02 1.359563e-02 1.154321e-02
Proportion of Variance 1.327431e-06 1.128862e-06 9.498035e-07 6.846808e-07
Cumulative Proportion 9.999954e-01 9.999966e-01 9.999975e-01 9.999982e-01
Comp.160 Comp.161 Comp.162 Comp.163
Standard deviation 9.457766e-03 7.895078e-03 7.487837e-03 7.187040e-03
Proportion of Variance 4.596339e-07 3.202932e-07 2.881029e-07 2.654208e-07
Cumulative Proportion 9.999987e-01 9.999990e-01 9.999993e-01 9.999995e-01
Comp.164 Comp.165 Comp.166 Comp.167
Standard deviation 6.987417e-03 6.537667e-03 1.141043e-07 4.703629e-08
Proportion of Variance 2.508812e-07 2.196243e-07 6.690194e-17 1.136844e-17
Cumulative Proportion 9.999998e-01 1.000000e+00 1.000000e+00 1.000000e+00
Comp.168 Comp.169 Comp.170 Comp.171
Standard deviation 2.919608e-08 2.340045e-08 1.517024e-08 1.481497e-08
Proportion of Variance 4.380101e-18 2.813736e-18 1.182551e-18 1.127812e-18
Cumulative Proportion 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
Comp.172 Comp.173 Comp.174 Comp.175 Comp.176
Standard deviation 1.461163e-08 1.262019e-08 0 0 0
Proportion of Variance 1.097064e-18 8.184013e-19 0 0 0
Cumulative Proportion 1.000000e+00 1.000000e+00 1 1 1
Comp.177 Comp.178 Comp.179 Comp.180 Comp.181 Comp.182
Standard deviation 0 0 0 0 0 0
Proportion of Variance 0 0 0 0 0 0
Cumulative Proportion 1 1 1 1 1 1
Comp.183 Comp.184 Comp.185 Comp.186 Comp.187 Comp.188
Standard deviation 0 0 0 0 0 0
Proportion of Variance 0 0 0 0 0 0
Cumulative Proportion 1 1 1 1 1 1
Comp.189 Comp.190 Comp.191 Comp.192 Comp.193 Comp.194
Standard deviation 0 0 0 0 0 0
Proportion of Variance 0 0 0 0 0 0
Cumulative Proportion 1 1 1 1 1 1
Comp.195
Standard deviation 0
Proportion of Variance 0
Cumulative Proportion 1
We have 195 numeric source variables so 195 principal components. The first 15 components explain 75% of variance, the first 40 explain 90% of the variance and the first 57 principal components explain 95%. That means just over 1/4 of our original number of features explain nearly all of the total variance.
PCA did a good job at 1) reducing dimensionality and 2) eliminating any collinearity of features.
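The component counts quoted above can also be read off programmatically instead of scanning the summary table. A minimal sketch, assuming the built-in USArrests data as a stand-in for nba_scaled; the variable names below are illustrative and not from the project code.

```r
# Stand-in for nba_pca: PCA on a small built-in data set.
pca <- princomp(scale(USArrests))

# Each component's share of variance is its squared standard deviation
# divided by the total of all squared standard deviations.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches 95%.
n_components <- which(cum_var >= 0.95)[1]

print(round(cum_var, 4))
print(n_components)
```

Replacing pca with nba_pca should reproduce the 57-component cutoff read from the summary output, assuming the same fitted object.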
Let’s save our results:
nba_pca_data <- as.data.frame(nba_pca$scores)
Let’s visualize the first two principal components:
ggplot(
  nba_pca_data,
  aes(Comp.1, Comp.2)
) +
  geom_point()
Clustering
Let’s try to cluster these observations.
Historically, basketball has used 5 positions. In modern times, this has been reduced to approximately 3. Let’s use 3 as our minimum and 20 as our maximum. Let’s try a few different clustering techniques.
clustering_grid <- data.frame(
  clusters = 3:20
)
Against each of these, we can run different clustering algorithms and produce measures for the “within sum of squares”. This will help us determine the proper number of clusters derived from the data.
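To make the “within sum of squares” measure concrete: for each cluster, take the squared distance from every member to the cluster centroid, then add everything up. The sketch below recomputes it by hand and compares it with the value kmeans reports. It assumes the built-in iris measurements as a stand-in for nba_scaled; the nstart setting and the name wss_manual are my own choices for illustration.

```r
# Stand-in data: the built-in iris measurements, scaled like nba_scaled.
x <- scale(iris[, 1:4])

set.seed(2015)
fit <- kmeans(x, centers = 3, nstart = 10)

# For each cluster, sum the squared distances from its members to its
# centroid, then add the per-cluster totals together.
wss_manual <- sum(sapply(seq_len(nrow(fit$centers)), function(j) {
  members <- x[fit$cluster == j, , drop = FALSE]
  sum(sweep(members, 2, fit$centers[j, ])^2)
}))

# Compare the hand-computed value against the one kmeans reports.
print(all.equal(wss_manual, fit$tot.withinss))
```

A lower total means tighter clusters, but the value always shrinks as clusters are added, which is why we look for an elbow rather than a minimum.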
Partition Clustering
Let’s set up a function to run a kmeans clustering for every number of clusters in the above grid and generate the respective performance metric:
cluster_kmeans <- function(k, data) {
  fit <- kmeans(data, k)
  vals <- glance(fit)
  return(vals$tot.withinss)
}
Hierarchical Clustering
Let’s do the same thing but for an hclust algorithm:
cluster_hclust <- function(k, data) {
  # Run the algorithm
  model <- hier_clust(num_clusters = k, linkage_method = "complete")
  fit <- model |> fit(~., data = as.data.frame(data))
  wss <- fit |>
    sse_within() |>
    select(wss) |>
    unlist() |>
    sum()
  return(wss)
}
Let’s now generate our clusters!
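As a rough cross-check of the hierarchical version, the same kind of measure can be sketched in base R with hclust and cutree. This is an assumption-laden stand-in, not the tidyclust implementation: it uses the built-in iris data instead of nba_scaled, the helper name wss_for_k is hypothetical, and its numbers may not match sse_within exactly.

```r
# Stand-in data, scaled like nba_scaled.
x <- scale(iris[, 1:4])

# Build the tree once with complete linkage, then cut it at different heights.
tree <- hclust(dist(x), method = "complete")

# Hypothetical helper: total within-cluster sum of squares for k clusters.
wss_for_k <- function(k) {
  labels <- cutree(tree, k = k)
  sum(sapply(split(as.data.frame(x), labels), function(members) {
    # Center each cluster on its own column means, then square and sum.
    centered <- scale(members, center = TRUE, scale = FALSE)
    sum(centered^2)
  }))
}

wss_hclust <- sapply(3:6, wss_for_k)
print(wss_hclust)
```

Because each cut of the tree only splits existing clusters, this total never increases as k grows.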
Cluster Results
Let’s map over the number of clusters and execute the respective algorithm.
clustering_grid_01 <-
  clustering_grid |>
  mutate(
    kmeans = map(clusters, ~ cluster_kmeans(.x, nba_scaled)),
    hclust = map(clusters, ~ cluster_hclust(.x, nba_scaled))
  )
We can now plot these and find the “elbow”, or the point of diminishing returns from increasing the number of clusters.
clustering_grid_01 |>
  pivot_longer(cols = -clusters) |>
  unnest(value) |>
  ggplot(aes(factor(clusters), as.numeric(value))) +
  geom_line(aes(color = name), group = 1) +
  facet_wrap(~name, ncol = 1, scales = "free")
hclust gives the impression that around 10 is the right number of clusters, though the elbow is difficult to identify. kmeans suggests 9 or 10.
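Reading an elbow off a plot is subjective, so it can help to look at the reduction in within sum of squares for each added cluster as plain numbers. The sketch below does that on the built-in iris data as a stand-in for nba_scaled. The seed, nstart, and iter.max values are my own additions to make the curve repeatable; they are not settings used in the analysis above, and the name elbow_table is illustrative.

```r
# Stand-in data, scaled like nba_scaled.
x <- scale(iris[, 1:4])

set.seed(2015)
ks <- 2:8

# Several random starts per k reduce the run-to-run noise in the curve.
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
})

# Reduction in within sum of squares gained by adding one more cluster.
elbow_table <- data.frame(
  clusters = ks[-1],
  wss = wss[-1],
  reduction = -diff(wss)
)
print(elbow_table)
```

The elbow sits roughly where the reduction column flattens out; past that point each extra cluster buys little.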
Let’s see if we get different results using just the first 57 principal components.
clustering_grid_02 <-
clustering_grid |>
mutate(
kmeans = map(clusters, ~ cluster_kmeans(.x, nba_pca_data[, 1:57])),
hclust = map(clusters, ~ cluster_hclust(.x, nba_pca_data[, 1:57]))
)
clustering_grid_02 |>
pivot_longer(cols = -clusters) |>
unnest(value) |>
ggplot(aes(factor(clusters), as.numeric(value))) +
geom_line(aes(color = name), group = 1) +
facet_wrap(~name, ncol = 1, scales = "free")
The same algorithms with the first 57 principal components indicate somewhere around 9 to 10 clusters. Let’s proceed with the PCA results and assume 10 clusters. I also think the kmeans curve is a little smoother, so let’s default to that algorithm.
set.seed(2015)
fit <- kmeans(nba_pca_data[, 1:57], 10)
nba_ids$cluster <- factor(fit$cluster)
nba_adjusted_full <- nba_adjusted |> mutate(cluster = factor(fit$cluster))
nba_scaled_full <- as.data.frame(nba_scaled) |> mutate(cluster = factor(fit$cluster))
nba_ids |>
count(cluster) |>
mutate(prop = n / sum(n))
cluster n prop
1 1 37 0.074
2 2 64 0.128
3 3 50 0.100
4 4 38 0.076
5 5 52 0.104
6 6 58 0.116
7 7 99 0.198
8 8 59 0.118
9 9 28 0.056
10 10 15 0.030
The initial results seem fairly reasonable. Understandably, some clusters (or as we would interpret, “positions”/“roles”/“styles”) have more players than others given the nature of the game.
Let’s evaluate some specific players and get a sense for the results.
Cluster Evaluation
The first example deals with 4 players typically thought of as “centers”. Their physical profiles are somewhat similar but there are significant differences in style and role. We should probably see three different cluster assignments.
nba_ids |> filter(
profile_name %in% c(
"Victor Wembanyama",
"Nikola Jokic",
"Clint Capela",
"Rudy Gobert"
)
)
profile_name profile_team_abbreviation profile_games_played
1 Nikola Jokic DEN 313
2 Rudy Gobert MIN 306
3 Clint Capela ATL 300
4 Victor Wembanyama SAS 90
profile_minutes profile_height_in profile_weight_lbs cluster
1 10738 83 284 10
2 9837 85 258 9
3 8121 82 256 9
4 2717 87 235 10
We see Capela and Gobert with the same cluster assignment, which makes sense, but Jokic and Wembanyama are also assigned to the same cluster. Given their offensive games, this could make sense, as most of their differences are on the defensive end.
Let’s try another. These three all have similar physical profiles, roles, and play styles. Let’s see how they are clustered.
nba_ids |> filter(
profile_name %in% c(
"Jimmy Butler",
"Jayson Tatum",
"Jaylen Brown"
)
)
profile_name profile_team_abbreviation profile_games_played profile_minutes
1 Jayson Tatum BOS 311 11237
2 Jaylen Brown BOS 280 9653
3 Jimmy Butler MIA 250 8402
profile_height_in profile_weight_lbs cluster
1 80 210 4
2 78 223 4
3 79 230 4
We see they all fall into the same cluster! This gives some assurance that the clustering is capturing some of the inherent patterns.
What about players who’ve historically been labeled “point guards”? Each of these plays so differently that we should see completely different cluster assignments.
nba_ids |> filter(
profile_name %in% c(
"Collin Sexton",
"Bruce Brown",
"Stephen Curry",
"Jose Alvarado"
)
)
profile_name profile_team_abbreviation profile_games_played profile_minutes
1 Stephen Curry GSW 275 9275
2 Bruce Brown TOR 284 7370
3 Collin Sexton UTA 220 6293
4 Jose Alvarado NOP 182 3454
profile_height_in profile_weight_lbs cluster
1 74 185 4
2 76 202 2
3 75 190 4
4 73 179 6
All different except for Curry and Sexton. We’ll have to dig into the cluster more closely to learn about this. So far it’s tracking pretty close to what a contextual lens might suggest.
Let’s look at some of the most physically dissimilar players that share the same cluster.
getMinMax <- function(data, cluster = 1) {
data_f <- data[data$cluster == cluster, ]
data_f$val <- (scale(data_f$profile_height_in) + scale(data_f$profile_weight_lbs)) / 2
data_f <- data_f |> arrange(desc(val))
return(c(
data_f$profile_name[nrow(data_f)],
data_f$profile_name[1]
))
}
for (c in sort(unique(nba_ids$cluster))) {
players <- getMinMax(nba_ids, c)
print(paste(
c, "- Min:", players[1],
"| Max:", players[2]
))
}
[1] "1 - Min: Scotty Pippen Jr. | Max: Ben Simmons"
[1] "2 - Min: Bruce Brown | Max: Brook Lopez"
[1] "3 - Min: Seth Curry | Max: Danilo Gallinari"
[1] "4 - Min: Trae Young | Max: Paolo Banchero"
[1] "5 - Min: Ausar Thompson | Max: Robin Lopez"
[1] "6 - Min: D.J. Augustin | Max: Joe Ingles"
[1] "7 - Min: Johnny Davis | Max: Maxi Kleber"
[1] "8 - Min: Isaiah Joe | Max: Mike Muscala"
[1] "9 - Min: Isaiah Jackson | Max: JaVale McGee"
[1] "10 - Min: Domantas Sabonis | Max: Jusuf Nurkic"
Generally speaking, these make a lot of sense. The next step would be to analyze each cluster and come up with unique labels for them that describe the new position/role/style.
Cluster Naming
Clusters are labeled arbitrarily. There’s nothing intuitive about the labels 1, 2, etc. We need to give these clusters meaning by assigning each one a descriptive label.
We’ll do that two ways:
- We’ll create a penalized logistic regression model for each individual cluster and select the coefficients with the highest absolute values. In this way, we can understand some of the predictors that define the cluster.
- We’ll leverage AI to use what it knows about the players in each cluster to give label suggestions.
We’ll pool these perspectives together to generate our own label. We’ll save the label in the following table:
cluster_labels <- tibble(
cluster = factor(1:10),
label = as.character(NA),
abbrev = as.character(NA),
)
Guidelines
We want to shy away from traditional language: “guard”, “forward”, “center”. Even labels like “backcourt” and “frontcourt” can pigeonhole a group of players in unhelpful ways. We may also want to shy away from modern terms like “wing” and “big”.
This puts more emphasis on style of play and role than physical profile or position.
Top Features Model
Here’s the function that will ingest our dataset and classify against a binary target (1 = cluster of interest, 0 = all other clusters).
We’ll perform cross validation, take the best model, fit on the entire data set, and take the top coefficients.
get_elasnet_top_features <- function(data) {
# Configure recipe
mod_rec <- recipe(target ~ ., data)
# Setup cross-validation folds
mod_cv <- rsample::vfold_cv(data, v = 5)
# Configure tuning grid
mod_tune_grid <- grid_regular(
penalty(),
mixture(),
levels = 4
)
# Setup model definition
mod_def <- logistic_reg(
mixture = tune(),
penalty = tune()
) |>
set_engine("glmnet")
# Configure workflow
mod_wflw <-
workflow() |>
add_model(mod_def) |>
add_recipe(mod_rec)
# Run cross-validated tuning
set.seed(814)
mod_tune <-
mod_wflw |>
tune_grid(
resamples = mod_cv,
grid = mod_tune_grid,
metrics = metric_set(roc_auc)
)
# Select & fit best model
best_mod <- mod_tune |> select_best(metric = "roc_auc")
final_wflw <- mod_wflw |> finalize_workflow(best_mod)
final_fit <- fit(final_wflw, data = data)
# Capture the top predictors by absolute value of coefficient
tidy(final_fit) |>
arrange(desc(abs(estimate))) |>
slice_head(n = 10) |>
select(-penalty) |>
print()
}
We’ll call this for each cluster.
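Each per-cluster call needs the same one-vs-rest target: 1 for the cluster of interest, 0 for everything else. As a minimal sketch of that step (base R on made-up data; `toy` and `make_target_df` are hypothetical names, not part of the original analysis), the targets for every cluster could be built in one pass:

```r
# Toy stand-in for nba_scaled_full (hypothetical data)
set.seed(42)
toy <- data.frame(
  feat_a = rnorm(30),
  feat_b = rnorm(30),
  cluster = factor(rep(1:3, each = 10))
)

# Build a binary-target data frame for one cluster:
# target = 1 for the cluster of interest, 0 for all other clusters
make_target_df <- function(data, cl) {
  data$target <- factor(ifelse(data$cluster == cl, 1, 0))
  data$cluster <- NULL
  data
}

# One data frame per cluster level, named by cluster
target_dfs <- lapply(levels(toy$cluster), make_target_df, data = toy)
names(target_dfs) <- levels(toy$cluster)

# Number of positive rows in each one-vs-rest data frame
print(sapply(target_dfs, function(d) sum(d$target == "1")))
```

Each element of `target_dfs` has the same shape as the `target_cluster_df` objects built by hand in the sections that follow, so it could be passed to `get_elasnet_top_features()` in the same way.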
We’ll set up some parallelization for this:
set.seed(729)
# Define parallelization
cores_target <- ceiling(parallel::detectCores() * 0.75)
doParallel::registerDoParallel(cores = cores_target)
Artificial Intelligence
We’re going to let AI suggest some labels. All it will see is 1) our prompt and 2) the player names pertaining to the cluster. This will lead to a less biased approach.
This is the prompt we’ll use (along with the list of names) against OpenAI’s ChatGPT 4o model:
Below are a list of recent NBA player names. Assume these players belong in a collective group based on their play style and role. Generate 5 unique suggestions for a group label that is short and sweet but descriptive of the group. Restrict evaluation to style and role; avoid analysis rooted in reputation, playing time, etc.
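For repeatability, the prompt text and a cluster’s player names can be combined programmatically before being pasted into the chat interface. This is only a sketch: `build_label_prompt` is a hypothetical helper, and no API call is made here.

```r
# Hypothetical helper: combine the fixed instructions with one cluster's names
build_label_prompt <- function(player_names) {
  instructions <- paste(
    "Below are a list of recent NBA player names.",
    "Assume these players belong in a collective group based on their play style and role.",
    "Generate 5 unique suggestions for a group label that is short and sweet but descriptive of the group.",
    "Restrict evaluation to style and role; avoid analysis rooted in reputation, playing time, etc."
  )
  paste0(
    instructions,
    "\n\nPlayer names:\n",
    paste(player_names, collapse = "\n")
  )
}

# Three names from Cluster #1, used here only as sample input
prompt <- build_label_prompt(
  c("Russell Westbrook", "Draymond Green", "Kyle Anderson")
)
cat(prompt)
```

In the real workflow, the input would be the full name vector for a cluster, for example `nba_ids[nba_ids$cluster == "1", ]$profile_name`.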
Cluster #1
Player names:
nba_ids[nba_ids$cluster == "1", ]$profile_name
[1] "Russell Westbrook" "Draymond Green" "Kyle Anderson"
[4] "Cole Anthony" "Josh Giddey" "T.J. McConnell"
[7] "Tre Jones" "Killian Hayes" "Talen Horton-Tucker"
[10] "Jalen Suggs" "Jaden Ivey" "Ben Simmons"
[13] "Theo Maledon" "Dennis Smith Jr." "Ricky Rubio"
[16] "Markelle Fultz" "Ish Smith" "R.J. Hampton"
[19] "Kris Dunn" "Derrick Rose" "Dalano Banton"
[22] "Tomas Satoransky" "Scoot Henderson" "Elfrid Payton"
[25] "Jordan Goodwin" "Josh Christopher" "Saben Lee"
[28] "Trent Forrest" "Blake Wesley" "Keon Johnson"
[31] "Rajon Rondo" "Vasilije Micic" "Brad Wanamaker"
[34] "Daishen Nix" "Jared Butler" "Scotty Pippen Jr."
[37] "Brandon Goodwin"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "1", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -6.99
2 fouls_transition_take_fouls_drawn 0.385
3 misc_on_off_rtg -0.344
4 shot_short_mid_range_frequency 0.333
5 turnovers_lost_ball_out_of_bounds_turnovers 0.301
6 second_second_chance_turnovers 0.300
7 misc_plus_minus -0.273
8 fouls_clear_path_fouls 0.271
9 misc_blocked_corner3 0.265
10 free_technical_free_throw_trips -0.263
These top features are interesting. There’s clear evidence of a) athleticism and skill in the open court, b) some tendency toward mistakes, and c) below-average overall impact.
ChatGPT generated the following label suggestions:
- Playmaking Hustlers
- Versatile Initiators
- Dynamic Facilitators
- Crafty Drivers
- Hybrid Creators
I’m somewhat drawn to words like “initiator” and “hustler”. I don’t see any evidence of “facilitator” or “creator”. There’s a good mix of physical profile and style. Let’s go with “Versatile Anchor”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "1", 2] <- "Versatile Anchor"
cluster_labels[cluster_labels$cluster == "1", 3] <- "VA"
Cluster #2
Player names:
nba_ids[nba_ids$cluster == "2", ]$profile_name
[1] "Nikola Vucevic" "Pascal Siakam" "Tobias Harris"
[4] "Josh Hart" "Keldon Johnson" "Kyle Kuzma"
[7] "Aaron Gordon" "John Collins" "Deni Avdija"
[10] "Lauri Markkanen" "Scottie Barnes" "Bobby Portis"
[13] "Brook Lopez" "Miles Bridges" "Bruce Brown"
[16] "Myles Turner" "Michael Porter Jr." "Jaren Jackson Jr."
[19] "Rui Hachimura" "Kelly Olynyk" "Naz Reid"
[22] "KJ Martin" "Christian Wood" "Jae'Sean Tate"
[25] "Jabari Smith Jr." "Jalen Williams" "Jonathan Kuminga"
[28] "Larry Nance Jr." "Moritz Wagner" "Trey Lyles"
[31] "Kevin Love" "Bennedict Mathurin" "Darius Bazley"
[34] "Santi Aldama" "Jeremy Sochan" "Jalen Johnson"
[37] "Zach Collins" "Hamidou Diallo" "Aleksej Pokusevski"
[40] "Keita Bates-Diop" "Chimezie Metu" "Trendon Watford"
[43] "JaMychal Green" "Dario Saric" "Isaiah Roby"
[46] "Blake Griffin" "Chet Holmgren" "Josh Jackson"
[49] "Otto Porter Jr." "Bol Bol" "Justise Winslow"
[52] "Serge Ibaka" "Anthony Gill" "Jaylin Williams"
[55] "Nemanja Bjelica" "LaMarcus Aldridge" "Sandro Mamukelashvili"
[58] "Paul Millsap" "Eric Paschall" "Duop Reath"
[61] "David Nwaba" "Gorgui Dieng" "Frank Kaminsky"
[64] "Eugene Omoruyi"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "2", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -4.51
2 second_second_chance_corner3frequency -0.703
3 turnovers_x3second_violations -0.669
4 second_second_chance_at_rim_frequency 0.624
5 second_second_chance_arc3frequency -0.561
6 free_technical_free_throw_trips -0.555
7 shot_long_mid_range_pct_blocked 0.443
8 shot_at_rim_accuracy 0.414
9 fouls_charge_fouls_drawn 0.409
10 scoring_fg2a_blocked 0.384
What first catches my eye with these features are second_second_chance_at_rim_frequency and shot_at_rim_accuracy. These are players oriented near the basket. Next, what catches my eye are the negative coefficients on some of the second-chance 3-point frequency features. That points to a lower 3-point frequency on second-chance opportunities, but not necessarily fewer 3s overall.
ChatGPT generated the following label suggestions:
- Versatile Wings
- Dynamic Bigs
- Stretch Forwards
- Two-Way Frontcourt
- Hybrid Playmakers
In comparing these options to the player list, I’m drawn to the “Hybrid Playmakers” label, but it’s not very descriptive. We have, again, a good mix of physical profile and style. Therefore, let’s settle on “Versatile Finisher”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "2", 2] <- "Versatile Finisher"
cluster_labels[cluster_labels$cluster == "2", 3] <- "VF"
Cluster #3
Player names:
nba_ids[nba_ids$cluster == "3", ]$profile_name
[1] "Mikal Bridges" "RJ Barrett" "Terry Rozier"
[4] "Gary Trent Jr." "Saddiq Bey" "Luguentz Dort"
[7] "Andrew Wiggins" "Tyler Herro" "Franz Wagner"
[10] "Jalen Green" "Jerami Grant" "Dillon Brooks"
[13] "Norman Powell" "Bojan Bogdanovic" "Desmond Bane"
[16] "Bogdan Bogdanovic" "Devin Vassell" "Eric Gordon"
[19] "De'Andre Hunter" "Josh Richardson" "Alec Burks"
[22] "Klay Thompson" "Seth Curry" "Gordon Hayward"
[25] "Marcus Morris Sr." "Lonnie Walker IV" "Will Barton"
[28] "Evan Fournier" "Terrence Ross" "Danilo Gallinari"
[31] "Carmelo Anthony" "Malaki Branham" "Jaylen Nowell"
[34] "Jordan Nwora" "Shaedon Sharpe" "Rudy Gay"
[37] "Chris Duarte" "Furkan Korkmaz" "Brandon Miller"
[40] "Kendrick Nunn" "Terence Davis" "Jaden Hardy"
[43] "Brandon Boston Jr." "Dwayne Bacon" "Jeremy Lamb"
[46] "AJ Griffin" "Jordan Hawkins" "Duane Washington Jr."
[49] "Denzel Valentine" "GG Jackson II"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "3", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -6.45
2 misc_at_rim_off_rebounded_pct 0.560
3 assists_long_mid_range_assists -0.507
4 fouls_charge_fouls_drawn -0.475
5 second_second_chance_at_rim_accuracy 0.445
6 shot_corner3frequency -0.436
7 shot_short_mid_range_pct_assisted -0.428
8 assists_arc3assists -0.415
9 shot_heave_attempts 0.410
10 assists_three_pt_assists -0.392
What catches my eye is that this group is lower on assist features than other groups, but higher on rebounding-related and hustle features.
ChatGPT generated the following label suggestions:
- Scoring Wings
- Perimeter Playmakers
- Versatile Shooters
- Dynamic Swingmen
- Offensive Engines
I see all of these words in this group to some extent. The combo that most seems interesting is probably “versatile” and “engine”, so we’ll go with “Versatile Engine”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "3", 2] <- "Versatile Engine"
cluster_labels[cluster_labels$cluster == "3", 3] <- "VE"
Cluster #4
Player names:
nba_ids[nba_ids$cluster == "4", ]$profile_name
[1] "Jayson Tatum" "Anthony Edwards"
[3] "DeMar DeRozan" "Dejounte Murray"
[5] "Luka Doncic" "De'Aaron Fox"
[7] "Trae Young" "Jalen Brunson"
[9] "Devin Booker" "Jaylen Brown"
[11] "James Harden" "Darius Garland"
[13] "Stephen Curry" "Donovan Mitchell"
[15] "Shai Gilgeous-Alexander" "Damian Lillard"
[17] "LeBron James" "Zach LaVine"
[19] "Jordan Poole" "Chris Paul"
[21] "Jimmy Butler" "Brandon Ingram"
[23] "Kevin Durant" "Kyrie Irving"
[25] "Jordan Clarkson" "Paul George"
[27] "Bradley Beal" "Khris Middleton"
[29] "Ja Morant" "LaMelo Ball"
[31] "Jamal Murray" "Collin Sexton"
[33] "Kawhi Leonard" "Paolo Banchero"
[35] "Cade Cunningham" "Kevin Porter Jr."
[37] "Cam Thomas" "John Wall"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "4", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -3.42
2 scoring_pts_unassisted2s 0.0598
3 scoring_pts_unassisted3s 0.0577
4 free_technical_free_throw_trips 0.0566
5 misc_first_chance_points 0.0537
6 scoring_points 0.0517
7 shot_long_mid_range_fgm 0.0509
8 scoring_usage 0.0507
9 shot_long_mid_range_fga 0.0492
10 scoring_ft_points 0.0471
This group is pretty clear. Scoring, creating, playmaking.
ChatGPT generated the following label suggestions:
- Elite Creators
- Dynamic Scorers
- Playmaking Stars
- Offensive Leaders
- All-Around Playmakers
To borrow a word from a previous cluster, I like the word “engine”. These are the players that engage the team’s “drivetrain”, so to speak. Let’s go with “Perimeter Engine”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "4", 2] <- "Perimeter Engine"
cluster_labels[cluster_labels$cluster == "4", 3] <- "PE"
Cluster #5
Player names:
nba_ids[nba_ids$cluster == "5", ]$profile_name
[1] "Jarrett Allen" "Deandre Ayton" "Evan Mobley"
[4] "Isaiah Stewart" "Wendell Carter Jr." "Nic Claxton"
[7] "Kevon Looney" "Chris Boucher" "Isaiah Hartenstein"
[10] "Jarred Vanderbilt" "Onyeka Okongwu" "Precious Achiuwa"
[13] "Dwight Powell" "Drew Eubanks" "Marvin Bagley III"
[16] "Brandon Clarke" "Mo Bamba" "Xavier Tillman Sr."
[19] "Jaxson Hayes" "Daniel Theis" "Montrezl Harrell"
[22] "Richaun Holmes" "Thaddeus Young" "Jalen Smith"
[25] "Nick Richards" "Goga Bitadze" "Paul Reed"
[28] "James Wiseman" "Tari Eason" "Taj Gibson"
[31] "Khem Birch" "Luke Kornet" "Jabari Walker"
[34] "Jock Landale" "Zeke Nnaji" "Thomas Bryant"
[37] "Alex Len" "Robin Lopez" "Damian Jones"
[40] "Nerlens Noel" "Enes Freedom" "Amen Thompson"
[43] "Wenyen Gabriel" "Dewayne Dedmon" "Derrick Favors"
[46] "Ausar Thompson" "Jonathan Isaac" "Thanasis Antetokounmpo"
[49] "Omer Yurtseven" "Tony Bradley" "Usman Garuba"
[52] "Terry Taylor"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "5", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -7.40
2 shot_corner3pct_assisted 0.497
3 shot_arc3pct_assisted 0.440
4 shot_heave_attempts 0.426
5 scoring_assisted3s_pct 0.392
6 rebounds_self_o_reb_pct -0.319
7 turnovers_bad_pass_out_of_bounds_turnovers -0.310
8 second_second_chance_fg3pct 0.310
9 turnovers_travels 0.260
10 second_second_chance_arc3frequency -0.254
What’s interesting about this group’s features, compared to the list of players, is how much of their shooting is assisted and the propensity for 3-point field goals.
ChatGPT generated the following label suggestions:
- Rim Protectors
- Paint Enforcers
- Dynamic Bigs
- Post Specialists
- Interior Anchors
“Dynamic Bigs” is the only example I like but we’re trying to shy away from physical profiles and stick with style/role. Let’s target “Interior Connector”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "5", 2] <- "Interior Connector"
cluster_labels[cluster_labels$cluster == "5", 3] <- "IC"
Cluster #6
Player names:
nba_ids[nba_ids$cluster == "6", ]$profile_name
[1] "Fred VanVleet" "Tyrese Haliburton"
[3] "Jrue Holiday" "Coby White"
[5] "Dennis Schroder" "CJ McCollum"
[7] "Tyrese Maxey" "Derrick White"
[9] "D'Angelo Russell" "Mike Conley"
[11] "Caris LeVert" "Kyle Lowry"
[13] "Reggie Jackson" "Immanuel Quickley"
[15] "Spencer Dinwiddie" "Tyus Jones"
[17] "Anfernee Simons" "Malik Monk"
[19] "Marcus Smart" "Austin Reaves"
[21] "Malcolm Brogdon" "De'Anthony Melton"
[23] "Nickeil Alexander-Walker" "Monte Morris"
[25] "Devonte' Graham" "Delon Wright"
[27] "Payton Pritchard" "Davion Mitchell"
[29] "Joe Ingles" "Shake Milton"
[31] "Cameron Payne" "Gabe Vincent"
[33] "Cory Joseph" "Aaron Holiday"
[35] "Andrew Nembhard" "Tre Mann"
[37] "Eric Bledsoe" "Jose Alvarado"
[39] "Jordan McLaughlin" "Raul Neto"
[41] "Lonzo Ball" "Bones Hyland"
[43] "Malachi Flynn" "Ty Jerome"
[45] "George Hill" "Keyonte George"
[47] "Goran Dragic" "Facundo Campazzo"
[49] "Brandin Podziemski" "Kemba Walker"
[51] "Victor Oladipo" "Lou Williams"
[53] "D.J. Augustin" "Kira Lewis Jr."
[55] "Frank Ntilikina" "Marcus Sasser"
[57] "Trey Burke" "Skylar Mays"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "6", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -6.23
2 fouls_charge_fouls_drawn 0.496
3 shot_long_mid_range_pct_assisted -0.421
4 fouls_transition_take_fouls_drawn -0.387
5 free_non_shooting_fouls_drawn -0.381
6 free_three_pt_shooting_fouls_drawn -0.372
7 shot_short_mid_range_pct_assisted -0.371
8 scoring_assisted2s_pct -0.354
9 scoring_non_putbacks_assisted2s_pct -0.339
10 second_second_chance_corner3pct_assisted 0.334
These features are interesting. What sticks out? a) Their scoring isn’t really assisted, and b) there’s a distinctive approach to defense where they draw a lot of charge fouls.
ChatGPT generated the following label suggestions:
- Floor Generals
- Playmaking Guards
- Perimeter Orchestrators
- Dynamic Ball Handlers
- Backcourt Catalysts
Again, we’re trying to shy away from traditional terminology. I like “orchestrator” as it implies more responsibility than just “facilitator”. However, it’s mostly used for traditional guard positions. Let’s go with “Perimeter Anchor” instead.
# Save cluster label
cluster_labels[cluster_labels$cluster == "6", 2] <- "Perimeter Anchor"
cluster_labels[cluster_labels$cluster == "6", 3] <- "PA"
Cluster #7
Player names:
nba_ids[nba_ids$cluster == "7", ]$profile_name
[1] "Harrison Barnes" "Royce O'Neale" "Dorian Finney-Smith"
[4] "P.J. Washington" "Jaden McDaniels" "Isaac Okoro"
[7] "OG Anunoby" "Kelly Oubre Jr." "Terance Mann"
[10] "Grant Williams" "Ayo Dosunmu" "Nicolas Batum"
[13] "Al Horford" "Herbert Jones" "Caleb Martin"
[16] "Patrick Williams" "Alex Caruso" "Jeff Green"
[19] "Taurean Prince" "P.J. Tucker" "Matisse Thybulle"
[22] "Patrick Beverley" "Derrick Jones Jr." "Torrey Craig"
[25] "Robert Covington" "Aaron Nesmith" "Obi Toppin"
[28] "Josh Green" "Naji Marshall" "Kenrich Williams"
[31] "Maxi Kleber" "John Konchar" "Dean Wade"
[34] "Troy Brown Jr." "Jalen McDaniels" "Cody Martin"
[37] "Josh Okogie" "Aaron Wiggins" "Cam Reddish"
[40] "Chuma Okeke" "Oshae Brissett" "Christian Braun"
[43] "Ochai Agbaji" "Ziaire Williams" "Javonte Green"
[46] "Lamar Stevens" "Dyson Daniels" "Nassir Little"
[49] "Garrett Temple" "Haywood Highsmith" "Danuel House Jr."
[52] "Moses Moody" "Juan Toscano-Anderson" "Jeremiah Robinson-Earl"
[55] "David Roddy" "Gary Payton II" "Stanley Johnson"
[58] "Yuta Watanabe" "Jaime Jaquez Jr." "James Johnson"
[61] "Toumani Camara" "Bilal Coulibaly" "Peyton Watson"
[64] "Kevin Knox II" "Markieff Morris" "Andre Iguodala"
[67] "JT Thor" "DeAndre' Bembry" "Juancho Hernangomez"
[70] "Max Christie" "Romeo Langford" "Sterling Brown"
[73] "Kent Bazemore" "Jake LaRavia" "Anthony Black"
[76] "Bryce McGowens" "Kessler Edwards" "Solomon Hill"
[79] "Rodney Hood" "Maurice Harkless" "Anthony Lamb"
[82] "Kris Murray" "Edmond Sumner" "Vince Williams Jr."
[85] "PJ Dozier" "Vit Krejci" "Vlatko Cancar"
[88] "CJ Elleby" "Semi Ojeleye" "Nikola Jovic"
[91] "MarJon Beauchamp" "Ish Wainright" "Trevor Ariza"
[94] "Jalen Wilson" "Dalen Terry" "Dante Exum"
[97] "Johnny Davis" "Ousmane Dieng" "Joshua Primo"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "7", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -4.12
2 free_two_pt_shooting_fouls_drawn_pct 0.413
3 second_second_chance_shot_quality_avg 0.378
4 scoring_efg_pct -0.359
5 second_second_chance_arc3pct_assisted 0.353
6 scoring_ts_pct -0.334
7 shot_long_mid_range_pct_blocked -0.330
8 shot_avg2pt_shot_distance -0.314
9 second_second_chance_at_rim_frequency 0.282
10 second_second_chance_arc3frequency -0.262
What sticks out is that these are not efficient scorers. But there is some shot creation and getting to the free-throw line that’s interesting.
ChatGPT generated the following label suggestions:
- Two-Way Wings
- Defensive Specialists
- Versatile Role Players
- Perimeter Stoppers
- Glue Guys
There are certainly some good defenders in this list but that’s not the primary thing here. “Two-Way” could be good. “Glue” and “connector” are intriguing words. I think the point is they do a bit of everything, but specialize in non-scoring events. Let’s go with “Versatile Connector”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "7", 2] <- "Versatile Connector"
cluster_labels[cluster_labels$cluster == "7", 3] <- "VC"
Cluster #8
Player names:
nba_ids[nba_ids$cluster == "8", ]$profile_name
[1] "Buddy Hield" "Kentavious Caldwell-Pope"
[3] "Kevin Huerter" "Malik Beasley"
[5] "Tim Hardaway Jr." "Grayson Allen"
[7] "Donte DiVincenzo" "Duncan Robinson"
[9] "Georges Niang" "Cameron Johnson"
[11] "Pat Connaughton" "Reggie Bullock Jr."
[13] "Max Strus" "Corey Kispert"
[15] "Keegan Murray" "Justin Holiday"
[17] "Cedi Osman" "Luke Kennard"
[19] "Gary Harris" "Trey Murphy III"
[21] "Doug McDermott" "Patty Mills"
[23] "Jae Crowder" "Garrison Mathews"
[25] "Amir Coffey" "Jevon Carter"
[27] "Quentin Grimes" "Landry Shamet"
[29] "Isaiah Joe" "Joe Harris"
[31] "Damion Lee" "Sam Hauser"
[33] "Danny Green" "Davis Bertans"
[35] "Wesley Matthews" "Austin Rivers"
[37] "Svi Mykhailiuk" "Miles McBride"
[39] "Bryn Forbes" "Simone Fontecchio"
[41] "Mike Muscala" "Julian Champagnie"
[43] "Cason Wallace" "Ben McLemore"
[45] "Isaiah Livers" "Avery Bradley"
[47] "Frank Jackson" "Gradey Dick"
[49] "Sam Merrill" "Wayne Ellington"
[51] "Tony Snell" "Timothe Luwawu-Cabarrot"
[53] "Caleb Houstan" "Lindy Waters III"
[55] "Keon Ellis" "Rodney McGruder"
[57] "Armoni Brooks" "AJ Green"
[59] "Ben Sheppard"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "8", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -3.00
2 shot_corner3fgm 0.0538
3 scoring_pts_assisted3s 0.0534
4 second_second_chance_corner3frequency 0.0480
5 scoring_fg3a_pct 0.0463
6 shot_corner3fga 0.0451
7 scoring_assisted2s_pct 0.0443
8 shot_at_rim_pct_assisted 0.0425
9 second_second_chance_corner3fgm 0.0419
10 second_second_chance_fg3m 0.0415
What sticks out is the orientation to perimeter activity, clearly.
ChatGPT generated the following label suggestions:
- Catch-and-Shoot Crew
- Perimeter Marksmen
- Wing Snipers
- Spot-Up Specialists
- Floor Spacers
I think I like “Perimeter Finisher” best. It fits with some of the other language we’ve used too so it’s cohesive.
# Save cluster label
cluster_labels[cluster_labels$cluster == "8", 2] <- "Perimeter Finisher"
cluster_labels[cluster_labels$cluster == "8", 3] <- "PF"
Cluster #9
Player names:
nba_ids[nba_ids$cluster == "9", ]$profile_name
[1] "Rudy Gobert" "Ivica Zubac" "Clint Capela"
[4] "Jakob Poeltl" "Mason Plumlee" "Daniel Gafford"
[7] "Andre Drummond" "Mitchell Robinson" "Steven Adams"
[10] "Jalen Duren" "Robert Williams III" "Walker Kessler"
[13] "Bismack Biyombo" "DeAndre Jordan" "Tristan Thompson"
[16] "JaVale McGee" "Isaiah Jackson" "Jericho Sims"
[19] "Dwight Howard" "Willy Hernangomez" "Day'Ron Sharpe"
[22] "Cody Zeller" "Moses Brown" "Dereck Lively II"
[25] "Hassan Whiteside" "Bruno Fernando" "Trayce Jackson-Davis"
[28] "Mark Williams"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "9", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -10.5
2 shot_corner3pct_assisted -0.240
3 rebounds_self_o_reb_pct 0.210
4 second_second_chance_fg3pct -0.190
5 shot_avg3pt_shot_distance 0.177
6 turnovers_x3second_violations 0.166
7 second_second_chance_arc3pct_assisted -0.160
8 rebounds_self_o_reb 0.158
9 shot_unblocked_corner3accuracy -0.158
10 shot_corner3accuracy -0.157
In the opposite vein to above, this group sticks out for their presence on the interior.
ChatGPT generated the following label suggestions:
- Rim Protectors
- Paint Guardians
- Defensive Anchors
- Rebounding Specialists
- Post Defenders
Let’s go with “Interior Anchor”. This helps describe the role on both ends of the floor. If we were just focused on “offense”, we would classify as “Interior Finisher”.
# Save cluster label
cluster_labels[cluster_labels$cluster == "9", 2] <- "Interior Anchor"
cluster_labels[cluster_labels$cluster == "9", 3] <- "IA"
Cluster #10
Player names:
nba_ids[nba_ids$cluster == "10", ]$profile_name
[1] "Domantas Sabonis" "Nikola Jokic" "Julius Randle"
[4] "Bam Adebayo" "Giannis Antetokounmpo" "Jonas Valanciunas"
[7] "Anthony Davis" "Karl-Anthony Towns" "Joel Embiid"
[10] "Kristaps Porzingis" "Alperen Sengun" "Jusuf Nurkic"
[13] "Zion Williamson" "Victor Wembanyama" "DeMarcus Cousins"
Let’s pull the top features explaining this cluster assignment:
target_cluster_df <- nba_scaled_full |>
mutate(target = factor(ifelse(cluster == "10", 1, 0))) |>
select(-cluster)
get_elasnet_top_features(target_cluster_df)
# A tibble: 10 × 2
term estimate
<chr> <dbl>
1 (Intercept) -20.7
2 fouls_fouls_drawn 4.26
3 shot_arc3pct_assisted 2.54
4 rebounds_def_at_rim_rebound_pct 2.34
5 misc_at_rim_off_rebounded_pct 2.30
6 misc_blocked_long_mid_range 1.86
7 assists_arc3assists 1.43
8 turnovers_dead_ball_turnovers 1.13
9 misc_blocked_corner3 -1.07
10 fouls_charge_fouls 1.05
What sticks out in this group is 3P shooting with a bunch of rebounding and being in the center of the action.
ChatGPT generated the following label suggestions:
- Skilled Bigs
- Versatile Frontcourt
- Playmaking Centers
- Dominant Big Men
- All-Around Bigs
Let’s go with “Interior Engine” for the moment. It describes this idea of engaging the team’s “drivetrain” from inside-out.
# Save cluster label
cluster_labels[cluster_labels$cluster == "10", 2] <- "Interior Engine"
cluster_labels[cluster_labels$cluster == "10", 3] <- "IE"
Now we can stop the parallelization:
doParallel::stopImplicitCluster()
New Position Labels
Summary
Here are our final clusters, with labels and abbreviations. We settled on distinguishing orientation of play and style with “interior”, “perimeter”, and “versatile” descriptors. Obviously these aren’t exclusive: a “perimeter engine” would certainly score and operate on the interior as well. But the descriptor captures their tendencies: “out-in” vs “in-out”.
Next we settled on 4 roles or styles: “connector”, “anchor”, “finisher”, and “engine”. Again, these aren’t exclusive but indicate where players tend and lean in their overall style and assumed roles.
cluster_labels |>
arrange(label)
# A tibble: 10 × 3
cluster label abbrev
<fct> <chr> <chr>
1 9 Interior Anchor IA
2 5 Interior Connector IC
3 10 Interior Engine IE
4 6 Perimeter Anchor PA
5 4 Perimeter Engine PE
6 8 Perimeter Finisher PF
7 1 Versatile Anchor VA
8 7 Versatile Connector VC
9 3 Versatile Engine VE
10 2 Versatile Finisher VF
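Since every label pairs one orientation with one role, the ten labels can be laid out on a 3 × 4 grid to see which combinations the clustering did and did not produce. The sketch below uses base R and the labels exactly as listed in the table above; the variable names are illustrative.

```r
# The ten labels assigned above
labels <- c(
  "Interior Anchor", "Interior Connector", "Interior Engine",
  "Perimeter Anchor", "Perimeter Engine", "Perimeter Finisher",
  "Versatile Anchor", "Versatile Connector", "Versatile Engine",
  "Versatile Finisher"
)

# Split each label into its orientation and role
parts <- do.call(rbind, strsplit(labels, " ", fixed = TRUE))

# Cross-tabulate orientation against role (1 = used, 0 = unused)
grid <- table(orientation = parts[, 1], role = parts[, 2])
print(grid)

# List the orientation/role combinations that no cluster received
unused <- subset(as.data.frame(grid), Freq == 0)
print(paste(unused$orientation, unused$role))
```

Two of the twelve possible combinations go unused: “Interior Finisher” and “Perimeter Connector”.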
Let’s go through some exercises where we intersect these labels with the original data. Now that these labels have meaning, we can benchmark more easily against domain knowledge.
nba_adjusted_full <- nba_adjusted_full |>
inner_join(cluster_labels, by = "cluster") |>
bind_cols(nba_ids)
Player Distributions
How many players across the NBA fall into these groups? We’d expect “Engines” to have the fewest players and the “Versatile” group to be the largest overall. Let’s see if that matches up with preconceptions.
nba_adjusted_full |>
group_by(label, abbrev) |>
count() |>
ungroup() |>
mutate(perc = n / sum(n))
# A tibble: 10 × 4
label abbrev n perc
<chr> <chr> <int> <dbl>
1 Interior Anchor IA 28 0.056
2 Interior Connector IC 52 0.104
3 Interior Engine IE 15 0.03
4 Perimeter Anchor PA 58 0.116
5 Perimeter Engine PE 38 0.076
6 Perimeter Finisher PF 59 0.118
7 Versatile Anchor VA 37 0.074
8 Versatile Connector VC 99 0.198
9 Versatile Engine VE 50 0.1
10 Versatile Finisher VF 64 0.128
Our theories held pretty well. “Engines” are the smallest role within the Interior and Perimeter groups (within Versatile, only “Anchor” is smaller), and the “Versatile” group is the largest of them all.
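To check these expectations at the group level, the counts can be summed by orientation and by role. This is a minimal base R sketch that re-enters the counts from the table above by hand; in the actual analysis the same totals could be derived from `nba_adjusted_full`.

```r
# Counts copied from the distribution table above
counts <- data.frame(
  label = c(
    "Interior Anchor", "Interior Connector", "Interior Engine",
    "Perimeter Anchor", "Perimeter Engine", "Perimeter Finisher",
    "Versatile Anchor", "Versatile Connector", "Versatile Engine",
    "Versatile Finisher"
  ),
  n = c(28, 52, 15, 58, 38, 59, 37, 99, 50, 64)
)

# Split each label into orientation (first word) and role (second word)
parts <- do.call(rbind, strsplit(counts$label, " ", fixed = TRUE))
counts$orientation <- parts[, 1]
counts$role <- parts[, 2]

# Total players per orientation and per role
by_orientation <- tapply(counts$n, counts$orientation, sum)
by_role <- tapply(counts$n, counts$role, sum)

print(by_orientation)
print(by_role)
```

Summed this way, “Versatile” accounts for 250 of the 500 players, and “Engine” is the smallest role overall with 103.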
League Distributions
We have labels for the “current” teams. Let’s see how many of each of these belong to each team.
NOTE: data gathered represents only the top 500 players by minutes over the last 4.5 seasons; therefore, some teams will see fewer than others and some players don’t actively belong to a team
nba_adjusted_full |>
group_by(profile_team_abbreviation, abbrev) |>
count() |>
ungroup() |>
arrange(abbrev) |>
pivot_wider(
names_from = abbrev,
values_from = n,
values_fill = 0
) |>
gt::gt()
| profile_team_abbreviation | IA | IC | IE | PA | PE | PF | VA | VC | VE | VF |
|---|---|---|---|---|---|---|---|---|---|---|
| ATL | 1 | 1 | 0 | 1 | 1 | 3 | 1 | 4 | 4 | 2 |
| BKN | 1 | 2 | 0 | 2 | 1 | 2 | 3 | 4 | 1 | 3 |
| CHA | 1 | 2 | 0 | 2 | 1 | 1 | 2 | 3 | 2 | 2 |
| CLE | 1 | 3 | 0 | 3 | 2 | 3 | 3 | 3 | 1 | 0 |
| DAL | 2 | 1 | 0 | 4 | 2 | 1 | 0 | 6 | 2 | 0 |
| DEN | 1 | 1 | 2 | 0 | 1 | 1 | 1 | 3 | 0 | 3 |
| DET | 1 | 3 | 0 | 2 | 1 | 6 | 2 | 2 | 1 | 2 |
| GSW | 1 | 2 | 0 | 2 | 1 | 2 | 2 | 4 | 1 | 3 |
| HOU | 1 | 3 | 1 | 2 | 0 | 1 | 1 | 1 | 2 | 4 |
| IND | 2 | 1 | 0 | 3 | 0 | 1 | 1 | 3 | 0 | 3 |
| LAC | 1 | 1 | 0 | 2 | 4 | 1 | 1 | 7 | 1 | 0 |
| LAL | 1 | 2 | 1 | 5 | 1 | 2 | 0 | 5 | 1 | 2 |
| MIN | 1 | 0 | 1 | 3 | 1 | 3 | 1 | 2 | 0 | 1 |
| NOP | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 3 | 3 | 0 |
| NYK | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 2 | 1 |
| OKC | 1 | 2 | 0 | 0 | 1 | 3 | 0 | 4 | 2 | 3 |
| PHI | 1 | 2 | 1 | 3 | 1 | 1 | 0 | 4 | 2 | 2 |
| PHX | 1 | 1 | 1 | 2 | 3 | 2 | 2 | 4 | 2 | 1 |
| POR | 1 | 2 | 0 | 1 | 0 | 1 | 2 | 5 | 2 | 3 |
| SAC | 1 | 1 | 1 | 2 | 2 | 5 | 0 | 2 | 2 | 2 |
| TOR | 2 | 2 | 0 | 2 | 0 | 1 | 0 | 4 | 3 | 4 |
| UTA | 2 | 2 | 0 | 2 | 2 | 3 | 0 | 0 | 2 | 4 |
| BOS | 0 | 3 | 1 | 3 | 2 | 1 | 0 | 2 | 0 | 1 |
| CHI | 0 | 3 | 0 | 2 | 1 | 1 | 2 | 4 | 1 | 1 |
| MEM | 0 | 2 | 0 | 1 | 1 | 1 | 3 | 5 | 2 | 2 |
| MIA | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 4 | 1 |
| MIL | 0 | 2 | 1 | 2 | 2 | 2 | 0 | 4 | 2 | 3 |
| ORL | 0 | 3 | 0 | 1 | 1 | 3 | 3 | 2 | 2 | 1 |
| WAS | 0 | 2 | 1 | 1 | 1 | 2 | 3 | 2 | 1 | 4 |
| SAS | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 3 | 2 | 6 |
It’s pretty interesting to see different approaches.
- Interior
  - ORL (Orlando Magic): no Engines or Anchors, but 3 Interior Connectors (Wendell Carter Jr., Goga Bitadze, Jonathan Isaac)
  - SAS (San Antonio Spurs): no Anchors or Connectors and only 1 Engine (Victor Wembanyama)
- Perimeter
  - CLE (Cleveland Cavaliers): 8 players total, 3 of which are Perimeter Finishers (Georges Niang, Max Strus, Sam Merrill)
  - DEN (Denver Nuggets): only 2 players (Finisher: Justin Holiday, Engine: Jamal Murray) and no Perimeter Anchors
- Versatile
  - BOS (Boston Celtics): only 3 players and no Engines or Anchors
  - SAS (San Antonio Spurs): a total of 13, 6 of which are Versatile Finishers
Team Distributions
Let’s take a team like Boston and see which of our new position labels are getting minutes. Let’s work up a quick function for that:
peek_team_dist <- function(team) {
nba_adjusted_full |>
filter(profile_team_abbreviation == team) |>
mutate(mp_gm = profile_minutes / profile_games_played) |>
arrange(desc(mp_gm)) |>
select(profile_name, label, abbrev, mp_gm) |>
gt::gt()
}
peek_team_dist("BOS")
| profile_name | label | abbrev | mp_gm |
|---|---|---|---|
| Jayson Tatum | Perimeter Engine | PE | 36.13183 |
| Jaylen Brown | Perimeter Engine | PE | 34.47500 |
| Jrue Holiday | Perimeter Anchor | PA | 32.56537 |
| Kristaps Porzingis | Interior Engine | IE | 30.58824 |
| Derrick White | Perimeter Anchor | PA | 30.30796 |
| Al Horford | Versatile Connector | VC | 28.64198 |
| Enes Freedom | Interior Connector | IC | 20.27103 |
| Blake Griffin | Versatile Finisher | VF | 18.97902 |
| Payton Pritchard | Perimeter Anchor | PA | 18.60481 |
| Oshae Brissett | Versatile Connector | VC | 18.25000 |
| Sam Hauser | Perimeter Finisher | PF | 17.71220 |
| Xavier Tillman Sr. | Interior Connector | IC | 16.85950 |
| Luke Kornet | Interior Connector | IC | 12.98469 |
Really interesting. Boston is led by their Engines, then Anchors, and then some Connectors.
Let’s try another team, say the Portland Trail Blazers:
peek_team_dist("POR")
| profile_name | label | abbrev | mp_gm |
|---|---|---|---|
| Jerami Grant | Versatile Engine | VE | 33.82917 |
| Deandre Ayton | Interior Connector | IC | 30.67669 |
| Anfernee Simons | Perimeter Anchor | PA | 28.69600 |
| Scoot Henderson | Versatile Anchor | VA | 27.98734 |
| Deni Avdija | Versatile Finisher | VF | 26.22508 |
| Toumani Camara | Versatile Connector | VC | 26.15957 |
| Shaedon Sharpe | Versatile Engine | VE | 25.98438 |
| Robert Williams III | Interior Anchor | IA | 23.99379 |
| Matisse Thybulle | Versatile Connector | VC | 21.23970 |
| Justise Winslow | Versatile Finisher | VF | 19.98058 |
| Kris Murray | Versatile Connector | VC | 19.45783 |
| Ben McLemore | Perimeter Finisher | PF | 18.71795 |
| Jabari Walker | Interior Connector | IC | 17.17007 |
| Duop Reath | Versatile Finisher | VF | 16.01235 |
| Bryce McGowens | Versatile Connector | VC | 15.76415 |
| CJ Elleby | Versatile Connector | VC | 15.53409 |
| Dalano Banton | Versatile Anchor | VA | 14.02924 |
Here we see some differences: Connectors feature a little more towards the top, and the roster leans on Versatile-type players as opposed to Perimeter- or Interior-focused ones.
Next Steps
There are many avenues to take this analysis. We could analyze impact of each position on winning, understand career earnings through the lens of these new positions, and much more.
Where we are going to take the analysis is in the direction of understanding the style and role of incoming NBA prospects. One of the toughest parts of scouting the next wave of talent is judging how their game translates to the professional level.
How do the first round talents in the upcoming 2025 NBA draft project across our new positions?
This next phase requires:
- Pre-NBA measures for as many players from our cluster assigned data set as possible
- This will be our “training” set; we’ll intersect pre-NBA stats with the derived positions so far
- NOTE: some players lack sufficient pre-NBA data due to coming directly from high school or playing internationally. These data sets are spotty and hard to access. For our purposes, we’ll concentrate on players who played in the NCAA (collegiate basketball league in the United States) prior to being drafted in the NBA. This ensures we have consistent data for modeling the relationship of pre-NBA performance and eventual NBA positions. It helps that roughly 80% of our original player list will be featured in the training set.
- Active collegiate player measures
- We’ll also want a “testing” set that features currently active collegiate players who have yet to play in the NBA. We’ll use the first round projections from No Ceilings
- For reasons explained previously, we won’t collect measures for any international prospects ranked by No Ceilings (these are just 4 of the 30 first round prospects projected by No Ceilings)
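Because the training set is built by matching players on name, it is worth measuring how many labeled NBA players actually find a collegiate record. The sketch below uses small made-up tables (`nba_labels`, `college_stats`, and their values are hypothetical stand-ins, not the real data) to show one way to compute that coverage and list the players who drop out:

```r
library(dplyr)

# Hypothetical stand-in for the labeled NBA players
nba_labels <- tibble::tibble(
  profile_name = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  label = c("Perimeter Engine", "Interior Anchor", "Versatile Connector",
            "Perimeter Finisher", "Versatile Engine")
)

# Hypothetical stand-in for the collegiate stats (Player E has no NCAA record)
college_stats <- tibble::tibble(
  player_name = c("Player A", "Player B", "Player C", "Player D"),
  col_pts = c(18.2, 11.4, 9.7, 14.1)
)

# Players with both collegiate stats and an NBA label (the training rows)
matched <- college_stats |>
  inner_join(nba_labels, by = join_by(player_name == profile_name))

# Labeled NBA players with no collegiate match (high school, international,
# or a name spelled differently between the two sources)
unmatched <- nba_labels |>
  anti_join(college_stats, by = join_by(profile_name == player_name))

# Share of labeled players that make it into the training set
coverage <- nrow(matched) / nrow(nba_labels)
coverage
unmatched
```

Reviewing the `anti_join()` result is a cheap way to tell apart players who are legitimately missing from those lost to a spelling or suffix mismatch (for example "Jr." or "III").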
Projecting Incoming Prospects
Data Prep
We’ll first load our “training” data (collegiate data for active NBA players) and intersect with the new positions:
mbb_data_raw <- read.csv('../bballrefstats/college-players.csv')
mbb_data_pos <- mbb_data_raw |>
inner_join(
nba_adjusted_full |> select(profile_name, cluster_label = label, cluster_abbrev = abbrev),
by = join_by(player_name == profile_name)
)
We’ve got some NAs in our data:
# Check for missing or empty values
mbb_na <- sapply(mbb_data_pos, function(col) {
sum(is.na(col) | is.null(col) | col == "", na.rm = TRUE)
})
names(which(mbb_na > 0))
[1] "col_3pp" "col_obpm" "col_dbpm" "col_bpm" "col_per" "col_orbp"
[7] "col_drbp" "col_trbp" "col_astp" "col_stlp" "col_blkp" "col_usgp"
We do have some missing values. Why is that? There are two primary reasons:
- Calculation limitation: three-point percentage requires 3P attempts. Where there are none, there is no possible value.
- Era limitations: not all measures throughout basketball history have been available due to tracking technology evolving over time.
How do we resolve?
Well for #1, we’ll impute as zero. Since we account for volume via another measure (3PA), this shouldn’t be an issue.
mbb_fillna_1 <- mbb_data_pos |>
mutate(col_3pp = ifelse(is.na(col_3pp), 0, col_3pp))
For the situation in #2, every player would have had a value for these measures. While not “captured” at that point in time, had the technology been there the values would be represented. Therefore, by imputing the values, we stick with the best practice of estimating what the values would have been. Let’s do some kNN imputation!
mbb_imputed <- kNN(
mbb_fillna_1,
variable = setdiff(names(which(mbb_na > 0)), "col_3pp"),
k = 5
) |>
select(!contains("_imp"))
And let’s confirm we have fully complete data:
# Check for missing or empty values
mbb_na_2 <- sapply(mbb_imputed, function(col) {
sum(is.na(col) | is.null(col) | col == "", na.rm = TRUE)
})
any(mbb_na_2 > 0)
[1] FALSE
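To make the kNN step less of a black box, here is a simplified base R sketch of the idea: for a row with a missing value, find the k most similar rows that do have the value and borrow their median. It is an illustration only. As I understand it, `VIM::kNN()` uses a Gower-type distance and handles mixed column types, whereas this sketch assumes all-numeric predictors and Euclidean distance on scaled columns. The `toy` data and the `knn_impute_one()` helper are hypothetical.

```r
# Toy data with one missing value in `bpm` (hypothetical columns and values)
toy <- data.frame(
  pts = c(10, 12, 11, 25, 27, 26),
  ast = c(2, 3, 2, 7, 8, 7),
  bpm = c(1.0, 1.5, NA, 6.0, 6.5, 7.0)
)

# Impute missing values of `target` from the k nearest complete rows
knn_impute_one <- function(data, target, k = 2) {
  predictors <- setdiff(names(data), target)
  scaled <- scale(data[predictors])          # put predictors on a common scale
  donors <- which(!is.na(data[[target]]))    # rows that can lend a value
  out <- data
  for (i in which(is.na(data[[target]]))) {
    # Euclidean distance from row i to every donor row
    diffs <- sweep(scaled[donors, , drop = FALSE], 2, scaled[i, ])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- donors[order(dists)[seq_len(k)]]
    # Borrow the median of the neighbors' values
    out[[target]][i] <- median(data[[target]][nearest])
  }
  out
}

imputed <- knn_impute_one(toy, "bpm", k = 2)
imputed
```

In the toy data, the third row sits next to the two other low-usage rows, so its imputed `bpm` comes from them rather than from the high-usage group. That is the behavior we want from neighbor-based imputation for era-limited measures.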
And now, our training data is fully cleaned up! Let’s bring in our testing data, or the “incoming prospects”.
mbb_prospects <- read.csv('../bballrefstats/incoming-prospects.csv')
# Check for missing or empty values
any(sapply(mbb_prospects, function(col) {
sum(is.na(col) | is.null(col) | col == "", na.rm = TRUE)
}) > 0)
[1] FALSE
Perfect! And now, let’s just check to confirm we have similar columns:
setdiff(names(mbb_prospects), names(mbb_imputed))
[1] "draft_projection"
The only column in our testing data that isn’t in training is draft_projection, which is a context only field. Let’s prep the models!
Modeling Prep
With our prepped data, we’re ready to start defining the models and cross-validation infrastructure needed. We’ll use two models: a neural network (a single-hidden-layer perceptron via nnet) and a boosted trees approach (XGBoost).
These are quite different approaches, the former learning a set of weighted connections while the latter is tree-based. In this way, we can validate results by comparing how each performs under cross-validation.
Model Definition
mod_nn <- mlp(
hidden_units = tune(),
penalty = tune(),
epochs = tune()
) |>
set_engine("nnet") |>
set_mode("classification")
mod_xg <- boost_tree(
mtry = tune(),
trees = tune(),
tree_depth = tune(),
learn_rate = tune()
) |>
set_engine("xgboost") |>
set_mode("classification")
Cross Validation
With cross-validation, we’ll be able to confirm that the performance results we’re getting from the model aren’t due to chance. We’ll set up a 5-fold cross-validation.
mod_cv <- rsample::vfold_cv(mbb_imputed, v = 5)
We aren’t making a separate training/testing split of the labeled data, since the players we ultimately want to classify (the incoming prospects) don’t have NBA positions yet. V-fold cross-validation sets up analysis and assessment splits for us, so those results will allow us to evaluate on data the model wasn’t trained on anyway.
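To see what those folds look like, here is a small sketch on made-up data (the `toy` data frame is hypothetical; the real call uses `mbb_imputed`). With 100 rows and 5 folds, each fold should hold out 20 rows for assessment and fit on the other 80, and every row should be held out exactly once:

```r
library(rsample)

set.seed(814)
# Hypothetical stand-in for the training data
toy <- data.frame(id = 1:100, x = rnorm(100))

folds <- vfold_cv(toy, v = 5)

# Size of the fitting (analysis) and held-out (assessment) portion per fold
fold_sizes <- data.frame(
  fold = folds$id,
  n_analysis = vapply(folds$splits, function(s) nrow(analysis(s)), integer(1)),
  n_assessment = vapply(folds$splits, function(s) nrow(assessment(s)), integer(1))
)
fold_sizes

# Each row should appear in exactly one assessment set
held_out <- unlist(lapply(folds$splits, function(s) assessment(s)$id))
all(sort(held_out) == toy$id)
```

With ten classes of uneven size, one optional refinement would be passing `strata = cluster_label` to `vfold_cv()` so each fold keeps a similar class mix. That is a suggestion, not something the original analysis did.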
Recipe
Our recipe is fairly straightforward, but we will put some extra preprocessing steps in there: dummy encoding, normalization, and PCA. In short, we want to predict cluster_label using all of the collegiate measures.
mod_recipe <- recipe(cluster_label ~ ., mbb_imputed) |>
update_role(player_name, cluster_abbrev, new_role = "id") |>
step_dummy(school_conf) |>
step_normalize(all_numeric_predictors()) |>
step_pca(all_numeric_predictors())
Hyperparameters
mod_nn_grid <- grid_regular(
hidden_units(),
penalty(),
epochs(),
levels = 4
)
mod_xg_grid <- grid_regular(
trees(),
tree_depth(),
learn_rate(),
mtry(c(1, ceiling(sqrt(30)))),
levels = 4
)
Fitting the Models
Parallelization:
# Define parallelization
cores_target <- ceiling(parallel::detectCores() * 0.75)
doParallel::registerDoParallel(cores = cores_target)
set.seed(814)
Boosted Trees
# Configure workflow
mod_wflw_xg <-
workflow() |>
add_model(mod_xg) |>
add_recipe(mod_recipe)
# Run cross-validated tuning
set.seed(814)
mod_tune_xg <-
mod_wflw_xg |>
tune_grid(
resamples = mod_cv,
grid = mod_xg_grid,
metrics = metric_set(roc_auc)
)
Neural Network
Now let’s tune the hyperparameters on the neural network using cross-validation. Just as before, we’ll set up a workflow with the model and recipe, then tune against the neural network grid we set up previously.
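As a rough sense of the work involved, a regular grid with 4 levels per parameter grows quickly with the number of parameters. The sketch below counts candidates and model fits for both grids, assuming every parameter yields 4 distinct values and every candidate is fit once per fold (the level indices are placeholders, not the actual tuning values):

```r
n_levels <- 4
n_folds <- 5

# A regular grid is the full cross of the levels of each parameter;
# here the levels are just indices 1..4 standing in for the real values
nn_grid <- expand.grid(
  hidden_units = seq_len(n_levels),
  penalty = seq_len(n_levels),
  epochs = seq_len(n_levels)
)
xg_grid <- expand.grid(
  trees = seq_len(n_levels),
  tree_depth = seq_len(n_levels),
  learn_rate = seq_len(n_levels),
  mtry = seq_len(n_levels)
)

# Candidates per model and total fits across the 5 folds
fit_counts <- data.frame(
  model = c("neural network", "xgboost"),
  candidates = c(nrow(nn_grid), nrow(xg_grid)),
  fits = c(nrow(nn_grid), nrow(xg_grid)) * n_folds
)
fit_counts
```

The boosted trees grid has four times as many candidates as the neural network grid, which is one reason to expect its tuning run to take longer regardless of per-fit cost.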
# Configure workflow
mod_wflw_nn <-
workflow() |>
add_model(mod_nn) |>
add_recipe(mod_recipe)
# Run cross-validated tuning
mod_tune_nn <-
mod_wflw_nn |>
tune_grid(
resamples = mod_cv,
grid = mod_nn_grid,
metrics = metric_set(roc_auc)
)
Comparing the Models
With both of those tuned, let’s compare the top 5 configurations for each model:
mod_tune_xg |> collect_metrics() |> slice_max(mean, n = 5)
# A tibble: 5 × 10
mtry trees tree_depth learn_rate .metric .estimator mean n std_err
<int> <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl>
1 4 667 1 0.1 roc_auc hand_till 0.815 5 0.00691
2 2 667 1 0.1 roc_auc hand_till 0.814 5 0.00731
3 1 667 1 0.1 roc_auc hand_till 0.813 5 0.00779
4 6 667 1 0.1 roc_auc hand_till 0.813 5 0.00716
5 6 1333 1 0.1 roc_auc hand_till 0.804 5 0.00817
# ℹ 1 more variable: .config <chr>
The XGBoost model is producing “AUC” values of right around 0.81. We tuned 4 hyperparameters, and the top configurations share some common themes: a tree_depth of 1, a learn_rate of 0.1, and (in four of the five) 667 trees, while mtry ranges from 1 to 6 with little effect on the score. That’s a pretty performant model. Now for the neural network:
mod_tune_nn |> collect_metrics() |> slice_max(mean, n = 5)
# A tibble: 5 × 9
hidden_units penalty epochs .metric .estimator mean n std_err .config
<int> <dbl> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 10 1 670 roc_auc hand_till 0.854 5 0.00621 Preprocess…
2 10 1 1000 roc_auc hand_till 0.853 5 0.00641 Preprocess…
3 7 1 1000 roc_auc hand_till 0.852 5 0.00741 Preprocess…
4 7 1 340 roc_auc hand_till 0.852 5 0.00468 Preprocess…
5 10 1 340 roc_auc hand_till 0.851 5 0.00720 Preprocess…
The Neural Network model is producing “AUC” values of around 0.85. We tuned 3 hyperparameters. The only real common theme is a penalty of 1, with hidden_units of 7 or 10. The std_err values are in a similar range to XGBoost’s, mostly a touch lower. Additionally, the time to fit the model was quite a bit lower with Neural Network.
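One rough way to judge whether the gap between the two best configurations is meaningful is to express it in units of standard error. The sketch below copies the top row of each table above and treats the two estimates as independent, which is only an approximation since both models were scored on the same five folds:

```r
# Best configuration from each table above (mean AUC and its standard error)
best_xg <- c(mean = 0.815, std_err = 0.00691)
best_nn <- c(mean = 0.854, std_err = 0.00621)

# Difference in mean AUC
auc_gap <- best_nn[["mean"]] - best_xg[["mean"]]

# Standard error of the difference, assuming independent estimates
gap_se <- sqrt(best_nn[["std_err"]]^2 + best_xg[["std_err"]]^2)

# How many standard errors apart the two models are
gap_in_se <- auc_gap / gap_se

round(c(gap = auc_gap, se = gap_se, ratio = gap_in_se), 3)
```

The gap works out to roughly four standard errors, so the neural network’s edge looks larger than fold-to-fold noise alone would explain. A paired comparison on the per-fold metrics would be the more careful check.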
By exploring both of these methods with cross-validation and hyperparameter tuning, we’ve validated our results from two different angles.
The neural network scored higher, so let’s proceed with its best configuration for predicting on the new prospects. Let’s make the final fit with these hyperparameters and all of the training data:
# Select & fit best model
best_mod_nn <- mod_tune_nn |> select_best(metric = "roc_auc")
final_wflw_nn <- mod_wflw_nn |> finalize_workflow(best_mod_nn)
final_fit_nn <- fit(final_wflw_nn, data = mbb_imputed)
Predict for Prospects
Let’s now predict a position for each incoming prospect:
mbb_predictions <- predict(
final_fit_nn,
mbb_prospects |> select(-draft_projection) |> mutate(cluster_abbrev = "X")
)
And here we have our predictions with prospects:
classified_prospects <-
mbb_prospects |>
mutate(cluster_label = mbb_predictions$.pred_class) |>
select(draft_projection, player_name, cluster_label)
classified_prospects
draft_projection player_name cluster_label
1 1 Cooper Flagg Versatile Connector
2 2 Dylan Harper Perimeter Engine
3 3 Ace Bailey Versatile Connector
4 4 Egor Demin Perimeter Anchor
5 5 Tre Johnson Perimeter Finisher
6 6 VJ Edgecombe Versatile Anchor
7 7 Kasparas Jakucionis Perimeter Anchor
8 8 Khaman Maluach Interior Connector
9 9 Liam McNeeley Perimeter Finisher
10 10 Asa Newell Versatile Finisher
11 11 Kon Knueppel Perimeter Finisher
12 13 Collin Murray-Boyles Interior Connector
13 15 Derik Queen Interior Connector
14 18 Boogie Fland Perimeter Anchor
15 19 Nique Clifford Versatile Connector
16 20 Alex Karaban Perimeter Finisher
17 21 Will Riley Perimeter Finisher
18 22 Labaron Philon Versatile Anchor
19 23 Hunter Sallis Perimeter Finisher
20 24 KJ Lewis Versatile Connector
21 25 Drake Powell Versatile Connector
22 26 Ian Jackson Perimeter Finisher
23 27 Kanon Catchings Versatile Connector
24 28 Carter Bryant Versatile Connector
25 29 Jalil Bethea Perimeter Finisher
26 30 Mackenzie Mgbako Versatile Connector
Let’s take a peek at the distribution:
classified_prospects |>
count(cluster_label) |>
arrange(desc(n))
cluster_label n
1 Perimeter Finisher 8
2 Versatile Connector 8
3 Interior Connector 3
4 Perimeter Anchor 3
5 Versatile Anchor 2
6 Perimeter Engine 1
7 Versatile Finisher 1
This first round features a lot of Perimeter Finishers and Versatile Connectors (8 prospects each). This would be extremely helpful information as teams look to prioritize prospects during the scouting season and into the draft process of scheduling workouts, interviews, and ultimately selecting a prospect.