PCA & Clustering | Alzheimer’s disease data

Leveraging principal component analysis and clustering as exploratory analysis techniques in Alzheimer’s disease attributes.
Author

Adam Bushman

Published

November 19, 2024

Assignment Questions

Name

What is your name? Include all team members if submitting as a group.

Adam Bushman [u6049169]; no group members.

Perspective

From what perspective are you conducting the analysis? (Who are you? / Who are you working for?)

I am a statistician working with a group of researchers dedicated to improving early detection of Alzheimer’s disease. Despite there being no cure for the disease, early detection and diagnosis are critical so that proper treatment can slow and mitigate the onset of symptoms.

Question

What is your question?

Our research group has collected data from 333 individuals, some with cognitive impairment and others who are perfectly healthy. These attributes (columns) range from demographics to protein measurements to dementia scores.

The key question is:

What natural levels of Alzheimer’s risk/exposure can be isolated from the available data?

This would be a valuable insight since detection priority, therapy, and medications could be tailored uniquely to each level.

Data set

Describe your dataset(s) including URL (if available)

Data sourced from Posit via their {modeldata} package, accessed November 14th, 2024. Named Alzheimer’s disease data, the data were originally sourced from Kuhn, M., Johnson, K. (2013) Applied Predictive Modeling, Springer, which derived observations from a clinical study of 333 patients. Citation:

Craig-Schapiro R, Kuhn M, Xiong C, Pickering EH, Liu J, Misko TP, et al. (2011) Multiplexed Immunoassay Panel Identifies Novel CSF Biomarkers for Alzheimer’s Disease Diagnosis and Prognosis. PLoS ONE 6(4): e18850.

Variables

What are your variables? Include variable type (binary, categorical, numeric). If you have many variables, you can list the most important and summarize the rest (e.g. important variables are… also, 12 other binary, 5 categorical…).

This data set features 131 columns: two factors and the rest numeric.

One is a natural dependent variable, Class, indicating if the patient is symptomatic of cognitive impairment or healthy. Since this exercise is not concerned with predictive modeling, we’ll largely ignore this variable. All remaining variables are independent.

It’s not clear which are most important for clustering patients at risk of or impaired with Alzheimer’s.

For a complete list of each feature and their data types, navigate to the feature summary.

Model reasonability

How are your variables suitable for your analysis method?

The variables are suitable for the analysis since 1) there’s high dimensionality, where PCA can really shine, and 2) it’s not immediately clear which relationships work together for segmenting patients.

Because there’s an inherent “ground-truth” class that is binary, it may be that no natural number of clusters can be discovered beyond that split, given the problem chosen.

Conclusions

What are your conclusions (include references to one or two CLEARLY INDICATED AND IMPORTANT graphs or tables in your output)?

Initial clustering of Alzheimer’s attributes across patients performed poorly, with average silhouette widths centering near zero for both the hierarchical and partition-based approaches.

PCA appeared to be very successful at reducing the dimensionality. With just the first 5 principal components, we were able to capture 50% of the variance. With fewer than half of the principal components, 90% of the variance was captured.

Looking at the “ground-truth” variable Class in the context of PCA, it was clear the impaired vs control groups are not easily differentiated, even with the first two principal components.

The final clustering with PCA results yielded even worse cluster “tightness” scores than before. This is counter-intuitive and perhaps indicative of some failed assumptions made on the data set itself (see below).

Assumptions

What are your assumptions and limitations? Did you include any robustness checks?

A clear assumption was that impaired and control patients could be differentiated using this data set. It’s possible, and honestly likely, that they are not reliably segmented using the attribute values included here.

Another assumption is that more than two clusters may be identified in the data. This is unsubstantiated since I hadn’t the time to thoroughly test the number of clusters against the Hubert or D-index values. The fact that PCA worsened the performance of the clustering later on suggests that “3” is not a natural number of clusters.

Assignment Workflow

Analysis Prep

Loading packages

library('tidyverse')        # Wrapper for many convenient libraries
library('modeldata')        # Contains data for the assignment
library('skimr')            # Quickly summarise data
library('gt')               # Render "great tables"

library('kernlab')          # Weighted kernel k-means
library('dataPreparation')  # Utilities for PCA prep
library('dendextend')       # Working with dendrograms
library('cluster')          # Generating silhouette measures

Loading the data

We’ll start off by referencing the “Alzheimer’s Data” for the assignment from the {modeldata} package.

ad_raw <- modeldata::ad_data        # Data for the assignment

With it loaded into the session, let’s get a sense for what we’re working with.

Data set inspection

Normally, I like to get acquainted with a data set. That means understanding what each column seeks to describe, confirming the granularity of the rows, and getting my arms around structure, completeness, distribution, etc. We’ll certainly do some of that, but this data set has over 130 columns.

Posit, the company behind {modeldata}, included the following summary of the feature classes captured in the dataset:

  • Demographic characteristics such as age and gender
  • Apolipoprotein E genotype
  • Protein measurements of Abeta, Tau, and a phosphorylated version of Tau (called pTau)
  • Protein measurements of 124 exploratory biomarkers, and
  • Clinical dementia scores

Using the {skimr} package, we can get a comprehensive summary of the data.

:::{#feature-summary}

skim(ad_raw)
Data summary
Name ad_raw
Number of rows 333
Number of columns 131
_______________________
Column type frequency:
factor 2
numeric 129
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Genotype 0 1 FALSE 6 E3E: 167, E3E: 106, E2E: 37, E4E: 13
Class 0 1 FALSE 2 Con: 242, Imp: 91

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ACE_CD143_Angiotensin_Converti 0 1 1.32 0.55 -0.68 0.95 1.30 1.72 3.09 ▁▃▇▅▁
ACTH_Adrenocorticotropic_Hormon 0 1 -1.54 0.27 -2.21 -1.71 -1.56 -1.35 -0.80 ▂▆▇▅▁
AXL 0 1 0.30 0.45 -0.92 -0.01 0.28 0.61 1.52 ▁▅▇▃▁
Adiponectin 0 1 -5.22 0.67 -7.06 -5.68 -5.22 -4.83 -3.47 ▁▅▇▅▁
Alpha_1_Antichymotrypsin 0 1 1.35 0.36 0.18 1.10 1.36 1.59 2.30 ▁▃▇▆▂
Alpha_1_Antitrypsin 0 1 -13.14 1.55 -18.17 -14.14 -13.10 -12.13 -8.19 ▁▃▇▃▁
Alpha_1_Microglobulin 0 1 -2.94 0.48 -4.34 -3.27 -2.96 -2.62 -1.77 ▁▅▇▇▂
Alpha_2_Macroglobulin 0 1 -159.46 40.92 -289.68 -186.64 -160.01 -136.53 -50.17 ▁▃▇▅▂
Angiopoietin_2_ANG_2 0 1 0.66 0.33 -0.54 0.47 0.64 0.88 1.77 ▁▂▇▃▁
Angiotensinogen 0 1 2.31 0.25 1.71 2.12 2.31 2.48 2.88 ▂▆▇▇▂
Apolipoprotein_A_IV 0 1 -1.86 0.38 -2.96 -2.12 -1.83 -1.61 -0.78 ▁▅▇▅▁
Apolipoprotein_A1 0 1 -7.48 0.44 -8.68 -7.78 -7.47 -7.20 -6.17 ▁▅▇▃▁
Apolipoprotein_A2 0 1 -0.65 0.49 -1.97 -0.97 -0.69 -0.33 0.96 ▁▇▇▃▁
Apolipoprotein_B 0 1 -5.59 1.45 -9.94 -6.63 -5.70 -4.58 -2.15 ▁▅▇▇▂
Apolipoprotein_CI 0 1 -1.59 0.42 -3.32 -1.83 -1.61 -1.35 -0.27 ▁▁▇▅▁
Apolipoprotein_CIII 0 1 -2.50 0.44 -3.86 -2.78 -2.54 -2.21 -1.24 ▁▃▇▃▁
Apolipoprotein_D 0 1 1.43 0.35 0.26 1.19 1.41 1.69 2.64 ▁▅▇▃▁
Apolipoprotein_E 0 1 2.79 0.80 0.59 2.30 2.82 3.27 5.44 ▁▅▇▂▁
Apolipoprotein_H 0 1 -0.32 0.38 -2.23 -0.58 -0.36 -0.08 0.93 ▁▁▇▆▁
B_Lymphocyte_Chemoattractant_BL 0 1 1.99 0.57 0.73 1.67 1.98 2.37 4.02 ▃▇▇▁▁
BMP_6 0 1 -1.92 0.32 -2.76 -2.15 -1.89 -1.68 -0.82 ▁▅▇▂▁
Beta_2_Microglobulin 0 1 0.17 0.30 -0.54 -0.05 0.18 0.34 0.99 ▂▅▇▃▁
Betacellulin 0 1 51.35 10.82 10.00 43.00 51.00 59.00 82.00 ▁▁▇▅▁
C_Reactive_Protein 0 1 -5.90 1.19 -8.52 -6.65 -5.91 -5.10 -2.94 ▃▆▇▅▂
CD40 0 1 -1.26 0.22 -1.94 -1.38 -1.27 -1.12 -0.55 ▁▃▇▃▁
CD5L 0 1 -0.06 0.46 -1.97 -0.36 -0.06 0.26 1.16 ▁▁▇▇▂
Calbindin 0 1 22.25 4.20 10.81 19.73 22.08 24.68 35.36 ▁▅▇▃▁
Calcitonin 0 1 1.69 0.88 -0.71 1.10 1.65 2.28 4.11 ▁▆▇▅▁
CgA 0 1 330.71 82.13 135.60 278.02 328.81 386.49 535.40 ▂▅▇▅▂
Clusterin_Apo_J 0 1 2.88 0.30 1.87 2.71 2.89 3.04 3.76 ▁▂▇▃▁
Complement_3 0 1 -15.67 2.49 -23.39 -17.57 -15.71 -13.88 -9.56 ▁▆▇▇▂
Complement_Factor_H 0 1 3.52 1.25 -0.84 2.75 3.57 4.25 7.62 ▁▃▇▅▁
Connective_Tissue_Growth_Factor 0 1 0.77 0.21 0.10 0.64 0.79 0.92 1.41 ▁▃▇▃▁
Cortisol 0 1 11.68 3.97 0.10 9.50 12.00 14.00 29.00 ▁▇▇▁▁
Creatine_Kinase_MB 0 1 -1.67 0.09 -1.87 -1.72 -1.67 -1.61 -1.38 ▂▅▇▂▁
Cystatin_C 0 1 8.58 0.40 7.43 8.32 8.56 8.84 9.69 ▁▅▇▅▁
EGF_R 0 1 -0.70 0.23 -1.36 -0.86 -0.68 -0.55 0.19 ▁▆▇▂▁
EN_RAGE 0 1 -3.63 0.90 -8.38 -4.20 -3.65 -3.15 -0.39 ▁▁▇▆▁
ENA_78 0 1 -1.37 0.01 -1.41 -1.38 -1.37 -1.36 -1.34 ▁▅▇▅▁
Eotaxin_3 0 1 57.65 16.36 7.00 44.00 57.00 70.00 107.00 ▁▆▇▅▁
FAS 0 1 -0.53 0.30 -1.51 -0.71 -0.53 -0.31 0.34 ▁▃▇▆▁
FSH_Follicle_Stimulation_Hormon 0 1 -1.13 0.40 -2.12 -1.43 -1.11 -0.86 0.10 ▂▆▇▃▁
Fas_Ligand 0 1 2.90 1.08 -0.15 2.28 2.85 3.58 7.63 ▂▇▇▁▁
Fatty_Acid_Binding_Protein 0 1 1.34 0.78 -1.04 0.80 1.31 1.90 3.71 ▁▅▇▅▁
Ferritin 0 1 2.75 0.78 0.61 2.24 2.73 3.25 4.93 ▁▃▇▅▁
Fetuin_A 0 1 1.34 0.38 0.47 1.06 1.31 1.61 2.25 ▂▆▇▅▂
Fibrinogen 0 1 -7.36 0.58 -9.37 -7.73 -7.32 -7.01 -5.84 ▁▂▇▆▂
GRO_alpha 0 1 1.38 0.04 1.27 1.35 1.38 1.41 1.51 ▂▆▇▂▁
Gamma_Interferon_induced_Monokin 0 1 2.78 0.11 2.39 2.70 2.78 2.86 3.07 ▁▂▇▆▂
Glutathione_S_Transferase_alpha 0 1 0.95 0.16 0.52 0.84 0.97 1.03 1.32 ▁▃▇▆▂
HB_EGF 0 1 6.84 1.49 2.10 5.79 6.70 7.75 10.70 ▁▃▇▆▂
HCC_4 0 1 -3.51 0.36 -4.51 -3.73 -3.54 -3.27 -2.12 ▁▇▇▂▁
Hepatocyte_Growth_Factor_HGF 0 1 0.19 0.30 -0.63 0.00 0.18 0.41 1.10 ▁▅▇▅▁
I_309 0 1 2.95 0.37 1.76 2.71 2.94 3.18 4.14 ▁▅▇▃▁
ICAM_1 0 1 -0.59 0.33 -1.53 -0.80 -0.59 -0.38 0.52 ▁▅▇▃▁
IGF_BP_2 0 1 5.31 0.21 4.63 5.16 5.30 5.44 5.95 ▁▅▇▅▁
IL_11 0 1 4.71 1.42 1.75 3.71 4.81 5.71 8.69 ▃▆▇▃▁
IL_13 0 1 1.28 0.01 1.23 1.27 1.28 1.29 1.32 ▁▁▇▆▁
IL_16 0 1 2.91 0.67 0.96 2.46 2.88 3.35 4.94 ▁▅▇▅▁
IL_17E 0 1 4.84 1.33 1.05 4.15 4.75 5.63 8.95 ▁▆▇▃▁
IL_1alpha 0 1 -7.52 0.40 -8.52 -7.82 -7.54 -7.26 -5.95 ▂▇▇▂▁
IL_3 0 1 -3.95 0.51 -5.91 -4.27 -3.91 -3.61 -2.45 ▁▂▇▆▁
IL_4 0 1 1.77 0.51 0.53 1.46 1.81 2.14 3.04 ▂▆▇▅▂
IL_5 0 1 0.19 0.45 -1.43 -0.09 0.18 0.47 1.95 ▁▃▇▂▁
IL_6 0 1 -0.13 0.55 -1.53 -0.41 -0.09 0.19 1.81 ▂▆▇▂▁
IL_6_Receptor 0 1 0.09 0.32 -0.75 -0.15 0.10 0.35 0.83 ▁▅▇▆▂
IL_7 0 1 2.90 1.02 0.56 2.15 2.92 3.71 5.71 ▃▇▇▆▁
IL_8 0 1 1.70 0.04 1.57 1.68 1.71 1.73 1.84 ▁▃▇▃▁
IP_10_Inducible_Protein_10 0 1 5.73 0.51 4.26 5.38 5.72 6.05 7.50 ▁▆▇▂▁
IgA 0 1 -6.11 0.74 -10.52 -6.65 -6.12 -5.60 -4.20 ▁▁▂▇▂
Insulin 0 1 -1.23 0.34 -2.17 -1.45 -1.25 -1.03 -0.16 ▁▃▇▂▁
Kidney_Injury_Molecule_1_KIM_1 0 1 -1.19 0.03 -1.26 -1.20 -1.18 -1.16 -1.10 ▂▆▇▅▁
LOX_1 0 1 1.27 0.41 0.00 1.03 1.25 1.53 2.40 ▁▃▇▃▁
Leptin 0 1 -1.49 0.27 -2.15 -1.68 -1.47 -1.30 -0.62 ▂▇▇▂▁
Lipoprotein_a 0 1 -4.44 1.07 -6.81 -5.28 -4.63 -3.58 -1.39 ▂▇▅▅▁
MCP_1 0 1 6.49 0.26 5.83 6.32 6.49 6.67 7.23 ▂▅▇▅▁
MCP_2 0 1 1.86 0.64 0.40 1.53 1.85 2.18 4.02 ▂▆▇▂▁
MIF 0 1 -1.88 0.33 -2.85 -2.12 -1.90 -1.66 -0.84 ▁▃▇▃▁
MIP_1alpha 0 1 4.02 1.01 0.93 3.27 3.96 4.69 6.80 ▁▅▇▅▂
MIP_1beta 0 1 2.81 0.38 1.92 2.56 2.83 3.04 4.01 ▂▆▇▃▁
MMP_2 0 1 2.91 0.94 0.10 2.33 2.97 3.55 6.10 ▁▅▇▃▁
MMP_3 0 1 -2.45 0.58 -4.42 -2.78 -2.48 -2.12 -0.53 ▁▂▇▃▁
MMP10 0 1 -3.64 0.43 -4.95 -3.96 -3.61 -3.35 -2.21 ▁▆▇▃▁
MMP7 0 1 -3.83 1.55 -8.40 -4.90 -3.77 -2.85 -0.20 ▁▃▇▅▂
Myoglobin 0 1 -1.38 0.94 -3.30 -2.04 -1.47 -0.78 1.77 ▃▇▆▂▁
NT_proBNP 0 1 4.54 0.38 3.18 4.30 4.54 4.78 5.89 ▁▂▇▃▁
NrCAM 0 1 4.35 0.57 2.64 3.97 4.37 4.74 6.01 ▁▃▇▅▁
Osteopontin 0 1 5.20 0.39 4.08 4.95 5.19 5.44 6.32 ▁▅▇▃▁
PAI_1 0 1 0.06 0.42 -0.99 -0.19 0.09 0.32 1.17 ▂▅▇▅▁
PAPP_A 0 1 -2.85 0.14 -3.31 -2.97 -2.87 -2.74 -2.49 ▁▂▇▆▂
PLGF 0 1 3.91 0.40 2.48 3.64 3.87 4.19 5.17 ▁▃▇▅▁
PYY 0 1 3.01 0.29 2.19 2.83 3.00 3.18 3.93 ▁▅▇▂▁
Pancreatic_polypeptide 0 1 -0.01 0.72 -2.12 -0.51 -0.01 0.53 1.93 ▁▅▇▆▁
Prolactin 0 1 0.05 0.30 -1.31 -0.14 0.00 0.25 0.99 ▁▁▇▇▁
Prostatic_Acid_Phosphatase 0 1 -1.69 0.06 -1.93 -1.72 -1.69 -1.65 -1.42 ▁▂▇▂▁
Protein_S 0 1 -2.25 0.36 -3.34 -2.46 -2.26 -2.00 -1.22 ▁▃▇▃▁
Pulmonary_and_Activation_Regulat 0 1 -1.49 0.45 -2.51 -1.83 -1.51 -1.17 -0.27 ▂▇▇▃▁
RANTES 0 1 -6.52 0.32 -7.24 -6.73 -6.50 -6.32 -5.55 ▃▇▇▃▁
Resistin 0 1 -17.76 5.69 -34.97 -21.47 -17.47 -13.81 -2.24 ▁▃▇▆▁
S100b 0 1 1.24 0.35 0.19 1.00 1.21 1.48 2.37 ▁▆▇▅▁
SGOT 0 1 -0.42 0.35 -1.90 -0.63 -0.40 -0.20 0.74 ▁▂▇▅▁
SHBG 0 1 -2.52 0.57 -4.14 -2.88 -2.51 -2.21 -1.11 ▁▃▇▆▂
SOD 0 1 5.33 0.38 4.32 5.08 5.35 5.58 6.46 ▁▅▇▃▁
Serum_Amyloid_P 0 1 -6.03 0.55 -7.51 -6.44 -6.03 -5.65 -4.65 ▁▅▇▅▂
Sortilin 0 1 3.84 0.87 1.51 3.28 3.87 4.37 6.23 ▁▅▇▃▁
Stem_Cell_Factor 0 1 3.29 0.36 2.22 3.04 3.30 3.53 4.28 ▁▃▇▅▁
TGF_alpha 0 1 9.80 1.28 6.84 8.86 9.83 10.70 13.83 ▂▆▇▂▁
TIMP_1 0 1 11.70 1.85 1.74 10.49 11.56 12.70 18.88 ▁▁▇▅▁
TNF_RII 0 1 -0.60 0.34 -1.66 -0.82 -0.60 -0.36 0.47 ▁▅▇▃▁
TRAIL_R3 0 1 -0.55 0.24 -1.31 -0.70 -0.55 -0.38 0.27 ▁▅▇▃▁
TTR_prealbumin 0 1 2.85 0.14 2.48 2.77 2.83 2.94 3.33 ▂▅▇▃▁
Tamm_Horsfall_Protein_THP 0 1 -3.12 0.03 -3.21 -3.14 -3.12 -3.10 -2.99 ▂▇▇▂▁
Thrombomodulin 0 1 -1.51 0.23 -2.05 -1.68 -1.49 -1.34 -0.82 ▂▇▇▅▁
Thrombopoietin 0 1 -0.75 0.24 -1.54 -0.89 -0.74 -0.63 0.10 ▁▅▇▂▁
Thymus_Expressed_Chemokine_TECK 0 1 3.83 0.79 1.51 3.34 3.81 4.32 6.23 ▁▅▇▃▁
Thyroid_Stimulating_Hormone 0 1 -4.44 0.77 -6.19 -4.89 -4.42 -3.96 -1.71 ▂▇▇▂▁
Thyroxine_Binding_Globulin 0 1 -1.48 0.40 -2.48 -1.77 -1.51 -1.24 -0.21 ▂▆▇▃▁
Tissue_Factor 0 1 1.16 0.51 -0.21 0.79 1.19 1.48 2.71 ▁▅▇▃▁
Transferrin 0 1 2.91 0.28 1.93 2.71 2.89 3.09 3.76 ▁▃▇▅▁
Trefoil_Factor_3_TFF3 0 1 -3.89 0.34 -4.91 -4.14 -3.91 -3.69 -2.96 ▁▅▇▆▁
VCAM_1 0 1 2.68 0.32 1.72 2.48 2.71 2.89 3.69 ▁▃▇▃▁
VEGF 0 1 16.93 1.87 11.83 15.67 17.08 18.10 22.38 ▁▅▇▃▁
Vitronectin 0 1 -0.28 0.33 -1.43 -0.51 -0.29 -0.04 0.53 ▁▂▇▇▂
von_Willebrand_Factor 0 1 -3.93 0.38 -4.99 -4.20 -3.91 -3.65 -2.96 ▁▅▇▆▂
age 0 1 0.99 0.00 0.98 0.99 0.99 0.99 0.99 ▅▇▇▆▁
tau 0 1 5.78 0.56 4.54 5.37 5.75 6.18 7.17 ▃▇▇▆▂
p_tau 0 1 4.04 0.46 2.96 3.72 4.04 4.36 5.47 ▂▇▇▃▁
Ab_42 0 1 12.41 1.57 8.88 11.17 12.45 13.61 15.96 ▃▆▇▇▂
male 0 1 0.39 0.49 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅

:::


Initial observations include:

  • 333 rows represent the number of patients in the clinical study
  • We have two factor variables: Genotype and Class
    • In theory, the male variable could be cast as a factor
  • We then have over 100 variables that are all numeric and all complete. So far, we haven’t a clear understanding of the interplay and importance of these data points

Exploratory Data Analysis

Our goal is to learn about the interplay and importance of the various features in this data set. We’ll use and combine principal component and clustering analysis techniques.

We’ll follow a typical implementation, where we cluster the raw data, perform PCA, and then implement the same clustering analysis on the PCA output. We’ll then determine where the most meaningful interpretation is derived.

Data Prep

In both clustering and PCA, it’s important to have properly scaled variables. This ensures each variable is “on the same level” and that no variable dominates simply because of its magnitude. The {dataPreparation} package helps with this tremendously.

ad_numeric <- ad_raw |> select(where(is.numeric))
scale_obj <- build_scales(data_set = ad_numeric)
[1] "build_scales: I will compute scale on  129 numeric columns."
[1] "build_scales: it took me: 0.02s to compute scale for 129 numeric columns."
ad_scaled <- fast_scale(data_set = ad_numeric, scales = scale_obj, verbose = TRUE)
[1] "fast_scale: I will scale 129 numeric columns."
[1] "fast_scale: it took me: 0s to scale 129 numeric columns."
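As a quick sanity check, scaled columns should come out with mean ~0 and standard deviation ~1. A minimal sketch of the check using base R’s `scale()` on a stand-in matrix (mtcars here; `ad_scaled` in our workflow):

```r
# Stand-in demonstration: scale() centers and standardizes, so every
# column should end with mean ~0 and standard deviation ~1
m <- scale(as.matrix(mtcars))          # in our workflow: as.matrix(ad_scaled)

all(abs(colMeans(m)) < 1e-8)           # TRUE: means are (numerically) zero
all(abs(apply(m, 2, sd) - 1) < 1e-8)   # TRUE: standard deviations are one
```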

With these variables properly processed, we can begin our analysis work.

Clustering (phase 1)

Let’s experiment with both hierarchical and partition-based methods for clustering. This way, we can compare results and see how PCA impacts each individually.

Hierarchical Clustering

We’ll use a standard Euclidean distance function and the ward.D2 method. We can then “cut” the resulting dendrogram to get the desired number of clusters.

d <- dist(ad_scaled, method = "euclidean")
fit_h <- hclust(d, method = "ward.D2")

class_h <- cutree(fit_h, 3)

We generated 3 clusters. There are several ways to quantify the similarity of observations to their cluster. One is “average silhouette width”, where values near 1 indicate a tight cluster.

sil_h <- silhouette(class_h, d)
avg_sil_w_h <- mean(sil_h[,3])

avg_sil_w_h
[1] 0.09295359
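Rather than committing to three clusters up front, a quick robustness check is to scan several values of k and compare average silhouette widths. A hedged sketch of the idea on a stand-in matrix (mtcars; in our workflow, substitute `ad_scaled`, `d`, and `fit_h` from above):

```r
# Illustrative scan of k = 2..6: average silhouette width per candidate k
# (stand-in data; swap in ad_scaled / d / fit_h for the real analysis)
X <- scale(as.matrix(mtcars))
d_chk <- dist(X, method = "euclidean")
fit_chk <- hclust(d_chk, method = "ward.D2")

sil_by_k <- sapply(2:6, function(k) {
    cl <- cutree(fit_chk, k)
    mean(cluster::silhouette(cl, d_chk)[, 3])
})
names(sil_by_k) <- 2:6
sil_by_k   # the k with the highest average width is the most natural cut
```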

Partition Clustering

Let’s now cluster using a partition method: kernel k-means.

kk_fit <- kkmeans(as.matrix(ad_scaled), centers = 3)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 

We’ll then calculate the average silhouette width. We can’t use the same distance measure because we’re in a kernel space that differs from Euclidean distance. Let’s generate a new function to create a distance matrix suitable for ‘silhouette()’:

get_kernel_dist <- function(data) {
    # Kernel matrix using an RBF kernel; note sigma is fixed at 0.1 here,
    # whereas kkmeans() estimated its own sigma automatically
    kernel_matrix <- kernelMatrix(rbfdot(sigma = 0.1), data)

    # Convert kernel similarities to a pseudo-distance:
    # d(i, j) = sqrt(K(i, i) + K(j, j) - 2 * K(i, j))
    pseudo_dist_matrix <- as.dist(
        sqrt(outer(diag(kernel_matrix), diag(kernel_matrix), "+") - 2 * kernel_matrix)
    )

    return(pseudo_dist_matrix)
}
d_p <- get_kernel_dist(as.matrix(ad_scaled))
sil_p <- silhouette(kk_fit, d_p)
avg_sil_w_p <- mean(sil_p[,3])

avg_sil_w_p
[1] 4.135044e-07

Neither clustering algorithm performs well: the hierarchical approach yields ~0.09 and the kernel k-means approach effectively 0. On a scale of -1 to +1, these algorithms aren’t finding three tight clusters.
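Since we do hold a “ground-truth” Class variable, another cheap diagnostic is to cross-tabulate cluster labels against it. A sketch on stand-in data (iris; our analog would be `table(class_h, ad_raw$Class)`):

```r
# Stand-in illustration: cross-tabulate hierarchical cluster labels
# against a known grouping to see whether the clusters track it
X <- scale(as.matrix(iris[, 1:4]))
cl <- cutree(hclust(dist(X), method = "ward.D2"), 3)
table(cl, iris$Species)   # in our case: table(class_h, ad_raw$Class)
```

A cluster-by-class table that concentrates counts on a diagonal-like pattern suggests the clusters recover the ground truth; a flat table suggests they do not.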

Let’s perform PCA and then retry.

Principal Component Analysis

Calculation

We’ll use the scaled data, calculate covariance, and then use eigenvalues & eigenvectors to generate the variance explained by each principal component.

ad_cov <- cov(ad_scaled)
ad_eig <- eigen(ad_cov)

ad_eig_val <- ad_eig$values
ad_eig_vec <- ad_eig$vectors
var_expl <- round(ad_eig_val / sum(ad_eig_val), 3)
cum_var_expl <- cumsum(var_expl)
pca_results <- data.frame(
    var = cum_var_expl, 
    idx = 1:length(cum_var_expl)
)

thresh <- c(0.5, 0.75, 0.9, 0.95)
idx <- sapply(thresh, function(t) which.min(abs(cum_var_expl - t)))

pca_thresh <- data.frame(thresh, idx)
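As a cross-check on the eigen-decomposition route, base R’s `prcomp()` should produce an identical variance-explained profile, since its `sdev^2` values are the covariance eigenvalues. A sketch on a stand-in matrix (mtcars; in our workflow, `ad_scaled`):

```r
# Normalized eigenvalues of the covariance matrix and prcomp()'s
# normalized sdev^2 describe the same variance, so they should match
X <- scale(as.matrix(mtcars))                 # stand-in for ad_scaled
ev <- eigen(cov(X))$values
pr <- prcomp(X, center = FALSE, scale. = FALSE)

max(abs(ev / sum(ev) - pr$sdev^2 / sum(pr$sdev^2)))   # effectively zero
```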

Results Plots

ggplot() +
    geom_area(aes(x = idx, y = var), pca_results, fill = "#E2E6E6") +
    geom_vline(aes(xintercept = idx), pca_thresh, color = "#BE0000") +
    geom_label(
        aes(
            x = idx, y = thresh, 
            label = stringr::str_wrap(
                paste("First", idx, "of", length(cum_var_expl), "principal components explain", thresh * 100, "% of overall variance"), 20
            )
        ), 
        pca_thresh, 
        size = 2, 
        color = "#BE0000"
    ) +
    scale_y_continuous(expand = expansion(mult = c(0,.05))) +
    labs(
        title = "Cumulative variance explained by first X principal component(s)", 
        subtitle = paste0("A summary of PCA results compared to original column dimensions (", length(cum_var_expl), ")"), 
        x = "Principal Component Index", 
        y = "% of Variance Explained"
    ) +
    theme_minimal()

We could also generate plots comparing the Alzheimer’s class against the first two principal components and the final two. Let’s compute all principal component scores first and then organize the relevant ones into a data frame we can easily plot.

pca_scores <- as.matrix(ad_scaled) %*% ad_eig_vec   # project data onto the eigenvectors
pca_groups <- tibble(
    class = ad_raw$Class, 
    pc_1 = unlist(pca_scores[,1], use.names = FALSE), 
    pc_2 = unlist(pca_scores[,2], use.names = FALSE), 
    pc_114 = unlist(pca_scores[,114], use.names = FALSE), 
    pc_115 = unlist(pca_scores[,115], use.names = FALSE)
)

Let’s now generate the graphs. Hopefully, there’s some visual separation by class. Additionally, we would expect more distinct separation between classes in the 1st/2nd principal components than in the 114th/115th.

ggplot(pca_groups) +
    geom_point(
        aes(pc_1, pc_2, color = class)
    ) +
    labs(
        title = "Comparison of principal components by `Class`", 
        subtitle = "Among 1st and 2nd principal components", 
        x = paste0("Dimension 1 (", var_expl[1] * 100, "%)"), 
        y = paste0("Dimension 2 (", var_expl[2] * 100, "%)")
    ) +
    theme_minimal()

ggplot(pca_groups) +
    geom_point(
        aes(pc_114, pc_115, color = class)
    ) +
    labs(
        title = "Comparison of principal components by `Class`", 
        subtitle = "Among 114th and 115th principal components", 
        x = paste0("Dimension 114 (", var_expl[114] * 100, "%)"), 
        y = paste0("Dimension 115 (", var_expl[115] * 100, "%)")
    ) +
    theme_minimal()

The plots look very similar. Even with this dimensionality reduction and using the “ground-truth” class, we’re not seeing a lot of power here. That’s likely to limit our ability to improve clustering in phase 2.

Clustering (phase 2)

Let’s use the PCA values and perform the same clustering from before. We’re looking for values better than ~0.09.

Hierarchical

d2 <- dist(pca_scores, method = "euclidean")
fit_h2 <- hclust(d2, method = "ward.D2")
class_h2 <- cutree(fit_h2, 3)

sil_h2 <- silhouette(class_h2, d2)
avg_sil_w_h2 <- mean(sil_h2[,3])

avg_sil_w_h2
[1] -0.01147923

Partition Clustering

kk_fit2 <- kkmeans(as.matrix(pca_scores), centers = 3)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
d_p2 <- get_kernel_dist(as.matrix(pca_scores))
sil_p2 <- silhouette(kk_fit2, d_p2)
avg_sil_w_p2 <- mean(sil_p2[,3])

avg_sil_w_p2
[1] -0.01459418

Results

We’re seeing worse results in clustering after PCA. This is counter-intuitive, as PCA should reduce the noise to which clustering may be susceptible.

It is possible that “3” is simply not the right number of clusters, and PCA reinforces this takeaway.