Executive Summary

This analysis explores whether unsupervised learning techniques can identify ideal candidates for Michigan Medicine’s Cancer Clinical Trial.

OBJECTIVE:

The goal is to correctly identify patients who are very likely to have cancer while minimizing False Positives. In other words, we aim to ensure that any patient flagged by our methodology as having cancer truly has cancer, maximizing Precision. At the same time, the method must identify at least seven True Positive patients in the dataset to meet the quorum requirement for the clinical trial.

METHODS:

Principal Components Analysis (PCA) was used to summarize the underlying structure of the health indicators and reduce redundancy. Multiple clustering techniques were applied in the reduced feature space to test whether patient groups naturally separate according to cancer status. Outlier detection methods were also used to identify individuals whose health profiles deviate substantially from the population. The true cancer labels were withheld during modeling and introduced only during evaluation.

RECOMMENDATIONS:

Among the methods tested, hierarchical clustering provided the best prediction model, achieving 100% precision and ensuring that all flagged cancer cases were true cases. While other methods provide useful insights, they are less reliable for correctly identifying high-risk patients. These findings support the use of unsupervised learning as a preliminary screening tool to identify ideal candidates for the Michigan Medicine cancer treatment clinical trial, with an emphasis on maximizing precision to minimize incorrect diagnoses.

Introduction

Michigan Medicine is piloting a highly promising but very costly cancer treatment clinical trial. Given the high expense and resource requirements, it is critical to ensure that only patients who are very likely to have cancer are selected for participation. The objective of this analysis is to design an Unsupervised Learning methodology that identifies patients with the highest likelihood of having cancer while minimizing False Positives. In other words, our primary goal is to maximize Precision, defined as the proportion of predicted cancer patients who are actually diagnosed with cancer.

Analysis

Data Overview & Preprocessing

cancer_raw <- read.csv("wbc_clustering.csv")
cancer <- cancer_raw
str(cancer)
## 'data.frame':    378 obs. of  31 variables:
##  $ X1 : num  0.3104 0.2887 0.1194 0.2863 0.0575 ...
##  $ X2 : num  0.1573 0.2029 0.0923 0.2946 0.2411 ...
##  $ X3 : num  0.3018 0.2891 0.1144 0.2683 0.0547 ...
##  $ X4 : num  0.1793 0.1597 0.0553 0.1613 0.0248 ...
##  $ X5 : num  0.408 0.495 0.449 0.336 0.301 ...
##  $ X6 : num  0.1899 0.3301 0.1397 0.0561 0.1228 ...
##  $ X7 : num  0.1561 0.107 0.0693 0.06 0.0372 ...
##  $ X8 : num  0.2376 0.1546 0.1032 0.1453 0.0294 ...
##  $ X9 : num  0.417 0.458 0.381 0.206 0.358 ...
##  $ X10: num  0.162 0.382 0.402 0.183 0.317 ...
##  $ X11: num  0.0574 0.0267 0.06 0.0262 0.0162 ...
##  $ X12: num  0.0947 0.0856 0.1363 0.438 0.1318 ...
##  $ X13: num  0.0613 0.0295 0.0543 0.0195 0.0159 ...
##  $ X14: num  0.0313 0.0147 0.01662 0.01374 0.00262 ...
##  $ X15: num  0.2294 0.081 0.2683 0.0897 0.2466 ...
##  $ X16: num  0.0927 0.1256 0.0906 0.0199 0.1067 ...
##  $ X17: num  0.0603 0.0429 0.0501 0.0339 0.0401 ...
##  $ X18: num  0.249 0.123 0.269 0.22 0.112 ...
##  $ X19: num  0.168 0.125 0.174 0.265 0.251 ...
##  $ X20: num  0.0485 0.0529 0.0716 0.0305 0.0583 ...
##  $ X21: num  0.2554 0.2337 0.0818 0.191 0.0368 ...
##  $ X22: num  0.193 0.226 0.097 0.288 0.265 ...
##  $ X23: num  0.2455 0.2275 0.0733 0.1696 0.0341 ...
##  $ X24: num  0.1293 0.1094 0.0319 0.0887 0.014 ...
##  $ X25: num  0.481 0.396 0.404 0.171 0.387 ...
##  $ X26: num  0.1455 0.2429 0.0849 0.0183 0.1052 ...
##  $ X27: num  0.1909 0.151 0.0708 0.0386 0.055 ...
##  $ X28: num  0.4426 0.2503 0.214 0.1723 0.0881 ...
##  $ X29: num  0.2783 0.3191 0.1745 0.0832 0.3036 ...
##  $ X30: num  0.1151 0.1757 0.1488 0.0436 0.125 ...
##  $ y  : int  0 0 0 0 0 0 0 0 0 0 ...

This dataset contains health data for 378 patients across 30 health variables, plus an outcome variable indicating whether each patient was diagnosed with cancer. For this analysis, the outcome indicator will be removed and used only later to evaluate the prediction methods.

summary(cancer)
##        X1               X2               X3               X4        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1970   1st Qu.:0.1852   1st Qu.:0.1902   1st Qu.:0.1009  
##  Median :0.2510   Median :0.2650   Median :0.2411   Median :0.1361  
##  Mean   :0.2551   Mean   :0.2860   Mean   :0.2482   Mean   :0.1444  
##  3rd Qu.:0.3111   3rd Qu.:0.3543   3rd Qu.:0.3011   3rd Qu.:0.1789  
##  Max.   :0.7681   Max.   :1.0000   Max.   :0.7581   Max.   :0.6475  
##        X5               X6               X7                X8         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.2799   1st Qu.:0.1151   1st Qu.:0.04978   1st Qu.:0.07741  
##  Median :0.3586   Median :0.1755   Median :0.09106   Median :0.12020  
##  Mean   :0.3676   Mean   :0.2008   Mean   :0.12538   Mean   :0.14649  
##  3rd Qu.:0.4466   3rd Qu.:0.2588   3rd Qu.:0.15575   3rd Qu.:0.17662  
##  Max.   :1.0000   Max.   :0.7920   Max.   :0.99906   Max.   :0.90606  
##        X9              X10               X11               X12         
##  Min.   :0.0000   Min.   :0.03981   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.2673   1st Qu.:0.18197   1st Qu.:0.03576   1st Qu.:0.09995  
##  Median :0.3346   Median :0.24663   Median :0.05454   Median :0.16560  
##  Mean   :0.3523   Mean   :0.27615   Mean   :0.06823   Mean   :0.19025  
##  3rd Qu.:0.4301   3rd Qu.:0.33798   3rd Qu.:0.08739   3rd Qu.:0.25030  
##  Max.   :0.8500   Max.   :0.96441   Max.   :0.39960   Max.   :1.00000  
##       X13               X14               X15              X16         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.03369   1st Qu.:0.01634   1st Qu.:0.1190   1st Qu.:0.07115  
##  Median :0.05473   Median :0.02463   Median :0.1635   Median :0.10979  
##  Mean   :0.06402   Mean   :0.03136   Mean   :0.1854   Mean   :0.15027  
##  3rd Qu.:0.08137   3rd Qu.:0.03635   3rd Qu.:0.2291   3rd Qu.:0.19355  
##  Max.   :0.43787   Max.   :0.30482   Max.   :0.6818   Max.   :0.78220  
##       X17               X18              X19               X20         
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.02332   Min.   :0.00000  
##  1st Qu.:0.02915   1st Qu.:0.1230   1st Qu.:0.10677   1st Qu.:0.04191  
##  Median :0.04908   Median :0.1743   Median :0.15609   Median :0.06945  
##  Mean   :0.06829   Mean   :0.1920   Mean   :0.17913   Mean   :0.09652  
##  3rd Qu.:0.08315   3rd Qu.:0.2399   3rd Qu.:0.23133   3rd Qu.:0.11536  
##  Max.   :1.00000   Max.   :1.0000   Max.   :0.75390   Max.   :1.00000  
##       X21              X22              X23              X24         
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.1538   1st Qu.:0.2033   1st Qu.:0.1433   1st Qu.:0.06694  
##  Median :0.1974   Median :0.2933   Median :0.1842   Median :0.09031  
##  Mean   :0.2071   Mean   :0.3181   Mean   :0.1959   Mean   :0.10201  
##  3rd Qu.:0.2508   3rd Qu.:0.4017   3rd Qu.:0.2364   3rd Qu.:0.12366  
##  Max.   :0.8211   Max.   :0.8755   Max.   :0.7789   Max.   :0.67804  
##       X25              X26               X27               X28        
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.2665   1st Qu.:0.08636   1st Qu.:0.06407   1st Qu.:0.1856  
##  Median :0.3651   Median :0.14341   Median :0.11829   Median :0.2709  
##  Mean   :0.3657   Mean   :0.16513   Mean   :0.14885   Mean   :0.2791  
##  3rd Qu.:0.4496   3rd Qu.:0.21183   3rd Qu.:0.19399   3rd Qu.:0.3495  
##  Max.   :0.8547   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##       X29                 X30                 y          
##  Min.   :0.0001971   Min.   :0.001115   Min.   :0.00000  
##  1st Qu.:0.1711512   1st Qu.:0.099403   1st Qu.:0.00000  
##  Median :0.2245220   Median :0.148006   Median :0.00000  
##  Mean   :0.2329205   Mean   :0.168817   Mean   :0.05556  
##  3rd Qu.:0.2892273   3rd Qu.:0.212613   3rd Qu.:0.00000  
##  Max.   :0.6227085   Max.   :1.000000   Max.   :1.00000

A numeric summary of each variable confirms that the data are already on a comparable scale (roughly 0 to 1), so no additional scaling is needed moving into the analysis. When variables differ greatly in scale, those with larger numeric ranges dominate distance calculations and have disproportionate influence on PCA, clustering, and outlier detection. Ensuring all variables are on a comparable scale prevents any single measure from overwhelming the analysis and allows patterns to reflect genuine structure in the data rather than artifacts of measurement units.

Below is a check for missing values in the dataset. If meaningful amounts of data are missing then imputation methods would need to be used to continue analysis. However, as the table below shows, there are no missing values in the dataset.

missing_table <- data.frame(
  Missing = sapply(cancer, function(x) sum(is.na(x)))
)
missing_table
##     Missing
## X1        0
## X2        0
## X3        0
## X4        0
## X5        0
## X6        0
## X7        0
## X8        0
## X9        0
## X10       0
## X11       0
## X12       0
## X13       0
## X14       0
## X15       0
## X16       0
## X17       0
## X18       0
## X19       0
## X20       0
## X21       0
## X22       0
## X23       0
## X24       0
## X25       0
## X26       0
## X27       0
## X28       0
## X29       0
## X30       0
## y         0

Test/Train Data Split

A note on splitting data into Test/Train:

Normally, it is good practice in unsupervised learning to split the data into training and test sets to evaluate model performance on unseen data. Due to the small size of the dataset and the limited number of cancer cases, a standard train/test split would likely result in fewer than seven true positives in the test set, making it impossible to meet the clinical trial quorum requirement. To ensure that the analysis can identify at least seven high-likelihood cancer patients, the full dataset will be used for modeling and evaluation. This approach allows for more reliable identification of potential candidates while still prioritizing Precision, which is the primary objective of this analysis.

Dimensionality Reduction (PCA)

PCA dimension reduction is useful here because it compresses many correlated clinical variables into a smaller set of uncorrelated components that retain most of the original variation. This reduces noise, removes redundancy, and improves the stability of distance-based clustering methods. While this dataset only includes 30 patient health variables, in practical applications there would typically be many more, making the role of PCA even more important for producing reliable and interpretable clustering results.

cancer$y <- NULL
cancer.pca <- prcomp(cancer, center = TRUE, scale. = TRUE)
summary(cancer.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.3582 2.6065 1.70096 1.48309 1.42571 1.09873 0.86315
## Proportion of Variance 0.3759 0.2265 0.09644 0.07332 0.06776 0.04024 0.02483
## Cumulative Proportion  0.3759 0.6024 0.69883 0.77215 0.83990 0.88014 0.90498
##                           PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.6929 0.66264 0.64145 0.5693 0.54596 0.45359 0.43020
## Proportion of Variance 0.0160 0.01464 0.01372 0.0108 0.00994 0.00686 0.00617
## Cumulative Proportion  0.9210 0.93561 0.94933 0.9601 0.97007 0.97693 0.98310
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.31743 0.31151 0.24159 0.20925 0.20575 0.19692 0.16504
## Proportion of Variance 0.00336 0.00323 0.00195 0.00146 0.00141 0.00129 0.00091
## Cumulative Proportion  0.98645 0.98969 0.99163 0.99309 0.99450 0.99580 0.99671
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.16113 0.14315 0.12610 0.12035 0.10033 0.09392 0.04807
## Proportion of Variance 0.00087 0.00068 0.00053 0.00048 0.00034 0.00029 0.00008
## Cumulative Proportion  0.99757 0.99825 0.99878 0.99927 0.99960 0.99990 0.99997
##                           PC29    PC30
## Standard deviation     0.02568 0.01197
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion  1.00000 1.00000

This PCA produced 30 principal components, but using all of them would undermine the goal of dimension reduction. The components must be narrowed to a smaller subset. The two visualizations below help determine how many components should be retained for the analysis.

Because the stakes are high with this data (supporting cancer diagnosis), the first ten components will be used to retain approximately 95% of the total variance, capturing as much meaningful variation between patients as possible.
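Later sections reference a `cancer_pca_10` data frame that is not built in the visible code; presumably it holds the scores on the first ten principal components. Below is a minimal sketch of that step, shown on synthetic data since the chunk that constructs the real object is hidden.

```r
# Hypothetical reconstruction of the hidden step that builds cancer_pca_10:
# keep only the first ten principal-component scores as a data frame.
set.seed(1)
demo <- as.data.frame(matrix(runif(378 * 30), nrow = 378))  # stand-in for the 30 health variables
demo_pca <- prcomp(demo, center = TRUE, scale. = TRUE)
demo_pca_10 <- as.data.frame(demo_pca$x[, 1:10])  # analogous to cancer_pca_10
dim(demo_pca_10)  # 378 rows, 10 columns
```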

Clustering Analysis

Three different methods of clustering will be used and evaluated to predict cancer diagnosis:

  1. K-Means Clustering
  2. Hierarchical Clustering
  3. DBSCAN Clustering

1. K-Means Clustering

K-Means Clustering is an unsupervised learning method that groups patients with similar health characteristics into clusters. The algorithm assigns each patient to one of k clusters based on the similarity of their data, so that patients within a cluster are more similar to each other than to patients in other clusters. By examining these clusters, we can identify patterns in the data that may correspond to higher or lower likelihoods of having cancer, without using the outcome variable during the clustering process.

The following visualizations help determine the optimal number of clusters to use for this method.

wssplot <- fviz_nbclust(cancer_pca_10, kmeans, method = "wss")
gapplot <- fviz_nbclust(cancer_pca_10, kmeans, method = "gap_stat")
silplot <- fviz_nbclust(cancer_pca_10, kmeans, method = "silhouette")
plot_grid(wssplot, gapplot, silplot, nrow = 1)

Based on the visualizations above, 2 clusters appears optimal. However, the first graph on the left shows a prominent “elbow” in the curve at k=4, which could allow for a more precise classification of high- vs. low-risk patients. Moving forward, this method will use 4 clusters.

set.seed(12345)
km <- kmeans(cancer_pca_10, centers = 4)
cancer_pca_10_kmeans <- cancer_pca_10
cancer_pca_10_kmeans$cluster <- km$cluster
cancer_pca_10_kmeans$outcome <- cancer_raw$y  # withheld label, reattached for evaluation

fviz_cluster(km, data = cancer_pca_10_kmeans)

The plot above provides a 2-D visualization of the four clusters. Clusters 2 and 4 are relatively dense, while clusters 1 and 3 span larger areas. Although the clusters appear intertwined in this view, the underlying analysis occurs in a 10-dimensional space. Reducing ten dimensions down to two inevitably compresses and overlaps information, so this visualization cannot fully represent the true separation between clusters.

Below is a visualization of the same clusters colored by cancer diagnosis, with red points signaling a patient with cancer. The majority of red points belong to cluster 3, a couple belong to cluster 2, a single point belongs to cluster 1, and none belong to cluster 4. Based on this visual alone, we can infer that cluster 3 is the high-risk cancer group, cluster 2 a medium-risk group, cluster 1 a low-risk group, and cluster 4 essentially the non-cancer group.

tapply(cancer_pca_10_kmeans$outcome, cancer_pca_10_kmeans$cluster, mean)
##          1          2          3          4 
## 0.02409639 0.03539823 0.78947368 0.00000000

The table above confirms the interpretation from the previous visualization: 78.9% of patients in cluster 3 have cancer, compared with 3.5% in cluster 2, 2.4% in cluster 1, and 0% in cluster 4. Cluster 3 is therefore the high-risk cluster that the prediction model will use to classify individuals as cancer vs. no cancer.

Now, a confusion matrix and precision metric will be used to evaluate the k-means clustering prediction model.

# creates vector of predicted diagnosis for each observation
predicted <- ifelse(km$cluster == 3, 1, 0)
predicted_f <- factor(predicted, levels = c(0,1))
outcome_f <- factor(cancer_pca_10_kmeans$outcome, levels = c(0,1))

kmeans_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")

kmeans_results$table
##           Reference
## Prediction   0   1
##          0 353   6
##          1   4  15
kmeans_results$byClass["Precision"]
## Precision 
## 0.7894737

The K-Means method predicted cancer in this dataset with a precision of 78.9%. This means that, among the 19 patients the model flagged as having cancer, it correctly identified 15 true positives but incorrectly flagged 4 patients without cancer (false positives). This method satisfies the 7-true-positive quorum.
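As a sanity check, the precision reported by caret matches the hand computation from the k-means confusion matrix above:

```r
# Precision = TP / (TP + FP) from the k-means confusion matrix
TP <- 15  # true positives (predicted 1, reference 1)
FP <- 4   # false positives (predicted 1, reference 0)
round(TP / (TP + FP), 7)  # 0.7894737
```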

2. Hierarchical Clustering

Hierarchical clustering is an unsupervised learning method that groups patients based on how similar they are to one another, without using diagnosis information. Unlike K-Means, which requires pre-selecting the number of clusters, hierarchical clustering builds a tree-like structure (a dendrogram) that shows how patients merge into clusters at different similarity levels. By examining this structure, we can choose a meaningful number of clusters and understand how distinct or cohesive the groups are. This makes hierarchical clustering useful for exploring natural patterns in patient data and identifying groups that may share similar risk profiles.

This code compares different linkage methods used in hierarchical clustering by calculating their agglomerative coefficients. The agglomerative coefficient (AC) measures how well the data fit the clustering structure, with higher values indicating more cohesive clusters.

#Agglomerative coefficient
Average <- round(agnes(cancer_pca_10, method = "average")$ac, 2)
Complete <- round(agnes(cancer_pca_10, method = "complete")$ac, 2)
Single <- round(agnes(cancer_pca_10, method = "single")$ac, 2)
Ward <- round(agnes(cancer_pca_10, method = "ward")$ac, 2)
kable(data.frame(Average, Complete, Single, Ward))
Average   Complete   Single   Ward
   0.86       0.91     0.77   0.96

Based on these results, this analysis will use Ward linkage, which produced the strongest clustering structure (AC = 0.96). Below is the full dendrogram for the dataset.

hc <- agnes(cancer_pca_10, method = "ward")
fviz_dend(hc)

Again, visualizations are used to help determine the optimal number of clusters to use for this analysis.

wssplot <- fviz_nbclust(cancer_pca_10, hcut, method = "wss")
gapplot <- fviz_nbclust(cancer_pca_10, hcut, method = "gap_stat")
silplot <- fviz_nbclust(cancer_pca_10, hcut, method = "silhouette")
plot_grid(wssplot, gapplot, silplot, nrow = 1)

Based on the middle and right plots, either 1 or 2 clusters appear optimal, but the first plot justifies the use of 3 clusters. In the first plot, k=3 represents the first clear “elbow,” after which increases in the number of clusters produce only minimal improvement. This makes k=3 a reasonable choice if the goal is to capture additional structure in the data without overfitting.

# Cut dendrogram into 3 groups
hc_clusters <- cutree(hc, k = 3)
# Add rectangles to the dendrogram to show the different clusters
fviz_dend(hc, k = 3, rect = TRUE)

Above is the same dendrogram as before but this time colored by the 3 cluster split.

The plot below shows a 2-D projection of the three clusters.

fviz_cluster(list(data = cancer_pca_10, cluster = hc_clusters), repel = TRUE)

Below is a visualization of the same clusters but colored by cancer diagnosis, with red points signaling a patient with cancer.

We see that most red points fall into clusters 1 and 3, with only one or two appearing in cluster 2. Based on this visual alone, cluster 3 can be interpreted as the high-risk cancer group, cluster 1 as a medium-risk group, and cluster 2 as essentially a non-cancer group.
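The `tapply()` call below references `cancer_pca_10_hierarch`, which is not built in the visible code; presumably it is `cancer_pca_10` with the cutree cluster labels and the withheld outcome attached. A self-contained sketch of that pattern on synthetic data (using base R `hclust` with `ward.D2` as a stand-in for `agnes(..., method = "ward")`):

```r
# Sketch: attach hierarchical cluster labels plus the withheld outcome,
# then compute the per-cluster outcome rate, as done for the real data below.
set.seed(1)
demo <- as.data.frame(matrix(rnorm(60 * 3), nrow = 60))
hc_demo <- hclust(dist(demo), method = "ward.D2")   # base-R analogue of agnes Ward linkage
demo_hierarch <- demo
demo_hierarch$cluster <- cutree(hc_demo, k = 3)     # cluster label per observation
demo_hierarch$outcome <- rbinom(60, 1, 0.1)         # stand-in for the withheld y
tapply(demo_hierarch$outcome, demo_hierarch$cluster, mean)  # outcome rate by cluster
```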

tapply(cancer_pca_10_hierarch$outcome, cancer_pca_10_hierarch$cluster, mean)
##           1           2           3 
## 0.073170732 0.004854369 1.000000000

This table confirms that cluster 3 is unequivocally a cancer cluster, with 100% of patients in that cluster having cancer. Cluster 1 represents a medium-risk group, with about 7.3% of its patients having cancer. Because minimizing false positives is the priority, only cluster 3 will be flagged as the cancer-risk group in the prediction model.

Now, a confusion matrix and precision metric will be used to evaluate the hierarchical clustering prediction model.

# creates vector of predicted diagnosis for each observation
predicted <- ifelse(cancer_pca_10_hierarch$cluster == 3, 1, 0)
predicted_f <- factor(predicted, levels = c(0,1))
outcome_f <- factor(cancer_pca_10_hierarch$outcome, levels = c(0,1))

hier_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")

hier_results$table
##           Reference
## Prediction   0   1
##          0 357  13
##          1   0   8
hier_results$byClass["Precision"]
## Precision 
##         1

The hierarchical method predicted cancer in this dataset with a precision of 100%. This means that, among the 8 patients the model flagged as having cancer, all 8 were diagnosed with cancer (true positives) and there were 0 false positives. This model does produce a high number of false negatives, but for the purpose of this project, minimizing false positives is the priority, and this method does so very effectively. It also satisfies the 7-true-positive quorum.

3. DBSCAN Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method that groups together data points that are close to one another in dense regions while identifying isolated points as noise or outliers. Instead of forcing every data point into a cluster, DBSCAN focuses on areas where points naturally accumulate. This makes it especially useful when the data contains irregular shapes, varying cluster sizes, or noise. It helps reveal meaningful structure without requiring the number of clusters to be chosen in advance.

With DBSCAN, a widely accepted rule of thumb is to set the minPts argument to at least the number of dimensions + 1. Because there are 10 principal components, minPts = 11 is sufficient.

cancer_pca_10_dbscan <- cancer_pca_10
dbscan::kNNdistplot(cancer_pca_10_dbscan, k = 11); abline(h = 6.4, lty = 2)

The function above calculates, for each point, the distance to its 11th nearest neighbor and plots these distances in ascending order. This plot is used to identify the radius at which points begin to transition from dense regions to sparser regions. The optimal eps value typically appears at the point where the curve shows a clear bend or “elbow.” In this case, the sharpest elbow occurs at approximately 6.4, indicated by the horizontal reference line, which serves as a reasonable choice for the DBSCAN radius parameter.

set.seed(12345)
db <- fpc::dbscan(cancer_pca_10_dbscan, eps = 6.4, MinPts = 11)
fviz_cluster(db, data = cancer_pca_10_dbscan, stand = FALSE, geom = "point",
ellipse = FALSE, show.clust.cent = FALSE, label = "point", repel = TRUE) + theme_classic()

A radius of 6.4 and a minimum of 11 points produces the clustering shown above. Under these parameters, DBSCAN does not form multiple well-separated clusters; instead, it identifies one large primary cluster and a small group of points labeled as outliers. As shown in the table below, 11 data points were classified as outliers.

table(db$cluster)
## 
##   0   1 
##  11 367
# table of % cancer by clusters
cancer_pca_10_dbscan$clusters <- db$cluster
cancer_pca_10_dbscan$outcome <- cancer_raw$y

cluster_means <- cancer_pca_10_dbscan %>%
  group_by(clusters) %>%
  summarize(cancer_rate = mean(outcome))
cluster_means
## # A tibble: 2 × 2
##   clusters cancer_rate
##      <dbl>       <dbl>
## 1        0      0.727 
## 2        1      0.0354

With these parameters, 72.7% of the “outlier” group are cancer patients, compared with only 3.5% in the main cluster. In the prediction model, these outlier points will be treated as the predicted cancer cases.

Now, a confusion matrix and precision metric will be used to evaluate the DBSCAN clustering prediction model.

# creates vector of predicted diagnosis for each observation
# (cluster 0 is DBSCAN's noise label, treated here as predicted cancer)
predicted <- ifelse(cancer_pca_10_dbscan$clusters == 0, 1, 0)
predicted_f <- factor(predicted, levels = c(0,1))
outcome_f <- factor(cancer_pca_10_dbscan$outcome, levels = c(0,1))

dbscan_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")

dbscan_results$table
##           Reference
## Prediction   0   1
##          0 354  13
##          1   3   8
dbscan_results$byClass["Precision"]
## Precision 
## 0.7272727

The DBSCAN method predicted cancer in this dataset with a precision of 72.7%. This means that, among the 11 patients the model flagged as having cancer, 8 were actually diagnosed with cancer (true positives) but 3 were incorrectly flagged (false positives). This model also has a high number of false negatives, but for the purpose of this project, minimizing false positives is the priority. This method satisfies the 7-true-positive quorum.

Outlier Detection Analysis

Two different methods of outlier detection will be used and evaluated on ability to predict cancer diagnosis:

  1. KNN
  2. Isolation Forest

KNN

K-Nearest Neighbors (KNN) outlier detection identifies unusual patients by comparing each patient’s data to their nearest neighbors. If a patient’s measurements are very different from those of their closest neighbors, they receive a high “outlier score,” indicating a higher likelihood of being atypical. In this context, patients with the highest outlier scores are flagged as potential cancer cases.

cancer_pca_10_knn <- cancer_pca_10
df_knn <- get.knn(data = cancer_pca_10_knn, k = 10)
cancer_pca_10_knn$knnscore <- rowMeans(df_knn$nn.dist)

ggplot(cancer_pca_10_knn) + aes(x=PC1, y=knnscore) +
geom_point(size = 5, alpha = 0.5, color = "darkgreen")

This code calculates a KNN outlier score for each patient by averaging the distances to their 10 nearest neighbors. The scatter plot visualizes these scores along the first principal component, highlighting patients who are most atypical in the dataset. The plot below shows the resulting predictions, with patients whose KNN scores exceed 9 flagged as “cancer”.
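The `predicted` column used in the evaluation below is not created in the visible code; presumably it thresholds the KNN score. A self-contained sketch of that step on synthetic data, using a base-R stand-in for `FNN::get.knn` and a hypothetical quantile cutoff in place of the score > 9 rule:

```r
# Sketch: mean distance to the 10 nearest neighbors, then threshold to flag outliers.
set.seed(1)
demo <- matrix(rnorm(100 * 10), nrow = 100)
D <- as.matrix(dist(demo))
# Positions 2:11 of each sorted row are the 10 nearest neighbors (position 1 is self).
knnscore <- apply(D, 1, function(row) mean(sort(row)[2:11]))
# Hypothetical cutoff, analogous to the knnscore > 9 rule in the report.
cutoff <- quantile(knnscore, 0.95)
predicted <- as.integer(knnscore > cutoff)
sum(predicted)  # the handful of most atypical points
```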

Now, a confusion matrix and precision metric will be used to evaluate the KNN outlier detection prediction model.

# creates vector of predicted diagnosis for each observation
predicted_f <- factor(cancer_pca_10_knn$predicted, levels = c(0,1))
outcome_f <- factor(cancer_raw$y, levels = c(0,1))

knn_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")

knn_results$table
##           Reference
## Prediction   0   1
##          0 355  16
##          1   2   5
knn_results$byClass["Precision"]
## Precision 
## 0.7142857

The KNN outlier detection method predicted cancer in this dataset with a precision of 71.4%. This means that, among the 7 patients that the model flagged as having cancer, 5 were actually diagnosed with cancer (true positives) but 2 were incorrectly flagged (false positives). This method does not satisfy the 7 true positive quorum.

Isolation Forest

Isolation Forest is an unsupervised algorithm that identifies unusual or atypical patients by isolating them from the rest of the data. It works by randomly partitioning the data and measuring how quickly each patient can be separated from others. Patients that are isolated more quickly receive higher anomaly scores, indicating a greater likelihood of being a potential cancer case.

cancer_pca_10_isof <- cancer_pca_10
set.seed(12345)

iso <- isolationForest$new(
  sample_size = 256,
  num_trees   = 100
)

iso$fit(cancer_pca_10_isof)
iso_pred <- iso$predict(cancer_pca_10_isof)
cancer_pca_10_isof$iso_score = iso_pred$anomaly_score
ggplot(cancer_pca_10_isof) + aes(x=iso_score) + geom_density()

The density plot above helps determine the isolation-score threshold above which points are considered outliers. Here, 0.67 is chosen.

cancer_pca_10_isof$predicted <- as.factor(ifelse(cancer_pca_10_isof$iso_score >= 0.67, 1, 0))
ggplot(cancer_pca_10_isof) + aes(x = PC1, y = PC2, color = predicted) + geom_point(size = 5, alpha = 0.5) +
geom_text(aes(label = row.names(cancer_pca_10_isof)), hjust = 1 , vjust = -1 ,size = 3 ) +
theme_minimal()

The familiar visual above displays points in principal-component space, with points whose isolation scores exceed 0.67 colored blue to indicate a cancer prediction.

Now, a confusion matrix and precision metric will be used to evaluate the isolation forest outlier detection prediction model.

# creates vector of predicted diagnosis for each observation
predicted_f <- factor(cancer_pca_10_isof$predicted, levels = c(0,1))
outcome_f <- factor(cancer_raw$y, levels = c(0,1))

isof_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")

isof_results$table
##           Reference
## Prediction   0   1
##          0 352  19
##          1   5   2
isof_results$byClass["Precision"]
## Precision 
## 0.2857143

The isolation forest outlier detection method predicted cancer in this dataset with a precision of 28.6%. This means that, among the 7 patients that the model flagged as having cancer, 2 were actually diagnosed with cancer (true positives) but 5 were incorrectly flagged (false positives). This is by far the worst performing prediction model of the group. This method does not satisfy the 7 true positive quorum.

Comparative Performance of Methods

Below is a confusion matrix for each of the five methods explored in this analysis: k-means clustering, hierarchical clustering, DBSCAN clustering, KNN outlier detection, and isolation forest outlier detection.

Red boxes indicate the rate of False Positives: cases where a patient does not actually have cancer, but the method predicted they do. In this healthcare context, minimizing False Positives is critical given the expensive nature of MM’s cancer treatment trials. Here we see that hierarchical clustering was the only method that predicted with zero false positives.

Below is a table showing each method’s performance in terms of accuracy and precision. Precision is particularly important in this context because it measures the proportion of flagged patients who actually have cancer, minimizing the risk of sending a non-cancer patient into the expensive trial.

##            method  accuracy precision
## 1          KMeans 0.9735450 0.7894737
## 2    Hierarchical 0.9656085 1.0000000
## 3          DBSCAN 0.9576720 0.7272727
## 4             KNN 0.9523810 0.7142857
## 5 IsolationForest 0.9365079 0.2857143
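Both columns can be derived directly from each confusion matrix; for example, for hierarchical clustering:

```r
# Accuracy and precision recomputed from the hierarchical confusion matrix
TN <- 357; FN <- 13; FP <- 0; TP <- 8
accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # correct predictions / all predictions
precision <- TP / (TP + FP)                   # true positives / all flagged
round(c(accuracy = accuracy, precision = precision), 7)
# accuracy 0.9656085, precision 1
```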

To provide more context for the table, below is a visualization of these metrics.

As seen in the table and bar graph above, all methods achieved similar levels of overall accuracy, but their precision varied considerably. The Isolation Forest method performed the worst, with a precision of only 28.6%, indicating it incorrectly flagged many non-cancer cases. K-Means, DBSCAN, and KNN showed moderate precision between 71% and 79%, an improvement over Isolation Forest but still producing a substantial number of false positives. Hierarchical clustering outperformed all other methods, achieving a precision of 100%, meaning it identified zero false positives. This highlights its potential as the most reliable method for identifying ideal candidates for the cancer treatment clinical trial.

Recommendations & Takeaways

Based on the analyses of PCA, K-Means, Hierarchical Clustering, DBSCAN, KNN outlier detection, and Isolation Forest, the patient health data show promise for selecting ideal clinical trial candidates. Among the five methods tested, Hierarchical Clustering demonstrated the highest precision at 100%, which is critical in this healthcare context to minimize the chance of sending a non-cancerous patient into the expensive clinical trial. While other methods provide useful insights, they are less reliable for correctly identifying cancer cases.

Key Takeaways:

  • Prioritize Hierarchical Clustering: For operational use, Hierarchical Clustering provides the best balance of precision and interpretability, ensuring all flagged patients are likely to be true cancer cases.

  • Leverage PCA for dimensionality reduction: Reducing the number of variables while retaining key variance improves clustering stability and interpretability. This is especially important when more health indicator variables that are likely highly correlated with one another are added to the analysis in a real life application.

Limitations & Next Steps

While the analyses demonstrate that unsupervised learning methods can help identify cancer cases, several limitations should be considered:

  • No train/test split, reduced generalizability: Because the full dataset was used to ensure at least seven true positive identifications, model performance could not be evaluated on unseen data. As a result, precision estimates may be optimistic, and real-world performance may differ when applied to new patient populations.

  • Limited variables: The current dataset includes only 30 patient health indicators, which may not capture the full complexity of cancer risk. Real-world applications would likely involve many more features.

  • Small dataset: With only 378 observations, performance metrics such as sensitivity and accuracy are subject to higher variability. Larger datasets would provide more reliable estimates.

  • Parameter sensitivity: Method outcomes depend on hyperparameter choices, including the number of clusters, k in KNN, and eps in DBSCAN. Careful tuning is required for robust results.

  • Dimensionality reduction trade-offs: While PCA helps reduce complexity, it may obscure subtle differences between patients, potentially affecting the detection of nuanced patterns.

Next steps:

  • Expand dataset and features: Incorporate additional health indicators, longitudinal measurements, and demographic variables to improve clustering accuracy and robustness.

  • Refine method selection and tuning: Systematically optimize parameters for each method, potentially combining multiple approaches to maximize sensitivity.

  • Validate on external cohorts: Test the models on independent patient populations to assess generalizability and real-world performance.

Some code used in this analysis is not visible in order to keep the deliverable clean for readability. All code can be found in the RMD file attached. Please direct any and all questions to Whitney Zhang.