This analysis explores whether unsupervised learning techniques can identify ideal candidates for Michigan Medicine’s Cancer Clinical Trial.
OBJECTIVE:
The goal is to correctly identify patients who are very likely to have cancer while minimizing False Positives. In other words, we aim to ensure that any patient flagged by our methodology as having cancer truly has cancer, maximizing Precision. At the same time, the method must identify at least seven True Positive patients in the dataset to meet the quorum requirement for the clinical trial.
METHODS:
Principal Components Analysis (PCA) was used to summarize the underlying structure of the health indicators and reduce redundancy. Multiple clustering techniques were applied in the reduced feature space to test whether patient groups naturally separate according to cancer status. Outlier detection methods were also used to identify individuals whose health profiles deviate substantially from the population. The true cancer labels were withheld during modeling and introduced only during evaluation.
RECOMMENDATIONS:
Among the methods tested, hierarchical clustering provided the best prediction model, achieving 100% precision and ensuring that every cancer case flagged was a true case. While the other methods provide useful insights, they are less reliable for correctly identifying high-risk patients. These findings support the use of unsupervised learning as a preliminary screening tool to identify ideal candidates for the Michigan Medicine cancer treatment clinical trial, with an emphasis on maximizing precision to minimize incorrect selections.
Michigan Medicine is piloting a highly promising but very costly cancer treatment clinical trial. Given the high expense and resource requirements, it is critical to ensure that only patients who are very likely to have cancer are selected for participation. The objective of this analysis is to design an Unsupervised Learning methodology that identifies patients with the highest likelihood of having cancer while minimizing False Positives. In other words, our primary goal is to maximize Precision, defined as the proportion of predicted cancer patients who are actually diagnosed with cancer.
## 'data.frame': 378 obs. of 31 variables:
## $ X1 : num 0.3104 0.2887 0.1194 0.2863 0.0575 ...
## $ X2 : num 0.1573 0.2029 0.0923 0.2946 0.2411 ...
## $ X3 : num 0.3018 0.2891 0.1144 0.2683 0.0547 ...
## $ X4 : num 0.1793 0.1597 0.0553 0.1613 0.0248 ...
## $ X5 : num 0.408 0.495 0.449 0.336 0.301 ...
## $ X6 : num 0.1899 0.3301 0.1397 0.0561 0.1228 ...
## $ X7 : num 0.1561 0.107 0.0693 0.06 0.0372 ...
## $ X8 : num 0.2376 0.1546 0.1032 0.1453 0.0294 ...
## $ X9 : num 0.417 0.458 0.381 0.206 0.358 ...
## $ X10: num 0.162 0.382 0.402 0.183 0.317 ...
## $ X11: num 0.0574 0.0267 0.06 0.0262 0.0162 ...
## $ X12: num 0.0947 0.0856 0.1363 0.438 0.1318 ...
## $ X13: num 0.0613 0.0295 0.0543 0.0195 0.0159 ...
## $ X14: num 0.0313 0.0147 0.01662 0.01374 0.00262 ...
## $ X15: num 0.2294 0.081 0.2683 0.0897 0.2466 ...
## $ X16: num 0.0927 0.1256 0.0906 0.0199 0.1067 ...
## $ X17: num 0.0603 0.0429 0.0501 0.0339 0.0401 ...
## $ X18: num 0.249 0.123 0.269 0.22 0.112 ...
## $ X19: num 0.168 0.125 0.174 0.265 0.251 ...
## $ X20: num 0.0485 0.0529 0.0716 0.0305 0.0583 ...
## $ X21: num 0.2554 0.2337 0.0818 0.191 0.0368 ...
## $ X22: num 0.193 0.226 0.097 0.288 0.265 ...
## $ X23: num 0.2455 0.2275 0.0733 0.1696 0.0341 ...
## $ X24: num 0.1293 0.1094 0.0319 0.0887 0.014 ...
## $ X25: num 0.481 0.396 0.404 0.171 0.387 ...
## $ X26: num 0.1455 0.2429 0.0849 0.0183 0.1052 ...
## $ X27: num 0.1909 0.151 0.0708 0.0386 0.055 ...
## $ X28: num 0.4426 0.2503 0.214 0.1723 0.0881 ...
## $ X29: num 0.2783 0.3191 0.1745 0.0832 0.3036 ...
## $ X30: num 0.1151 0.1757 0.1488 0.0436 0.125 ...
## $ y : int 0 0 0 0 0 0 0 0 0 0 ...
This data set contains health data for 378 patients across 30 health variables, plus an outcome variable indicating whether each patient was diagnosed with cancer. For this analysis, the outcome indicator is withheld during modeling and used only to evaluate the prediction methods.
## X1 X2 X3 X4
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1970 1st Qu.:0.1852 1st Qu.:0.1902 1st Qu.:0.1009
## Median :0.2510 Median :0.2650 Median :0.2411 Median :0.1361
## Mean :0.2551 Mean :0.2860 Mean :0.2482 Mean :0.1444
## 3rd Qu.:0.3111 3rd Qu.:0.3543 3rd Qu.:0.3011 3rd Qu.:0.1789
## Max. :0.7681 Max. :1.0000 Max. :0.7581 Max. :0.6475
## X5 X6 X7 X8
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.2799 1st Qu.:0.1151 1st Qu.:0.04978 1st Qu.:0.07741
## Median :0.3586 Median :0.1755 Median :0.09106 Median :0.12020
## Mean :0.3676 Mean :0.2008 Mean :0.12538 Mean :0.14649
## 3rd Qu.:0.4466 3rd Qu.:0.2588 3rd Qu.:0.15575 3rd Qu.:0.17662
## Max. :1.0000 Max. :0.7920 Max. :0.99906 Max. :0.90606
## X9 X10 X11 X12
## Min. :0.0000 Min. :0.03981 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.2673 1st Qu.:0.18197 1st Qu.:0.03576 1st Qu.:0.09995
## Median :0.3346 Median :0.24663 Median :0.05454 Median :0.16560
## Mean :0.3523 Mean :0.27615 Mean :0.06823 Mean :0.19025
## 3rd Qu.:0.4301 3rd Qu.:0.33798 3rd Qu.:0.08739 3rd Qu.:0.25030
## Max. :0.8500 Max. :0.96441 Max. :0.39960 Max. :1.00000
## X13 X14 X15 X16
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.03369 1st Qu.:0.01634 1st Qu.:0.1190 1st Qu.:0.07115
## Median :0.05473 Median :0.02463 Median :0.1635 Median :0.10979
## Mean :0.06402 Mean :0.03136 Mean :0.1854 Mean :0.15027
## 3rd Qu.:0.08137 3rd Qu.:0.03635 3rd Qu.:0.2291 3rd Qu.:0.19355
## Max. :0.43787 Max. :0.30482 Max. :0.6818 Max. :0.78220
## X17 X18 X19 X20
## Min. :0.00000 Min. :0.0000 Min. :0.02332 Min. :0.00000
## 1st Qu.:0.02915 1st Qu.:0.1230 1st Qu.:0.10677 1st Qu.:0.04191
## Median :0.04908 Median :0.1743 Median :0.15609 Median :0.06945
## Mean :0.06829 Mean :0.1920 Mean :0.17913 Mean :0.09652
## 3rd Qu.:0.08315 3rd Qu.:0.2399 3rd Qu.:0.23133 3rd Qu.:0.11536
## Max. :1.00000 Max. :1.0000 Max. :0.75390 Max. :1.00000
## X21 X22 X23 X24
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1538 1st Qu.:0.2033 1st Qu.:0.1433 1st Qu.:0.06694
## Median :0.1974 Median :0.2933 Median :0.1842 Median :0.09031
## Mean :0.2071 Mean :0.3181 Mean :0.1959 Mean :0.10201
## 3rd Qu.:0.2508 3rd Qu.:0.4017 3rd Qu.:0.2364 3rd Qu.:0.12366
## Max. :0.8211 Max. :0.8755 Max. :0.7789 Max. :0.67804
## X25 X26 X27 X28
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.2665 1st Qu.:0.08636 1st Qu.:0.06407 1st Qu.:0.1856
## Median :0.3651 Median :0.14341 Median :0.11829 Median :0.2709
## Mean :0.3657 Mean :0.16513 Mean :0.14885 Mean :0.2791
## 3rd Qu.:0.4496 3rd Qu.:0.21183 3rd Qu.:0.19399 3rd Qu.:0.3495
## Max. :0.8547 Max. :1.00000 Max. :1.00000 Max. :1.0000
## X29 X30 y
## Min. :0.0001971 Min. :0.001115 Min. :0.00000
## 1st Qu.:0.1711512 1st Qu.:0.099403 1st Qu.:0.00000
## Median :0.2245220 Median :0.148006 Median :0.00000
## Mean :0.2329205 Mean :0.168817 Mean :0.05556
## 3rd Qu.:0.2892273 3rd Qu.:0.212613 3rd Qu.:0.00000
## Max. :0.6227085 Max. :1.000000 Max. :1.00000
A numeric summary of each variable confirms that the data do not need further scaling: every variable already lies on a comparable 0-to-1 range. When variables differ greatly in scale, those with larger numeric ranges dominate distance calculations and exert disproportionate influence on PCA, clustering, and outlier detection. Keeping all variables on a comparable scale prevents any single measure from overwhelming the analysis and allows patterns to reflect genuine structure in the data rather than artifacts of measurement units.
Below is a check for missing values in the dataset. If meaningful amounts of data are missing then imputation methods would need to be used to continue analysis. However, as the table below shows, there are no missing values in the dataset.
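The missing-value table below can be produced with base R. This is a hedged sketch: `cancer_raw` is assumed to be the name of the loaded data frame, so a small toy frame stands in for it here.

```r
# Hedged sketch: count missing values per column, as in the table below.
# A toy data frame stands in for the assumed `cancer_raw` object.
toy <- data.frame(X1 = c(0.31, 0.29, NA), X2 = c(0.16, 0.20, 0.09))
missing_counts <- data.frame(Missing = colSums(is.na(toy)))
missing_counts  # one row per variable, count of NAs
```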
## Missing
## X1 0
## X2 0
## X3 0
## X4 0
## X5 0
## X6 0
## X7 0
## X8 0
## X9 0
## X10 0
## X11 0
## X12 0
## X13 0
## X14 0
## X15 0
## X16 0
## X17 0
## X18 0
## X19 0
## X20 0
## X21 0
## X22 0
## X23 0
## X24 0
## X25 0
## X26 0
## X27 0
## X28 0
## X29 0
## X30 0
## y 0
A note on splitting data into Test/Train:
Normally, it is good practice to split the data into training and test sets so that model performance can be evaluated on unseen data. However, due to the small size of this dataset and the limited number of cancer cases, a standard train/test split would likely leave fewer than seven true positives in the test set, making it impossible to meet the clinical trial quorum requirement. To ensure the analysis can identify at least seven high-likelihood cancer patients, the full dataset will be used for both modeling and evaluation. This allows more reliable identification of potential candidates while still prioritizing precision, the primary objective of this analysis.
PCA dimension reduction is useful here because it compresses many correlated clinical variables into a smaller set of uncorrelated components that retain most of the original variation. This reduces noise, removes redundancy, and improves the stability of distance-based clustering methods. While this dataset only includes 30 patient health variables, in practical applications there would typically be many more, making the role of PCA even more important for producing reliable and interpretable clustering results.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.3582 2.6065 1.70096 1.48309 1.42571 1.09873 0.86315
## Proportion of Variance 0.3759 0.2265 0.09644 0.07332 0.06776 0.04024 0.02483
## Cumulative Proportion 0.3759 0.6024 0.69883 0.77215 0.83990 0.88014 0.90498
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.6929 0.66264 0.64145 0.5693 0.54596 0.45359 0.43020
## Proportion of Variance 0.0160 0.01464 0.01372 0.0108 0.00994 0.00686 0.00617
## Cumulative Proportion 0.9210 0.93561 0.94933 0.9601 0.97007 0.97693 0.98310
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.31743 0.31151 0.24159 0.20925 0.20575 0.19692 0.16504
## Proportion of Variance 0.00336 0.00323 0.00195 0.00146 0.00141 0.00129 0.00091
## Cumulative Proportion 0.98645 0.98969 0.99163 0.99309 0.99450 0.99580 0.99671
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.16113 0.14315 0.12610 0.12035 0.10033 0.09392 0.04807
## Proportion of Variance 0.00087 0.00068 0.00053 0.00048 0.00034 0.00029 0.00008
## Cumulative Proportion 0.99757 0.99825 0.99878 0.99927 0.99960 0.99990 0.99997
## PC29 PC30
## Standard deviation 0.02568 0.01197
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion 1.00000 1.00000
This PCA produced 30 principal components, but using all of them would undermine the goal of dimension reduction. The components must be narrowed to a smaller subset. The two visualizations below help determine how many components should be retained for the analysis.
Because the stakes are high with this data (supporting cancer diagnosis), the first ten components will be used to retain approximately 95% of the total variance, capturing as much meaningful variation between patients as possible.
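The component-selection step can be sketched with base R's `prcomp`. This is a hedged illustration, not the report's exact code: a random toy matrix stands in for the 378 x 30 health-indicator data, and the object names are assumptions.

```r
# Hedged sketch: run PCA and keep the smallest number of components
# whose cumulative variance reaches ~95%, as done above with ten PCs.
set.seed(1)
toy <- matrix(runif(600), nrow = 60, ncol = 10)  # stand-in for the patient data
pca <- prcomp(toy, center = TRUE, scale. = FALSE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative proportion of variance
n_keep <- which(cum_var >= 0.95)[1]              # smallest k reaching 95%
scores <- as.data.frame(pca$x[, 1:n_keep])       # reduced feature space
```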
Three different methods of clustering will be used and evaluated to predict cancer diagnosis:
K-Means Clustering is an unsupervised learning method that groups patients with similar health characteristics into clusters. The algorithm assigns each patient to one of k clusters based on the similarity of their data, so that patients within a cluster are more similar to each other than to patients in other clusters. By examining these clusters, we can identify patterns in the data that may correspond to higher or lower likelihoods of having cancer, without using the outcome variable during the clustering process.
The following visualizations help determine the optimal number of clusters to use for this method.
wssplot <- fviz_nbclust(cancer_pca_10, kmeans, method = "wss")
gapplot <- fviz_nbclust(cancer_pca_10, kmeans, method = "gap_stat")
silplot <- fviz_nbclust(cancer_pca_10, kmeans, method = "silhouette")
plot_grid(wssplot, gapplot, silplot, nrow = 1)

Based on the visualizations above, 2 clusters appears optimal. However, the WSS plot on the left shows a prominent "elbow" at k = 4, which allows a finer separation of high- vs. low-risk patients. Moving forward, this method will use 4 clusters.
set.seed(12345)
km <- kmeans(cancer_pca_10, centers = 4)
cancer_pca_10_kmeans <- cancer_pca_10
cancer_pca_10_kmeans$cluster <- km$cluster
fviz_cluster(km, data = cancer_pca_10_kmeans)

The plot above provides a 2-D visualization of the four clusters. Clusters 2 and 4 are relatively dense, while clusters 1 and 3 span larger areas. Although the clusters appear intertwined in this view, the underlying analysis occurs in a 10-dimensional space. Reducing ten dimensions down to two inevitably compresses and overlaps information, so this visualization cannot fully represent the true separation between clusters.
Below is a visualization of the same clusters, colored by cancer diagnosis, with red points marking patients with cancer. The majority of red points belong to cluster 3, a couple belong to cluster 2, a single point belongs to cluster 1, and cluster 4 contains no red points. Based on this visual alone, we can infer that cluster 3 is the high-risk cancer group, cluster 2 a medium-risk group, cluster 1 a low-risk group, and cluster 4 essentially the non-cancer group.
## 1 2 3 4
## 0.02409639 0.03539823 0.78947368 0.00000000
The table above confirms the interpretation from the previous visualization: 78.9% of patients in cluster 3 have cancer, compared with 3.5% in cluster 2, 2.4% in cluster 1, and 0% in cluster 4. Cluster 3 is therefore the high-risk cluster that the prediction model will use to classify individuals as cancer vs. no cancer.
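The per-cluster cancer rates above amount to a grouped mean of the withheld outcome. A hedged base-R sketch, with toy vectors standing in for the k-means labels and the outcome:

```r
# Hedged sketch: proportion of cancer cases in each cluster, as in the table above.
# Toy vectors stand in for the k-means cluster labels and the withheld outcome.
cluster <- c(1, 1, 2, 2, 3, 3, 3, 4)
outcome <- c(0, 0, 0, 1, 1, 1, 0, 0)
cancer_rate <- tapply(outcome, cluster, mean)  # mean outcome per cluster
cancer_rate
```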
Now, a confusion matrix and precision metric will be used to evaluate the k-means clustering prediction model.
# creates vector of predicted diagnosis for each observation
predicted <- ifelse(km$cluster == 3, 1, 0)
predicted_f <- factor(predicted, levels = c(0, 1))
# attach the withheld outcome for evaluation only
cancer_pca_10_kmeans$outcome <- cancer_raw$y
outcome_f <- factor(cancer_pca_10_kmeans$outcome, levels = c(0, 1))
kmeans_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")
kmeans_results$table

## Reference
## Prediction 0 1
## 0 353 6
## 1 4 15
## Precision
## 0.7894737
The K-Means method predicted cancer in this dataset with a precision of 78.9%: among the 19 patients flagged as having cancer, 15 were true positives and 4 were false positives. This method satisfies the 7-true-positive quorum.
Hierarchical clustering is an unsupervised learning method that groups patients based on how similar they are to one another, without using diagnosis information. Unlike K-Means, which requires pre-selecting the number of clusters, hierarchical clustering builds a tree-like structure (a dendrogram) that shows how patients merge into clusters at different similarity levels. By examining this structure, we can choose a meaningful number of clusters and understand how distinct or cohesive the groups are. This makes hierarchical clustering useful for exploring natural patterns in patient data and identifying groups that may share similar risk profiles.
This code compares different linkage methods used in hierarchical clustering by calculating their agglomerative coefficients. The agglomerative coefficient (AC) measures how well the data fit the clustering structure, with higher values indicating more cohesive clusters.
#Agglomerative coefficient
Average <- round(agnes(cancer_pca_10, method = "average")$ac, 2)
Complete <- round(agnes(cancer_pca_10, method = "complete")$ac, 2)
Single <- round(agnes(cancer_pca_10, method = "single")$ac, 2)
Ward <- round(agnes(cancer_pca_10, method = "ward")$ac, 2)
kable(data.frame(Average, Complete, Single, Ward))

| Average | Complete | Single | Ward |
|---|---|---|---|
| 0.86 | 0.91 | 0.77 | 0.96 |
Based on these results, the analysis will use Ward linkage, which yields the strongest clustering structure (AC = 0.96). Below is the full dendrogram for the data.
Again, visualizations are used to help determine the optimal number of clusters to use for this analysis.
wssplot <- fviz_nbclust(cancer_pca_10, hcut, method = "wss")
gapplot <- fviz_nbclust(cancer_pca_10, hcut, method = "gap_stat")
silplot <- fviz_nbclust(cancer_pca_10, hcut, method = "silhouette")
plot_grid(wssplot, gapplot, silplot, nrow = 1)

Based on the middle and right plots, either 1 or 2 clusters appear optimal, but the first plot justifies the use of 3 clusters: k = 3 marks the first clear "elbow," after which additional clusters produce only minimal improvement. This makes k = 3 a reasonable choice for capturing additional structure in the data without overfitting.
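The `hc` object cut below is assumed to come from Ward-linkage hierarchical clustering on the ten-component scores. A minimal base-R sketch of that step, with a toy matrix standing in for the PC scores:

```r
# Hedged sketch: Ward-linkage hierarchical clustering, as assumed for `hc`.
# A toy matrix stands in for the ten principal-component scores.
set.seed(1)
toy <- matrix(rnorm(200), nrow = 40, ncol = 5)
hc_sketch <- hclust(dist(toy), method = "ward.D2")  # base-R analogue of agnes/ward
clusters_sketch <- cutree(hc_sketch, k = 3)         # cut the tree into 3 groups
table(clusters_sketch)                              # cluster sizes
```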
# Cut dendrogram into 3 groups
hc_clusters <- cutree(hc, k = 3)
# Add rectangles to the dendrogram to show the different clusters
fviz_dend(hc, k = 3, rect = TRUE)

Above is the same dendrogram as before, this time colored by the 3-cluster split.
The plot below shows a 2-D projection of the three clusters.
Below is a visualization of the same clusters but colored by cancer diagnosis, with red points signaling a patient with cancer.
We see that most red points fall into clusters 1 and 3, with only one or two appearing in cluster 2. Based on this visual alone, cluster 3 can be interpreted as the high-risk cancer group, cluster 1 as a medium-risk group, and cluster 2 as essentially a non-cancer group.
## 1 2 3
## 0.073170732 0.004854369 1.000000000
This table confirms that cluster 3 is unequivocally a cancer cluster, with 100% of its patients having cancer. Cluster 1 is a medium-risk group, with about 7.3% of its patients having cancer. Because minimizing false positives is the priority, only cluster 3 will be flagged as the cancer-risk group in the prediction model.
Now, a confusion matrix and precision metric will be used to evaluate the hierarchical clustering prediction model.
# creates vector of predicted diagnosis for each observation
cancer_pca_10_hierarch <- cancer_pca_10
cancer_pca_10_hierarch$cluster <- hc_clusters
cancer_pca_10_hierarch$outcome <- cancer_raw$y
predicted <- ifelse(cancer_pca_10_hierarch$cluster == 3, 1, 0)
predicted_f <- factor(predicted, levels = c(0, 1))
outcome_f <- factor(cancer_pca_10_hierarch$outcome, levels = c(0, 1))
hier_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")
hier_results$table

## Reference
## Prediction 0 1
## 0 357 13
## 1 0 8
## Precision
## 1
The hierarchical method predicted cancer in this dataset with a precision of 100%: all 8 patients flagged as having cancer were true positives, with 0 false positives. The model does produce a high number of false negatives, but because minimizing false positives is the priority here, this method serves the goal very effectively. It also satisfies the 7-true-positive quorum.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering method that groups together data points that are close to one another in dense regions while identifying isolated points as noise or outliers. Instead of forcing every data point into a cluster, DBSCAN focuses on areas where points naturally accumulate. This makes it especially useful when the data contains irregular shapes, varying cluster sizes, or noise. It helps reveal meaningful structure without requiring the number of clusters to be chosen in advance.
With DBSCAN, a widely accepted rule for the minPts argument is to use at least the number of dimensions plus one. Because there are 10 principal components, minPts = 11 is used.
cancer_pca_10_dbscan <- cancer_pca_10
dbscan::kNNdistplot(cancer_pca_10_dbscan, k = 11); abline(h = 6.4, lty = 2)

The function above calculates, for each point, the distance to its 11th nearest neighbor and plots these distances in ascending order. The plot is used to identify the radius at which points transition from dense regions to sparser ones; the optimal eps value typically appears where the curve shows a clear bend or "elbow." Here the sharpest elbow occurs at approximately 6.4, indicated by the horizontal reference line, which serves as a reasonable choice for the DBSCAN radius parameter.
set.seed(12345)
db <- fpc::dbscan(cancer_pca_10_dbscan, eps = 6.4, MinPts = 11)
fviz_cluster(db, data = cancer_pca_10_dbscan, stand = FALSE, geom = "point",
             ellipse = FALSE, show.clust.cent = FALSE, label = "point", repel = TRUE) + theme_classic()

A radius of 6.4 and a minimum of 11 points produce the clustering shown above. Under these parameters, DBSCAN does not form multiple well-separated clusters; instead, it identifies one large primary cluster and a small group of points labeled as outliers. As the table below shows, 11 data points were classified as outliers.
##
## 0 1
## 11 367
# table of % cancer by clusters
cancer_pca_10_dbscan$clusters <- db$cluster
cancer_pca_10_dbscan$outcome <- cancer_raw$y
cluster_means <- cancer_pca_10_dbscan %>%
group_by(clusters) %>%
summarize(cancer_rate = mean(outcome))
cluster_means

## # A tibble: 2 × 2
## clusters cancer_rate
## <dbl> <dbl>
## 1 0 0.727
## 2 1 0.0354
With these parameters, 72.7% of the "outlier" group are cancer patients, compared with only 3.5% in the main cluster. In the prediction model, these outlier points will be treated as the predicted cancer cases.
Now, a confusion matrix and precision metric will be used to evaluate the DBSCAN clustering prediction model.
# creates vector of predicted diagnosis for each observation
predicted <- ifelse(cancer_pca_10_dbscan$clusters == 0, 1, 0)
predicted_f <- factor(predicted, levels = c(0, 1))
outcome_f <- factor(cancer_pca_10_dbscan$outcome, levels = c(0, 1))
dbscan_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")
dbscan_results$table

## Reference
## Prediction 0 1
## 0 354 13
## 1 3 8
## Precision
## 0.7272727
The DBSCAN method predicted cancer in this dataset with a precision of 72.7%: among the 11 patients flagged as having cancer, 8 were true positives and 3 were false positives. This model also produces a high number of false negatives, but minimizing false positives remains the priority. This method satisfies the 7-true-positive quorum.
Two different methods of outlier detection will be used and evaluated on ability to predict cancer diagnosis:
K-Nearest Neighbors (KNN) outlier detection identifies unusual patients by comparing each patient’s data to their nearest neighbors. If a patient’s measurements are very different from those of their closest neighbors, they receive a high “outlier score,” indicating a higher likelihood of being atypical. In this context, patients with the highest outlier scores are flagged as potential cancer cases.
cancer_pca_10_knn <- cancer_pca_10
df_knn <- get.knn(data = cancer_pca_10_knn, k = 10)
cancer_pca_10_knn$knnscore <- rowMeans(df_knn$nn.dist)
ggplot(cancer_pca_10_knn) + aes(x = PC1, y = knnscore) +
  geom_point(size = 5, alpha = 0.5, color = "darkgreen")

This code calculates a KNN outlier score for each patient by averaging the distances to their 10 nearest neighbors. The scatter plot visualizes these scores along the first principal component, highlighting the most atypical patients. The plot below shows the resulting predictions, with KNN scores above 9 flagged as "cancer."
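The score-and-threshold step can be sketched in base R without the `get.knn` helper, by averaging each point's k smallest pairwise distances. This is a hedged illustration: a toy matrix stands in for the PC scores, and the cutoff here is a percentile rather than the value 9, which is specific to the real data's scale.

```r
# Hedged sketch: mean distance to the k nearest neighbours as an outlier score,
# then a threshold flag. Toy data stands in for the ten-component scores.
set.seed(1)
toy <- matrix(rnorm(100), nrow = 20, ncol = 5)
k <- 3
d <- as.matrix(dist(toy))
# sort each row and drop position 1 (the self-distance of 0)
knnscore <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))
predicted <- ifelse(knnscore > quantile(knnscore, 0.9), 1, 0)  # flag top 10%
```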
Now, a confusion matrix and precision metric will be used to evaluate the KNN outlier detection prediction model.
# creates vector of predicted diagnosis for each observation
predicted_f <- factor(cancer_pca_10_knn$predicted, levels = c(0,1))
outcome_f <- factor(cancer_raw$y, levels = c(0,1))
knn_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")
knn_results$table

## Reference
## Prediction 0 1
## 0 355 16
## 1 2 5
## Precision
## 0.7142857
The KNN outlier detection method predicted cancer in this dataset with a precision of 71.4%: among the 7 patients flagged as having cancer, 5 were true positives and 2 were false positives. With only 5 true positives, this method does not satisfy the 7-true-positive quorum.
Isolation Forest is an unsupervised algorithm that identifies unusual or atypical patients by isolating them from the rest of the data. It works by randomly partitioning the data and measuring how quickly each patient can be separated from others. Patients that are isolated more quickly receive higher anomaly scores, indicating a greater likelihood of being a potential cancer case.
cancer_pca_10_isof <- cancer_pca_10
set.seed(12345)
iso <- isolationForest$new(
sample_size = 256,
num_trees = 100
)
iso$fit(cancer_pca_10_isof)
iso_pred <- iso$predict(cancer_pca_10_isof)
cancer_pca_10_isof$iso_score = iso_pred$anomaly_score
ggplot(cancer_pca_10_isof) + aes(x = iso_score) + geom_density()

The density plot above helps determine the isolation-score threshold beyond which points are considered outliers. Here a threshold of 0.67 is chosen.
cancer_pca_10_isof$predicted <- as.factor(ifelse(cancer_pca_10_isof$iso_score >= 0.67, 1, 0))
ggplot(cancer_pca_10_isof) + aes(x = PC1, y = PC2, color = predicted) + geom_point(size = 5, alpha = 0.5) +
  geom_text(aes(label = row.names(cancer_pca_10_isof)), hjust = 1, vjust = -1, size = 3) +
  theme_minimal()

The plot above displays points in PC space, with predicted cancer cases (isolation scores above 0.67) colored blue.
Now, a confusion matrix and precision metric will be used to evaluate the isolation forest outlier detection prediction model.
# creates vector of predicted diagnosis for each observation
predicted_f <- factor(cancer_pca_10_isof$predicted, levels = c(0,1))
outcome_f <- factor(cancer_raw$y, levels = c(0,1))
isof_results <- confusionMatrix(predicted_f, outcome_f, positive = "1")
isof_results$table

## Reference
## Prediction 0 1
## 0 352 19
## 1 5 2
## Precision
## 0.2857143
The isolation forest outlier detection method predicted cancer in this dataset with a precision of 28.6%: among the 7 patients flagged as having cancer, 2 were true positives and 5 were false positives. This is by far the worst-performing prediction model of the group, and with only 2 true positives it does not satisfy the 7-true-positive quorum.
Below is a confusion matrix for each of the five methods explored in this analysis: k-means clustering, hierarchical clustering, DBSCAN clustering, KNN outlier detection, and isolation forest outlier detection.
Red boxes indicate false positives: cases where a patient does not actually have cancer but the method predicted they do. In this healthcare context, minimizing false positives is critical given the expense of Michigan Medicine's cancer treatment trials. Hierarchical clustering was the only method able to predict with zero false positives.
Below is a table showing each method's performance in terms of accuracy and precision. Precision is particularly important in this context because it measures the proportion of flagged patients who actually have cancer, minimizing the risk of enrolling non-cancer patients in the trial.
## method accuracy precision
## 1 KMeans 0.9735450 0.7894737
## 2 Hierarchical 0.9656085 1.0000000
## 3 DBSCAN 0.9576720 0.7272727
## 4 KNN 0.9523810 0.7142857
## 5 IsolationForest 0.9365079 0.2857143
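The precision and accuracy figures in the table can be recomputed directly from any of the confusion matrices; for example, the K-Means matrix reported earlier:

```r
# Precision and accuracy from the K-Means confusion matrix reported above
# (rows = predicted 0/1, columns = reference 0/1).
cm <- matrix(c(353, 4, 6, 15), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
precision <- cm["1", "1"] / sum(cm["1", ])  # TP / (TP + FP)
accuracy  <- sum(diag(cm)) / sum(cm)        # (TN + TP) / total
round(c(precision = precision, accuracy = accuracy), 7)
```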
To provide more context on the contents of the table, below is a visualization of these metrics.
As seen in the table and bar graph above, all methods achieved similar levels of overall accuracy, but their precision varied considerably. The Isolation Forest method performed the worst, with a precision of only 28.6%, indicating it incorrectly flagged many non-cancer cases. K-Means, DBSCAN, and KNN showed moderate precision between 71% and 79%, an improvement over Isolation Forest but still producing a substantial number of false positives. Hierarchical clustering outperformed all other methods, achieving a precision of 100% with zero false positives. This highlights its potential as the most reliable method for identifying ideal candidates for the cancer treatment clinical trial.
Based on the analyses of PCA, K-Means, Hierarchical Clustering, DBSCAN, KNN outlier detection, and Isolation Forest, the patient health data shows promise for optimal selection of clinical trial candidacy. Among the five methods tested, Hierarchical Clustering demonstrated the highest precision of 100%, which is critical in this healthcare context to minimize chances of sending a non-cancerous patient into the expensive clinical trial. While other methods provide useful insights, they are less reliable for correctly identifying cancer cases.
Key Takeaways:
Prioritize Hierarchical Clustering: For operational use, Hierarchical Clustering provides the best balance of precision and interpretability, ensuring all flagged patients are likely to be true cancer cases.
Leverage PCA for dimensionality reduction: Reducing the number of variables while retaining key variance improves clustering stability and interpretability. This is especially important when more health indicator variables that are likely highly correlated with one another are added to the analysis in a real life application.
While the analyses demonstrate that unsupervised learning methods can help identify cancer cases, several limitations should be considered:
No train/test split, reduced generalizability: Because the full dataset was used to ensure at least seven true positive identifications, model performance could not be evaluated on unseen data. As a result, precision estimates may be optimistic, and real-world performance may differ when applied to new patient populations.
Limited variables: The current dataset includes only 30 patient health indicators, which may not capture the full complexity of cancer risk. Real-world applications would likely involve many more features.
Small dataset: With only 378 observations, performance metrics such as precision and accuracy are subject to higher variability. Larger datasets would provide more reliable estimates.
Parameter sensitivity: Method outcomes depend on hyperparameter choices, including the number of clusters, k in KNN, and eps in DBSCAN. Careful tuning is required for robust results.
Dimensionality reduction trade-offs: While PCA helps reduce complexity, it may obscure subtle differences between patients, potentially affecting the detection of nuanced patterns.
Next steps:
Expand dataset and features: Incorporate additional health indicators, longitudinal measurements, and demographic variables to improve clustering accuracy and robustness.
Refine method selection and tuning: Systematically optimize parameters for each method, potentially combining multiple approaches to maximize sensitivity.
Validate on external cohorts: Test the models on independent patient populations to assess generalizability and real-world performance.
Some code used in this analysis is not visible in order to keep the deliverable clean for readability. All code can be found in the RMD file attached. Please direct any and all questions to Whitney Zhang, whitneyz@umich.edu.