Implementing K-Means and K-Medians From Scratch in Python
K-Means and K-Medians are two of the most popular clustering algorithms available. Both group inputs based on similarity, which is useful in fields such as customer segmentation, recommendation systems, fraud detection, and much more. The goal of this project was to implement these algorithms from scratch using only NumPy and pandas. The inputs to be clustered were words assigned to four categories: “animals”, “countries”, “fruits”, and “veggies”. Each word is followed by 300 features describing its meaning (a word embedding). By implementing both algorithms using Manhattan and Euclidean distance, and testing a range of values of k, we hope to obtain clusters of words that are correctly assigned to each category.
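The two algorithms share the same assign-then-update loop and differ only in the centroid update (mean versus coordinate-wise median). Below is a minimal sketch of that loop, assuming the embeddings have already been loaded into a NumPy array X of shape (n_words, 300); the names distances, kcluster, use_median, and n_iters are illustrative, not the project's actual identifiers.

```python
import numpy as np

def distances(X, centroids, metric="euclidean"):
    """Pairwise distances between each row of X and each centroid."""
    diff = X[:, None, :] - centroids[None, :, :]     # shape (n, k, d)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    return np.abs(diff).sum(axis=2)                  # Manhattan

def kcluster(X, k, metric="euclidean", use_median=False, n_iters=100, seed=10):
    """K-Means (mean update) or K-Medians (median update) clustering."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        labels = distances(X, centroids, metric).argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):                          # skip empty clusters
                new_centroids[j] = (np.median(members, axis=0) if use_median
                                    else members.mean(axis=0))
        if np.allclose(new_centroids, centroids):     # converged
            break
        centroids = new_centroids
    return labels, centroids
```

Manhattan distance is the natural pairing for K-Medians, since the coordinate-wise median minimizes the sum of absolute deviations within a cluster, just as the mean minimizes the sum of squared Euclidean deviations for K-Means.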
Now, having run both K-Means and K-Medians for various values of k, we can plot precision, recall, F-score, and so on to assess which value of k gives the most accurate clustering.
To assess which of the algorithms produced the best result, we first need to know what precision, recall, and F-score are. Precision is the fraction of instances classified as positive that actually are positive. Precision matters more when, for example, deciding who to show a targeted advert to: one wants to be sure that the person shown the advert would be interested in it, since false positives are a wasted cost. To clarify, this is:
Precision = True Positives (TP) / (TP + False Positives (FP))
Recall, on the other hand, is the fraction of actual positives that were correctly classified. Recall matters more when the consequences of a false negative are graver than those of a false positive. An easy example is cancer diagnosis: it is better to detect all true positives and mistakenly diagnose a few healthy patients as having cancer than to avoid every false alarm but fail to detect a few patients who actually had cancer. The formula is:
Recall = TP / (TP + False Negatives (FN))
The F-score considers both measures by taking their harmonic mean, and as a result balances precision and recall. The formula for the F-score is:
F-score = (2 * Precision * Recall) / (Precision + Recall)
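These three formulas translate directly into code. A minimal sketch, assuming the TP, FP, and FN counts have already been tallied (the example numbers are purely illustrative):

```python
def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Example: 90 true positives, 10 false positives, 30 false negatives
p, r = precision(90, 10), recall(90, 30)   # 0.9, 0.75
print(f_score(p, r))                       # ~0.818
```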
When running my algorithms, each run produced different graphs, so for the sake of reproducibility the algorithms were initialized with a seed of ten (10), resulting in the graphs in Fig. 1, Fig. 2, Fig. 3, and Fig. 4. Generally, all four graphs followed a similar trend. Recall always began at 1 at k = 1, because the single cluster is assigned the truth label of countries, the category with the highest count; all 161 countries fall into that cluster, giving a recall of 1. Precision was always low at k = 1, since putting every data point in the same cluster naturally produces many false positives.
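The k-sweep behind those figures can be sketched as below, assuming X holds the embeddings, true_labels the category labels, kcluster is the clustering sketch above, and bcubed is a B-Cubed helper (sketched after the next paragraph); matplotlib is assumed for the plots, though it is not part of the clustering itself.

```python
import matplotlib.pyplot as plt

ks = range(1, 10)
scores = {"precision": [], "recall": [], "f_score": []}
for k in ks:
    labels, _ = kcluster(X, k, metric="euclidean", seed=10)  # fixed seed for reproducibility
    p, r, f = bcubed(labels, true_labels)
    scores["precision"].append(p)
    scores["recall"].append(r)
    scores["f_score"].append(f)

# One curve per measure, against k
for name, values in scores.items():
    plt.plot(list(ks), values, marker="o", label=name)
plt.xlabel("k")
plt.ylabel("score")
plt.legend()
plt.show()
```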
For all four algorithms, the B-Cubed measures tended to converge at k = 4, which makes sense given there were four true categories in the data. At that point precision peaks, as there are enough clusters for each true category to have its own, while recall falls from 1 as a few misclassifications appear but remains fairly high. Table 5 (below) shows the optimal measures were found at k = 4 using the K-Medians algorithm on unnormalized data, with precision, recall, and F-score all above 90%. The reason K-Medians may have provided a better score is that K-Means is sensitive to outliers: a single outlier can skew the mean, and therefore the centroid, leading to worse clustering. The median is more robust against outliers and so may provide better centroids throughout the iterations.
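For reference, here is a minimal sketch of one standard per-item B-Cubed formulation, assuming labels are the predicted cluster assignments and truth the true category labels; it may differ in detail from the majority-label assignment described above, but shows the same averaging of per-item precision and recall.

```python
import numpy as np

def bcubed(labels, truth):
    """Average per-item B-Cubed precision, recall, and F-score."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    precisions, recalls = [], []
    for i in range(len(labels)):
        same_cluster = labels == labels[i]
        same_class = truth == truth[i]
        correct = np.sum(same_cluster & same_class)  # items sharing both cluster and category
        precisions.append(correct / same_cluster.sum())
        recalls.append(correct / same_class.sum())
    p, r = np.mean(precisions), np.mean(recalls)
    return p, r, 2 * p * r / (p + r)
```

Under this formulation, recall at k = 1 is exactly 1 for every item, since every member of each item's true category shares its (single) cluster, matching the trend seen in the graphs.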