A Comparison of Binary Classification Methods Using SVM, KNN and Logistic Regression
The purpose of this project is to assess the accuracy of three different classification methods when performing binary classification. The dataset used was the ‘Pima Indians Diabetes Database’ (available on Kaggle), with the goal being to determine whether a person had diabetes based on the provided data.
The Python libraries needed for analysis were ‘NumPy’, ‘pandas’, ‘matplotlib’, ‘seaborn’, and ‘scikit-learn’. The three selected classification methods were K-Nearest Neighbours (KNN), Logistic Regression, and Support Vector Machine (SVM). Each of these methods had different parameters associated with it, and the exact procedure used to find them is included in the Python file ‘ParameterSelectionDiabetes.py’ provided with this report. In short, the parameters obtained are as follows:
The report of my findings will be provided first, followed by the code I used during this testing process.
SVM
A grid search was performed to find the optimal parameters to pass to the SVM. The grid search tried every combination of ‘C’, ‘gamma’ and ‘kernel’ until the combination with the highest accuracy was selected. The options entered into the grid search were: paramGrid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001], 'kernel': ['rbf', 'linear']}. The best parameters found by this procedure were: SVC(C=1000, gamma=1, kernel='linear').
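The search described above can be sketched with scikit-learn's GridSearchCV. The data below is a synthetic stand-in (an assumption for illustration; the report used the cleaned Pima dataset), so the winning combination will differ from the one reported:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned Pima data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

paramGrid = {'C': [0.1, 1, 10, 100, 1000],
             'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
             'kernel': ['rbf', 'linear']}

# Try every combination and keep the one with the highest cross-validated accuracy
grid = GridSearchCV(SVC(), paramGrid, scoring='accuracy', cv=3)
grid.fit(XTrain, yTrain)
print(grid.best_params_)
```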
LOGISTIC REGRESSION
To search for the optimal parameters for the logistic regression, three lists were initially created:
1. solverList = ['newton-cg', 'lbfgs', 'liblinear']
2. cList = [.001, .01, .1, 1, 10, 100, 1000]
3. maxIterList = [300, 400]
Logistic regression was then run on every possible combination of the three lists until the combination with the highest accuracy was returned. As a result, the parameters used for the logistic regression were: LogisticRegression(solver='lbfgs', max_iter=300, C=1).
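The exhaustive loop described above can be sketched as follows, again on synthetic stand-in data rather than the real Pima dataset:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

solverList = ['newton-cg', 'lbfgs', 'liblinear']
cList = [.001, .01, .1, 1, 10, 100, 1000]
maxIterList = [300, 400]

# Synthetic stand-in for the cleaned Pima data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

# Try every solver / C / max_iter combination and keep the most accurate
bestAcc, bestParams = 0.0, None
for solver, C, maxIter in product(solverList, cList, maxIterList):
    model = LogisticRegression(solver=solver, C=C, max_iter=maxIter)
    acc = model.fit(XTrain, yTrain).score(XTest, yTest)
    if acc > bestAcc:
        bestAcc, bestParams = acc, (solver, C, maxIter)
print(bestAcc, bestParams)
```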
KNN
To obtain parameters for the KNN, a list of all possible metrics ('euclidean', 'manhattan', 'minkowski') and a list of all ‘K’ values from 1 to 41 were created. All possible combinations of these two lists were iterated over, and the option with the highest accuracy was returned. The different accuracies are shown in fig. 1.
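The metric/K sweep can be sketched as below, on synthetic stand-in data; the plotting of the accuracies (as in fig. 1) is omitted:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned Pima data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

# Score every (metric, K) combination on the held-out test set
accuracies = {}
for metric in ('euclidean', 'manhattan', 'minkowski'):
    for k in range(1, 42):  # K values 1-41
        knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
        accuracies[(metric, k)] = knn.fit(XTrain, yTrain).score(XTest, yTest)

bestMetric, bestK = max(accuracies, key=accuracies.get)
print(bestMetric, bestK, accuracies[(bestMetric, bestK)])
```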
Through examination of the graphs, K=23 provided the highest accuracy. The Euclidean and Minkowski metrics performed similarly, but Euclidean was chosen as it was the more familiar metric. The final parameters were:
KNeighborsClassifier(n_neighbors=23, metric='euclidean')
TRAINING AND TESTING PROCESS
Initially, there were 768 observations, 268 of them positive for diabetes and 500 negative. After loading and inspecting the data, the next step was cleaning the dataset. Some features, such as BMI, blood pressure and Glucose, had only a few observations equal to zero, so these rows could be dropped without losing too much data. However, the features ‘SkinThickness’ and ‘Insulin’ contained too many zero values to simply drop, so two methods were attempted. Initially, all zero values for these two features were replaced by the mean of the feature; however, this reduced prediction accuracy, as the volume of zeros affected the mean. Instead, all zeros for each feature were replaced by its median value. With the now-clean data, the dataset was shuffled, then 75% of it was used for training while 25% was used as testing data. No cross-validation was used due to the small size of the dataset.
CODE:
The process for initially cleaning the data was the same for all three models, with the main differences coming after the data had been cleaned.
CLEANING CODE:
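A minimal sketch of the cleaning steps described earlier, using a small made-up frame in place of the real CSV (the column names match the Kaggle dataset; taking the median over only the non-zero values is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small made-up frame standing in for the Pima CSV
df = pd.DataFrame({
    'Glucose':       [148, 85, 0, 89, 137],
    'BloodPressure': [72, 66, 64, 0, 40],
    'BMI':           [33.6, 26.6, 23.3, 0.0, 43.1],
    'SkinThickness': [35.0, 0.0, 0.0, 23.0, 35.0],
    'Insulin':       [0.0, 0.0, 0.0, 94.0, 168.0],
    'Outcome':       [1, 0, 1, 0, 1],
})

# Drop the few rows where Glucose, BloodPressure or BMI is zero...
df = df[(df['Glucose'] != 0) & (df['BloodPressure'] != 0) & (df['BMI'] != 0)]

# ...and replace the many zeros in SkinThickness and Insulin with the
# median of the non-zero values (using the mean here hurt accuracy)
for col in ('SkinThickness', 'Insulin'):
    df[col] = df[col].replace(0.0, df.loc[df[col] != 0, col].median())

# Shuffled 75/25 train/test split
XTrain, XTest, yTrain, yTest = train_test_split(
    df.drop(columns='Outcome'), df['Outcome'],
    test_size=0.25, shuffle=True, random_state=42)
```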
KNN:
The below code is added to the above script:
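A self-contained sketch of this step, using the parameters found during parameter selection and synthetic stand-in data in place of the cleaned Pima split:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the cleaned, split Pima data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit with the parameters found during parameter selection
knn = KNeighborsClassifier(n_neighbors=23, metric='euclidean')
knn.fit(XTrain, yTrain)
pred = knn.predict(XTest)
print(confusion_matrix(yTest, pred))
print(classification_report(yTest, pred))
```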
Logistic Regression:
The below code is added to the baseline script:
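A self-contained sketch of this step, using the parameters found during parameter selection and synthetic stand-in data in place of the cleaned Pima split:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned, split Pima data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit with the parameters found during parameter selection
logReg = LogisticRegression(solver='lbfgs', max_iter=300, C=1)
logReg.fit(XTrain, yTrain)
pred = logReg.predict(XTest)
print(confusion_matrix(yTest, pred))
print(classification_report(yTest, pred))
```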
Support Vector Machine (SVM):
The below code is added to the baseline script:
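A self-contained sketch of this step, using the parameters found by the grid search and synthetic stand-in data in place of the cleaned Pima split:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned, split Pima data
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit with the grid-search parameters (gamma is ignored by the linear kernel)
svm = SVC(C=1000, gamma=1, kernel='linear')
svm.fit(XTrain, yTrain)
pred = svm.predict(XTest)
print(confusion_matrix(yTest, pred))
print(classification_report(yTest, pred))
```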
Parameter Selection:
Before implementing the above code, several parameter combinations had to be tested in order to establish the best parameters for our models.
The script to do this is as follows:
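A compact sketch of the three searches described in the report, on synthetic stand-in data (the real ‘ParameterSelectionDiabetes.py’ ran on the cleaned Pima dataset and also plotted the KNN accuracies for fig. 1):

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned Pima data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
XTrain, XTest, yTrain, yTest = train_test_split(X, y, test_size=0.25, random_state=0)

# SVM: exhaustive grid search over C, gamma and kernel
paramGrid = {'C': [0.1, 1, 10, 100, 1000],
             'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
             'kernel': ['rbf', 'linear']}
svmGrid = GridSearchCV(SVC(), paramGrid, scoring='accuracy', cv=3).fit(XTrain, yTrain)

# Logistic Regression: every solver / C / max_iter combination
bestLogReg = max(
    (LogisticRegression(solver=s, C=c, max_iter=m).fit(XTrain, yTrain)
     .score(XTest, yTest), (s, c, m))
    for s, c, m in product(['newton-cg', 'lbfgs', 'liblinear'],
                           [.001, .01, .1, 1, 10, 100, 1000], [300, 400]))

# KNN: every metric and every K from 1 to 41
bestKnn = max(
    (KNeighborsClassifier(n_neighbors=k, metric=m).fit(XTrain, yTrain)
     .score(XTest, yTest), (m, k))
    for m in ['euclidean', 'manhattan', 'minkowski'] for k in range(1, 42))

print(svmGrid.best_params_, bestLogReg, bestKnn)
```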
Evaluation of models:
The confusion matrices for each of our classifiers are as follows:
An important thing to note is the ratio of 0:1 classes in the data. For the overall dataset, the ratio was 475:249 for zeros to ones respectively; for the test data, it was 120:61. Approximately 2/3 of the test set is class 0, so if we simply wrote a function that always returned class 0 we would have an accuracy of about 0.67. This partially explains why class 0’s Precision, Recall, and F-score outperform class 1’s in all 3 of the models. This ratio of roughly 2:1 could be considered a slight imbalance, and the recall for class 1 was particularly poor for all 3 models, which may well be due to that imbalance.
If, as above, we wrote a program that always guessed class 0, we would get an F-score of 0.8 for class 0 and 0 for class 1, which shows that our models outperform this most basic baseline.
If we went further and wrote a program that, based on the composition of the data set, randomly guessed class 0 66% of the time, we would expect an F-score of 0.67 for class 0 and 0.33 for class 1, which allows us to infer that the imbalance is not skewing our models too badly.
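The always-guess-class-0 baseline can be checked directly; the class counts below are the test-split counts stated earlier (120 zeros, 61 ones):

```python
from sklearn.metrics import f1_score

# Test-set composition from the report: 120 zeros, 61 ones
yTrue = [0] * 120 + [1] * 61

# Baseline: always guess class 0
yPred = [0] * len(yTrue)

f0 = f1_score(yTrue, yPred, pos_label=0)                   # ~0.80 for class 0
fOne = f1_score(yTrue, yPred, pos_label=1, zero_division=0)  # 0: class 1 never predicted
print(round(f0, 2), fOne)
```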
In terms of accuracy, the KNN model (85%) comes first, followed by Logistic Regression (81%) and SVM (80%). However, Logistic Regression and SVM are so close that it is conceivable that a different random state on the split might cause these two to swap places in the ranking. KNN as the clear winner makes sense after looking at the correlation heatmap (Figure 5).
The inclusion of weakly correlated features such as ‘SkinThickness’ and ‘Insulin’ suggests one reason why Logistic Regression underperformed, as it struggles with such features. The three features most strongly correlated with the outcome are Glucose, BMI, and Pregnancies. Figure 6 plots Glucose against BMI, as well as a 3D scatter of all three, with each point colour-coded by Outcome.
This shows that although there is certainly a trend, there is enough noise to cause an issue in the linear separation of the data, which would then impact the performance of the SVM. Another issue with the SVM is the sheer number of possible hyperparameters, which meant that, although an extensive grid search was performed, we lacked the computational power to try them all while simultaneously choosing the appropriate kernel function. It is also worth noting that in ‘ParameterSelectionDiabetes.py’ the grid search including the linear kernel was both time- and compute-intensive.