FOOTBALL PLAYER POSITION CLASSIFICATION - PYTHON
Have you ever wanted to tell a player’s position just from key stats like pass completion, tackles won, and expected goals? Probably not, but I have! Of course, most regular football watchers could tell you what position a given player plays, so what use would this model be?
It could be useful for classifying youth players coming through academy systems, as many play one position at academy level and then a different one in their senior career. A player may, for example, be a centre-back at youth level but profile more like a central midfielder based on their stats. In addition, many older players change position as their career goes on, so if coaches can see that a winger is beginning to profile like a full-back, they can shift the player to a new role more easily. Being able to identify this early could be useful for clubs and scouts alike.
(N.B. for full code scroll to the bottom of the page, only the key steps will be covered in the main text.)
Data Prepping & Cleaning
The data was obtained from https://www.kaggle.com/datasets/diegobartoli/top5legauesplayers-statsandphys and covers the 2018/19 season across the top 5 European leagues. The first step was to load the data:
#load the data and combine the separate tables
import pandas as pd
premierLeague2019 = pd.read_csv('2019PremierLeague.csv')
Bundesliga2019 = pd.read_csv('2019Bundesliga.csv')
LaLiga2019 = pd.read_csv('2019LaLiga.csv')
Ligue12019 = pd.read_csv('2019Ligue1.csv')
serieA2019 = pd.read_csv('2019SerieA.csv')
allLeagues = pd.concat(
    (premierLeague2019, Bundesliga2019, LaLiga2019, Ligue12019, serieA2019),
    ignore_index=True) #reset the index so the concatenated rows are uniquely numbered
The columns present were:
_id/$oid, name, age, nationality, height, weight, team, position
general_stats: games, time, red_cards, yellow_cards
offensive_stats: goals, xG, assists, xA, shots, key_passes, npg, npxG, xGChain, xGBuildup
defensive_stats: Tkl, TklW, Past, Press, Succ, Blocks, Int
passing_stats: Cmp, Cmp%, 1/3, PPA, CrsPA, Prog
The next step was to check for any nulls, but none were present:
print(len(allLeagues)) #check length before dropping any values
print(allLeagues.isnull().sum()) #checking the data for nulls
allLeagues = allLeagues.dropna()
print(len(allLeagues)) #no nulls found so none dropped, length did not change
Columns that would not have any predictive bearing on player position were dropped:
print(list(allLeagues.columns.values)) #view all the columns present
#columns that do not determine position, like name, nationality, and team, are dropped
allLeagues = allLeagues.drop(columns = ['name', 'nationality', 'team'])
The variable we want to predict is ‘position’, so we should see how many players of each position are actually present:
import matplotlib.pyplot as plt
positions = allLeagues.position.value_counts()
plt.subplots(figsize = (10,10))
positions.plot(kind = 'bar')
plt.title("Top 5 Leagues Position Distribution")
plt.xlabel("Position")
plt.ylabel("Number of Players")
plt.show()
Next, we look for any duplicate entries:
#the current player id column name contains '$', so we rename it for easier handling
allLeagues = allLeagues.rename(columns= {'_id/$oid' : 'playerID'})
print(allLeagues.duplicated('playerID').sum()) # there are no duplicate players
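No duplicates were found here; had any been present, a one-liner (shown only as a hypothetical, since none exist in this dataset) would remove them:
#hypothetical: keep only the first row for each playerID
#allLeagues = allLeagues.drop_duplicates(subset = 'playerID')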
Next we want to assess the distribution of number of minutes played:
#distribution of the number of minutes available
plt.hist(allLeagues['general_stats/time'], bins = 10)
plt.title("Player Minutes Played")
plt.xlabel("Number Of Minutes")
plt.ylabel("Number of Players")
plt.show() #normally distributed
#rename to make columns easier to work with as the original column names contain '/'
allLeagues = allLeagues.rename(columns= {'general_stats/time' : 'minsPlayed'})
allLeagues = allLeagues.rename(columns= {'general_stats/games' : 'apps'})
However, some players have made too few appearances to be considered for analysis, as a small sample size may skew their statistics. So we only analyse players who appeared in more than half the games:
regularPlayers = allLeagues[allLeagues['apps'] > 20].copy() #only players with a significant number of appearances; .copy() avoids later writes hitting a view of allLeagues
allList = list(regularPlayers.columns.values)
print("All list =", allList)
Many columns give a season total rather than a per-90-minutes figure, which makes players hard to compare: whoever played the most minutes looks best across the board. So, the next step is to convert these stats to a per-90-minutes basis:
#percentage stats are not adjusted as they are already an average
nonAdjust = ['playerID', 'age', 'height', 'weight', 'position', 'minsPlayed', 'apps', 'passing_stats/Cmp%']
needToAdjust = [x for x in allList if x not in nonAdjust]
#divide each season total by the number of 90-minute periods played to get a per-90 figure
for x in needToAdjust:
    regularPlayers[x] = regularPlayers[x]/(regularPlayers['minsPlayed']/90)
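As a quick sanity check on the conversion (the numbers here are invented purely for illustration): a player with 54 tackles over 2700 minutes has played 2700/90 = 30 “full matches”, so their per-90 rate is 54/30 = 1.8 tackles:
#hypothetical worked example of the per-90 conversion
totalTackles, minsPlayed = 54, 2700
print(totalTackles/(minsPlayed/90)) #1.8 tackles per 90 minutes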
Columns which I know through domain knowledge will not be good predictors of position are also dropped:
#remove factors which I don't think will have an impact on player classification
regularPlayers.drop(columns = ['playerID', 'weight', 'apps', 'minsPlayed', 'general_stats/red_cards'], inplace = True)
We need to assess the statistics for collinearity so that redundant ones can be dropped:
import numpy as np
import seaborn as sns
def correlationFinder(data):
    '''
    produces a correlation heat map for all the numeric variables, allowing us to identify multicollinearity
    '''
    corrmat = data.select_dtypes('number').corr() #correlation is only defined for numeric columns, so 'position' is excluded
    plt.subplots(figsize=(10,10))
    mask = np.triu(np.ones_like(corrmat, dtype= bool)) #hide the mirrored upper triangle
    cmap = sns.diverging_palette(230, 20, as_cmap=True)
    sns.heatmap(corrmat, annot=True, mask=mask, cmap=cmap)
    locs, labels = plt.xticks()
    plt.setp(labels, rotation = 30)
    plt.show()
correlationFinder(regularPlayers)
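Rather than reading the strong pairs off the heatmap by eye, a small helper (a sketch I am adding here, not part of the original pipeline) can list every pair above a chosen threshold:
def highCorrelationPairs(data, threshold = 0.7):
    '''list each pair of numeric columns whose absolute correlation exceeds the threshold'''
    corrmat = data.select_dtypes('number').corr().abs()
    #keep only the strict upper triangle so each pair appears once
    upper = corrmat.where(np.triu(np.ones(corrmat.shape, dtype = bool), k = 1))
    pairs = upper.stack() #stacking drops the NaNs left by the mask
    return pairs[pairs > threshold].sort_values(ascending = False)

print(highCorrelationPairs(regularPlayers))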
Several pairs of columns had correlations above 0.7, and in each case only one column of the pair was kept. This led to the following columns being dropped:
#correlated groups: xG vs goals vs npg vs npxG vs shots | xA vs assists vs key passes | key passes vs PPA | Tkl vs TklW vs Past vs Blocks vs Int | Press vs Succ | Prog passes vs final-third passes
regularPlayers = regularPlayers.drop(columns = ['offensive_stats/goals', 'offensive_stats/npg', 'offensive_stats/npxG', 'offensive_stats/shots', 'offensive_stats/assists', 'offensive_stats/key_passes','passing_stats/PPA', 'defensive_stats/TklW',
'defensive_stats/Past', 'defensive_stats/Blocks', 'defensive_stats/Int','defensive_stats/Succ', 'passing_stats/1/3'])
We need to remove our target variable ‘position’ from the dataframe so that only the remaining features can be used for prediction:
X = regularPlayers.drop(columns = ['position'])
y = regularPlayers.position
Finally, the data is split into training and testing sets at a ratio of 75:25:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)
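Since the position classes are imbalanced, a stratified split (an alternative I am suggesting here, not used in the original) would preserve the class ratios in both sets:
#alternative: stratify so train and test keep the same position proportions
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10, stratify=y)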
Model Selection
Confusion matrices will be needed for all our classifiers, so a function is made to plot them:
def confusionMatrixPlotter(dataSource, word):
    '''
    turns the confusion matrix into an easier-to-analyse visualisation
    '''
    ax = sns.heatmap(dataSource, annot = True, cmap= 'Blues')
    ax.set_title('Seaborn Confusion Matrix for ' + word)
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values ')
    ax.xaxis.set_ticklabels(['DF', 'FW', 'GK', 'MF'])
    ax.yaxis.set_ticklabels(['DF', 'FW', 'GK', 'MF'])
    plt.show()
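One caveat: the hard-coded tick labels rely on confusion_matrix sorting the classes alphabetically (DF, FW, GK, MF). Passing the labels explicitly (a defensive tweak I am suggesting, not in the original) makes that assumption safe:
#fix the row/column order so it always matches the tick labels
#cm = confusion_matrix(y_test, predictions, labels = ['DF', 'FW', 'GK', 'MF'])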
Random Forest Classifier:
The first classification model selected is the random forest.
#Random forest classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
clf = RandomForestClassifier(n_estimators=100, oob_score=True, max_depth=3, random_state=10)
#fit it to the training data
clf.fit(X_train, y_train)
testPredictionRF = clf.predict(X_test)
print(y_test.value_counts())
RFConfusion = confusion_matrix(y_test, testPredictionRF) #create the confusion matrix
print(classification_report(y_test, testPredictionRF))
confusionMatrixPlotter(RFConfusion, 'Random Forest')
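A useful by-product of the random forest (a sketch I am adding, not in the original write-up) is its feature importances, which show which stats drive the classification; the out-of-bag score is also available since oob_score=True was set:
#rank the stats by how much they contribute to the forest's splits
importances = pd.Series(clf.feature_importances_, index = X.columns)
print(importances.sort_values(ascending = False).head(10)) #ten most informative stats
print("Out-of-bag score:", round(clf.oob_score_, 3))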
SUPPORT VECTOR MACHINE:
#Support Vector Machine
from sklearn import svm
clf2 = svm.SVC(gamma = 'auto', decision_function_shape= "ovo", kernel= 'linear')
clf2.fit(X_train, y_train)
testPredictionSVM = clf2.predict(X_test)
SVMConfusion = confusion_matrix(y_test, testPredictionSVM)
confusionMatrixPlotter(SVMConfusion, 'SVM')
print("SVM score: ")
print(classification_report(y_test, testPredictionSVM))
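SVMs are sensitive to feature scales, and the per-90 stats here live on very different ranges (e.g. xG versus pass completions). A scaled variant (a sketch under that assumption, not part of the original pipeline) is easy to try:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
#standardise each feature before fitting the same linear SVM
clf2scaled = make_pipeline(StandardScaler(), svm.SVC(decision_function_shape = "ovo", kernel = 'linear'))
clf2scaled.fit(X_train, y_train)
print(classification_report(y_test, clf2scaled.predict(X_test)))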
NEURAL NETWORK:
To find the best parameters, a grid search over param_grid is performed:
#NEURAL NETWORK
import random
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
random.seed(10)
param_grid = [
    {
        'activation' : ['identity', 'logistic', 'tanh', 'relu'],
        'solver' : ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [(n,) for n in range(1, 22)] #single hidden layers of 1 to 21 units
    }
]
#the network initially failed to converge, so a grid search was run (commented out once the best parameters were found)
#clf3 = GridSearchCV(MLPClassifier(), param_grid, cv=3, scoring='accuracy')
#clf3.fit(X_train, y_train)
#print("Best parameters set found on development set:")
#print(clf3.best_params_)
'''{'activation': 'identity', 'hidden_layer_sizes': (3,), 'solver': 'lbfgs'}'''
clf3 = MLPClassifier(solver = "lbfgs", alpha = 1e-5, activation= "identity", hidden_layer_sizes = (3,), random_state=10, learning_rate="adaptive", early_stopping=True, max_iter=1000) #note: learning_rate (sgd only) and early_stopping (sgd/adam only) are inert with the lbfgs solver
clf3.fit(X_train, y_train)
testPredictionNN = clf3.predict(X_test)
NNConfusionMatrix = confusion_matrix(y_test, testPredictionNN)
confusionMatrixPlotter(NNConfusionMatrix, 'Neural Network')
print("NEURAL NETWORK SCORE: \n", classification_report(y_test, testPredictionNN))
LOGISTIC REGRESSION:
#Logistic regression
from sklearn.linear_model import LogisticRegression
clf4 = LogisticRegression(solver='liblinear', random_state = 0)
clf4.fit(X_train, y_train)
testPredictionLR = clf4.predict(X_test)
LRConfusionMatrix = confusion_matrix(y_test, testPredictionLR)
print(LRConfusionMatrix)
confusionMatrixPlotter(LRConfusionMatrix, 'LR')
print("LOGISTIC REGRESSION SCORE : \n", classification_report(y_test, testPredictionLR))
RESULTS
RANDOM FOREST:
SUPPORT VECTOR MACHINE:
NEURAL NETWORK:
LOGISTIC REGRESSION:
RESULTS COMPARISON
Predictably, all the models identified goalkeepers most easily. The real challenge lay in separating defenders from midfielders, and midfielders from forwards. Ultimately, the SVM was the most effective on every metric, with the most correct classifications and the highest precision, recall, and F1-score.
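For a quick side-by-side check of the four fitted models on the same test set (a convenience sketch; this reports accuracy only, whereas the classification reports above also give precision, recall, and F1):
#compare overall test accuracy of each fitted classifier
for name, model in [('Random Forest', clf), ('SVM', clf2), ('Neural Network', clf3), ('Logistic Regression', clf4)]:
    print(name, round(model.score(X_test, y_test), 3))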