FOOTBALL PLAYER POSITION CLASSIFICATION - PYTHON

Have you ever wanted to be able to tell a player’s position just from key stats like pass completion, tackles won, and expected goals? Probably not, but I have! Then again, most regular football watchers could tell you what position a given player plays, so what would the use of this model be?

It could be useful for classifying youth players coming through academy systems, as many play one position at academy level and a different one in their senior career. A player may, for example, be a centre-back at youth level but profile more like a central midfielder based on their stats. In addition, many older players alter position as their career goes on, so if coaches can see that a winger is beginning to profile like a full-back, they can more easily shift that player to a new role. Being able to identify this early could be useful for clubs and scouts alike.

(N.B. for the full code, scroll to the bottom of the page; only the key steps are covered in the main text.)

Data Prepping & Cleaning

The data was obtained from https://www.kaggle.com/datasets/diegobartoli/top5legauesplayers-statsandphys and covers the 2018/19 season across the top five leagues. The first step was to load it:

#imports used in these snippets (the sklearn imports appear in the full code at the bottom)

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import random

#load the data and combine the separate tables

premierLeague2019 = pd.read_csv('2019PremierLeague.csv')

Bundesliga2019 = pd.read_csv('2019Bundesliga.csv')

LaLiga2019 = pd.read_csv('2019LaLiga.csv')

Ligue12019 = pd.read_csv('2019Ligue1.csv')

serieA2019 = pd.read_csv('2019SerieA.csv')

allLeagues = pd.concat((premierLeague2019, Bundesliga2019, LaLiga2019, Ligue12019, serieA2019))

 

The columns present were:

['_id/$oid', 'name', 'age', 'nationality', 'height', 'weight', 'team', 'position', 'general_stats/games', 'general_stats/time', 'general_stats/red_cards', 'general_stats/yellow_cards', 'offensive_stats/goals', 'offensive_stats/xG', 'offensive_stats/assists', 'offensive_stats/xA', 'offensive_stats/shots', 'offensive_stats/key_passes', 'offensive_stats/npg', 'offensive_stats/npxG', 'offensive_stats/xGChain', 'offensive_stats/xGBuildup', 'defensive_stats/Tkl', 'defensive_stats/TklW', 'defensive_stats/Past', 'defensive_stats/Press', 'defensive_stats/Succ', 'defensive_stats/Blocks', 'defensive_stats/Int', 'passing_stats/Cmp', 'passing_stats/Cmp%', 'passing_stats/1/3', 'passing_stats/PPA', 'passing_stats/CrsPA', 'passing_stats/Prog']

 

The next step was to check for any nulls, but none were present:

print(len(allLeagues)) #check length before dropping any values

print(allLeagues.isnull().sum()) #checking the data for nulls

allLeagues = allLeagues.dropna()

print(len(allLeagues)) #no nulls found so none dropped, length did not change

 

Columns that would not have any predictive bearing on player position were dropped:

print(list(allLeagues.columns.values)) #view all the columns present

#columns that don't determine position, like name, nationality, and team, are dropped (the player ID is kept for now to check for duplicates)

allLeagues = allLeagues.drop(columns = ['name', 'nationality', 'team'])

 

The variable we want to predict is ‘position’, so we should see how many players of each position are actually present:

positions = allLeagues.position.value_counts()

plt.subplots(figsize = (10,10))

positions.plot(kind = 'bar')

plt.title("Top 5 Leagues Position Distribution")

plt.xlabel("Position")

plt.ylabel("Number of Players")

plt.show()

Look for any duplicate entries:

#the current player ID column name contains '$', so we rename it for easier handling

allLeagues = allLeagues.rename(columns= {'_id/$oid' : 'playerID'})

print(allLeagues.duplicated('playerID').sum()) # there are no duplicate players

 

Next, we want to assess the distribution of minutes played:

#distribution of the number of minutes available

plt.hist(allLeagues['general_stats/time'], bins = 10)

plt.title("Player Minutes Played")

plt.xlabel("Number Of Minutes")

plt.ylabel("Number of Players")

plt.show() #normally distributed

#rename to make columns easier to work with, as the original column names contain '/'

allLeagues = allLeagues.rename(columns= {'general_stats/time' : 'minsPlayed'})

allLeagues = allLeagues.rename(columns= {'general_stats/games' : 'apps'})

 

But some players have made too few appearances to be considered for analysis, as a small sample size may skew their statistics. So we will only analyse players who appeared in over half the games (more than 20 appearances):

regularPlayers = allLeagues[allLeagues['apps'] > 20].copy() #only want players with a significant number of appearances; .copy() avoids SettingWithCopyWarning when columns are modified later

allList = list(regularPlayers.columns.values)

print("All list =", allList)

 

The stats in many columns are season totals rather than per-90-minute figures, which means the number of minutes played would greatly skew comparisons between players. So, the next step is to adjust the stats to a per-90-minutes basis:

#we also do not want to adjust percentage stats as they are already an average

nonAdjust = ['playerID', 'age', 'height', 'weight', 'position', 'minsPlayed', 'apps', 'passing_stats/Cmp%']

needToAdjust = []

for x in allList:

    if x not in nonAdjust:

        needToAdjust.append(x)

 

#divide by the number of 90-minute periods played to get a per-90 average

for x in needToAdjust:

    regularPlayers[x] = regularPlayers[x]/(regularPlayers['minsPlayed']/90)
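As an aside, the same adjustment can be done in one vectorised step, avoiding the loop entirely. A minimal equivalent, assuming the same dataframe and column list as above:

#equivalent vectorised per-90 adjustment (same result as the loop)
regularPlayers[needToAdjust] = regularPlayers[needToAdjust].div(regularPlayers['minsPlayed']/90, axis=0)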

 

Columns which, based on domain knowledge, will not be good predictors of position are also dropped:

#remove factors which I don't think will have an impact on position classification

regularPlayers.drop(columns = ['playerID', 'weight', 'apps', 'minsPlayed', 'general_stats/red_cards'], inplace = True)

 

 

We need to assess the statistics for collinearity in order to drop any that are redundant:

def correlationFinder(data):

    '''

    produces a correlation heat map for all the variables, allowing us to identify multicollinearity

    '''

    corrmat = data.corr(numeric_only=True) #exclude the non-numeric 'position' column

    plt.subplots(figsize=(10,10))

    mask = np.triu(np.ones_like(corrmat, dtype= bool))

    cmap = sns.diverging_palette(230, 20, as_cmap=True)

    sns.heatmap(corrmat, annot=True, mask=mask, cmap=cmap)

    locs, labels = plt.xticks()

    plt.setp(labels, rotation = 30)

    plt.show()

correlationFinder(regularPlayers)
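Rather than reading every pair off the heatmap, the strongly correlated pairs can also be listed programmatically. A minimal sketch (run before the drop below; assumes a recent pandas for numeric_only):

#list feature pairs whose absolute correlation exceeds 0.7
corrAbs = regularPlayers.corr(numeric_only=True).abs()
upperTri = corrAbs.where(np.triu(np.ones(corrAbs.shape, dtype=bool), k=1)) #keep each pair once
for a in upperTri.index:
    for b in upperTri.columns:
        if upperTri.loc[a, b] > 0.7:
            print(a, b, round(upperTri.loc[a, b], 2))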

Several groups of columns had pairwise correlations above 0.7; in each case only one column from the group was kept. This led to the following columns being dropped:

#correlated groups (one kept from each): xG / goals / npg / shots / npxG | xA / assists / key passes | key passes / PPA | tackles / tackles won / dribbled past / blocks / interceptions | pressures / successful pressures | progressive passes / final-third passes

regularPlayers = regularPlayers.drop(columns = ['offensive_stats/goals', 'offensive_stats/npg', 'offensive_stats/npxG', 'offensive_stats/shots', 'offensive_stats/assists', 'offensive_stats/key_passes','passing_stats/PPA', 'defensive_stats/TklW',

'defensive_stats/Past', 'defensive_stats/Blocks', 'defensive_stats/Int','defensive_stats/Succ', 'passing_stats/1/3'])

 

We need to remove our target variable ‘position’ from the dataframe so that only the remaining features can be used for prediction:

X = regularPlayers.drop(columns = ['position'])

y = regularPlayers.position

 

Finally, the data is split into training and testing sets at a ratio of 75:25:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)
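With imbalanced classes (far fewer goalkeepers than outfield players), it may also be worth stratifying the split so each position appears in the same proportion in both sets; a minimal variant of the line above:

#optional: a stratified split keeps the position proportions equal in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10, stratify=y)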

Model Selection

Confusion matrices will be needed for all our classifiers, so a function is made to plot them:

def confusionMatrixPlotter(dataSource, word):

    '''

    turns the confusion matrix into an easier-to-analyse visualisation

    '''

    ax = sns.heatmap(dataSource, annot = True, cmap= 'Blues')

    ax.set_title('Seaborn Confusion Matrix for ' + word)

    ax.set_xlabel('\nPredicted Values')

    ax.set_ylabel('Actual Values')

    ax.xaxis.set_ticklabels(['DF', 'FW', 'GK', 'MF'])

    ax.yaxis.set_ticklabels(['DF', 'FW', 'GK', 'MF'])

    plt.show()
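One caveat: the tick labels above are hardcoded, which relies on confusion_matrix sorting the classes alphabetically (its default) and on exactly those four positions being present. A slightly more robust variant, sketched below under a name of my own, takes the label order as an argument:

#sketch: same plot, but with the class labels passed in (e.g. clf.classes_)
def confusionMatrixPlotterLabelled(dataSource, word, labels):
    ax = sns.heatmap(dataSource, annot=True, cmap='Blues')
    ax.set_title('Seaborn Confusion Matrix for ' + word)
    ax.set_xlabel('\nPredicted Values')
    ax.set_ylabel('Actual Values')
    ax.xaxis.set_ticklabels(labels)
    ax.yaxis.set_ticklabels(labels)
    plt.show()

It would be called as, for example, confusionMatrixPlotterLabelled(RFConfusion, 'Random Forest', clf.classes_) once a model has been fitted.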

 

 

Random Forest Classifier:

The first classification model used is the random forest.

#Random forest classifier

clf = RandomForestClassifier(n_estimators=100, oob_score=True, max_depth=3, random_state=10)

#fit it to the training data

clf.fit(X_train, y_train)

testPredictionRF = clf.predict(X_test)

print(y_test.value_counts())

RFConfusion = confusion_matrix(y_test, testPredictionRF) #create the confusion matrix

print(classification_report(y_test, testPredictionRF))

confusionMatrixPlotter(RFConfusion, 'Random Forest')
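Since oob_score=True was set, the forest also provides an out-of-bag accuracy estimate, and its feature importances hint at which stats drive the position predictions. A quick sketch:

#out-of-bag accuracy estimate and the ten most informative features
print("OOB score:", clf.oob_score_)
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))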

 

 

SUPPORT VECTOR MACHINE:

#Support Vector Machine

clf2 = svm.SVC(gamma = 'auto', decision_function_shape= "ovo", kernel= 'linear')

clf2.fit(X_train, y_train)

testPredictionSVM = clf2.predict(X_test)

SVMConfusion = confusion_matrix(y_test, testPredictionSVM)

confusionMatrixPlotter(SVMConfusion, 'SVM')

print("SVM score: ")

print(classification_report(y_test, testPredictionSVM))
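One thing worth noting: SVMs are sensitive to feature scale, and these per-90 stats sit on very different ranges (pass completion % versus xG, for instance). Standardising the features first may help; a minimal sketch using a pipeline:

#sketch: standardise the features before the SVM, since it is scale-sensitive
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
clf2scaled = make_pipeline(StandardScaler(), svm.SVC(gamma='auto', decision_function_shape='ovo', kernel='linear'))
clf2scaled.fit(X_train, y_train)
print(classification_report(y_test, clf2scaled.predict(X_test)))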

 

NEURAL NETWORK:

To find the best parameters, a grid search is run over param_grid:

#NEURAL NETWORK

from sklearn.model_selection import GridSearchCV

random.seed(10)

param_grid = [

        {

            'activation' : ['identity', 'logistic', 'tanh', 'relu'],

            'solver' : ['lbfgs', 'sgd', 'adam'],

            'hidden_layer_sizes': [(n,) for n in range(1, 22)] #single hidden layer with 1 to 21 neurons

        }

       ]

#Neural network - initially not converging, so a grid search was run (left commented out below after the best parameters were found)

#clf3 = GridSearchCV(MLPClassifier(), param_grid, cv=3,scoring='accuracy')

#clf3.fit(X_train, y_train)

#print("Best parameters set found on development set:")

#print(clf3.best_params_)

'''{'activation': 'identity', 'hidden_layer_sizes': (3,), 'solver': 'lbfgs'}'''

clf3 = MLPClassifier(solver="lbfgs", alpha=1e-5, activation="identity", hidden_layer_sizes=(3,), random_state=10, learning_rate="adaptive", early_stopping=True, max_iter=1000)

clf3.fit(X_train, y_train)

testPredictionNN = clf3.predict(X_test)

NNConfusionMatrix = confusion_matrix(y_test, testPredictionNN)

confusionMatrixPlotter(NNConfusionMatrix, 'Neural Network')

print("NEURAL NETWORK SCORE: \n", classification_report(y_test, testPredictionNN))

 

LOGISTIC REGRESSION:

#logistic regression

clf4 = LogisticRegression(solver='liblinear', random_state = 0)

clf4.fit(X_train, y_train)

testPredictionLR = clf4.predict(X_test)

LRConfusionMatrix = confusion_matrix(y_test, testPredictionLR)

print(LRConfusionMatrix)

confusionMatrixPlotter(LRConfusionMatrix, 'LR')

print("LOGISTIC REGRESSION SCORE : \n", classification_report(y_test, testPredictionLR))

RESULTS

RANDOM FOREST:

SUPPORT VECTOR MACHINE:

NEURAL NETWORK:

LOGISTIC REGRESSION:

RESULTS COMPARISON

Predictably, all models identified goalkeepers most easily; the real challenge was differentiating defenders from midfielders, and midfielders from forwards. Ultimately, the SVM was the most effective on every metric, with the most correct classifications and the highest precision, recall, and F1-score.
