Part 2 | Building Model

Exploratory data analysis, feature engineering and model selection

Author

Farid Musayev

Published

September 6, 2022

Package prerequisites

In this part, the following packages need to be imported:

Code

# Preprocessing
import ast
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.lines import Line2D
from mplsoccer import Pitch, VerticalPitch
import seaborn as sns
%config InlineBackend.figure_formats = ['svg']

# Data Splitting and Transformation
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Statistical Distributions
from scipy.stats import uniform, randint

# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier 
import statsmodels.formula.api as smf

# Model Evaluation
from sklearn.metrics import roc_auc_score, brier_score_loss

Exploratory Data Analysis

First, I read the preprocessed .csv data file and convert the columns that store Python literals as strings (freeze_frame, gk_loc, end_loc) back into Python objects.

Code

# read shots.csv file 
shots = pd.read_csv('.data/shots.csv')
shots.loc[:, 'freeze_frame'] = shots.loc[:, 'freeze_frame'].apply(ast.literal_eval)
shots.loc[:, 'gk_loc'] = shots.loc[:, 'gk_loc'].apply(ast.literal_eval)
shots.loc[:, 'end_loc'] = shots.loc[:, 'end_loc'].apply(ast.literal_eval)

Predicting a goal outcome in soccer is a binary classification task { 0 - No goal ; 1 - Goal }. The key point, however, is that an xG model does not deal with hard classes but makes a probabilistic prediction for a shot outcome. Unlike hard classes, probabilistic outputs describe the quality of a shot, since not all shots are equally likely to be scored. In other words, the xG value estimates how likely a given shot is to result in a goal.
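
As a small illustration of why probabilities are more informative than hard classes, the sketch below sums xG values for a handful of shots. The numbers are hypothetical and do not come from the dataset.

```python
# Hypothetical xG values for five shots by one team (illustrative numbers only)
xg_values = [0.05, 0.12, 0.76, 0.03, 0.31]

# Hard classes with a 0.5 threshold keep only "goal / no goal"
hard_classes = [1 if p >= 0.5 else 0 for p in xg_values]

# Summing the probabilities gives the expected number of goals from these shots
expected_goals = sum(xg_values)

print(hard_classes)
print(round(expected_goals, 2))
```

The hard classes credit only one shot, while the summed probabilities also account for the smaller chances.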

The code snippet below shows that a shot can have several outcome types, such as Saved, Off T (off target), Post, Blocked and Wayward (way off target). To train a binary probabilistic classifier, the labels must be defined as hard classes. Here, I rename the outcome column to outcome_type and map each of its values to 1 for a Goal or 0 for every other outcome.

Code

shots.loc[:, 'outcome'].unique()

array(['Saved', 'Off T', 'Post', 'Goal', 'Blocked', 'Wayward',
       'Saved Off Target', 'Saved to Post'], dtype=object)

Code

# rename existing 'outcome' column to 'outcome_type' 
shots = shots.rename(columns = {'outcome': 'outcome_type'})
# save binary results into a newly created 'outcome' column
shots.loc[:, 'outcome'] = shots.loc[:, 'outcome_type'].apply(lambda x: 1 if x == 'Goal' else 0)
shots.loc[:, 'outcome']

0        0
1        0
2        0
3        0
4        0
        ..
11038    0
11039    0
11040    0
11041    0
11042    0
Name: outcome, Length: 11043, dtype: int64

Now, let us analyze the available shot outcomes and their frequencies. The output above shows that the dataframe contains 11043 shots, and Figure 1 shows that 1165 of them resulted in a goal.

Code

# Data preparation
shot_types = pd.DataFrame(shots.loc[:, 'outcome_type'].value_counts()).reset_index()
shot_types.columns = ['outcome_type', 'n']
shot_types = shot_types.sort_values(by = 'n', ascending = True)

# Canvas
fig, ax = plt.subplots(figsize = (5, 5))

# grid specs
ax.grid(color = 'black', ls = '-.', lw = 0.25, axis = "x")

# Main plot
pl = ax.barh(shot_types["outcome_type"], shot_types["n"], height = 0.6, label = 'n',
       color = 'skyblue', edgecolor = "black", zorder = 2, alpha = 0.7) 

# Barplot labels
ax.bar_label(pl, padding = 5, label_type='edge')

# Labels and titles
ax.set_ylabel("Outcome Type", fontsize = 10)
ax.set_xlabel("# of instances", fontsize = 10)
ax.tick_params(axis = 'both', which = 'major', labelsize = 10)
ax.spines[['top', 'right']].set_visible(False)

plt.show()

Figure 1: Distribution of shot outcomes across female soccer competitions.

To sum up, the majority of shots are off target, blocked or saved. Since only 1165 out of 11043 shots (about 10.5%) are goals, the data is imbalanced. This will affect the choice of evaluation metric in the model selection phase.
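
To make the effect of the imbalance concrete, the sketch below uses the two counts quoted above to compute what a trivial model would achieve. It is computed on the full dataset, so the numbers are only a rough reference point and are not directly comparable to test-set scores.

```python
# Counts taken from the text above
n_shots = 11043
n_goals = 1165

base_rate = n_goals / n_shots

# Accuracy of a trivial model that always predicts "no goal"
always_no_goal_accuracy = 1 - base_rate

# Brier score of a constant prediction equal to the base rate:
# mean((p - y)^2) simplifies to base_rate * (1 - base_rate)
constant_brier = base_rate * (1 - base_rate)

print(f"Share of goals: {base_rate:.3f}")
print(f"Accuracy of always predicting no goal: {always_no_goal_accuracy:.3f}")
print(f"Brier score of a constant base-rate prediction: {constant_brier:.4f}")
```

A model that never predicts a goal is right for almost nine shots out of ten, which is why plain accuracy is a poor guide on this data.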

Next, we can visualize the location of all shots.

Code

# Canvas
pitch = Pitch(pitch_type = 'statsbomb')  
fig, ax = pitch.draw(figsize=(6, 8))

# Plot
sns.scatterplot(data = shots, x = 'x_start', y = 'y_start', ax = ax,
                hue = 'outcome', palette = 'seismic', edgecolor = 'black', alpha = 0.4)

# legend design
custom = [Line2D([], [], marker = '.', color = 'b', linestyle = 'None'),
          Line2D([], [], marker = '.', color = 'r', linestyle = 'None')]

plt.legend(custom, ['No Goal', 'Goal'], bbox_to_anchor = (0.05, 0.21))

# Arrow design
arrow_ax = fig.add_axes([0.28, 0.22, 0.35, 0.3]) # X, Y, width, height

arrow_ax.arrow(0.45, 0.1, 0.30, 0, head_width = 0.03, head_length = 0.03, linewidth = 4, 
           color = 'darkgrey', length_includes_head = True)
arrow_ax.set_ylim(0, 1)
arrow_ax.set_xlim(0, 1)
arrow_ax.set_axis_off()
arrow_ax.annotate('Direction of Play', xy = (0.42, 0.02), fontsize = 10)


plt.show()

Figure 2: Distribution of shots according to their coordinates.

The majority of shots are taken from the central block of the final third. In addition, there are several outlying shots taken from a central area and from flank positions. On the right flank, some of these outliers even resulted in a goal.

Feature Engineering

When analyzing a shot, one can often predict intuitively whether it will have a decent outcome. In practice, two major factors drive this intuition and can be quantified: the distance to the goal and the angle from which the shot is taken.

Code

# Pitch design
pitch = VerticalPitch(pitch_type = 'statsbomb',
                      half = True, 
                      pad_left = 0, pad_right = 0, pad_top = 0, pad_bottom = 0.15)  
# Canvas
fig, ax = pitch.draw(nrows = 1, ncols = 3, figsize = (8, 10))

# Data preparation
x = np.array([100, 120, 105, 120, 110, 120]).reshape(3, 2)
y = np.array([20, 40, 50, 40, 40, 40]).reshape(3, 2)
for i in range(3):
    # Plot
    pitch.goal_angle(x[i][0], y[i][0], ax = ax[i], alpha = 0.4, color = 'skyblue')
    pitch.lines(x[i][0], y[i][0], x[i][1], y[i][1], ax = ax[i], linewidth = 1)


plt.show()

Figure 3: Different distances to a goal (solid line) and angles (shaded area) for a given shot.

Distance and Angle Features

To demonstrate the impact of the distance and angle features on the probability that a shot results in a goal, I compute these features from the (x, y) coordinate of each shot and build a simple logistic regression that makes probabilistic predictions.

To compute the distance to the goal, I calculate the Euclidean distance between the (x, y) coordinate of a shot and the center of the goal line. Since I work with Statsbomb data, I use their pitch dimensions, which are [0, 120] on the x axis and [0, 80] on the y axis. Thus, the center of the goal line is at (120, 40), and the resulting distance is expressed in Statsbomb pitch units rather than meters.

Below you can see my implementation:

# Distance Feature calculation

# define goal center for 'statsbomb'
goal_center = np.array([120, 40])

# calculate distance between a shot coordinate and goal centerline coordinate
shots['distance'] = np.sqrt((shots['x_start'] - goal_center[0])**2 + (shots['y_start'] - goal_center[1])**2)
shots['distance'] = shots['distance'].round(decimals = 2)
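
If a distance in meters is preferred, one option is to rescale the coordinates first. The sketch below is only an illustration: it assumes a 105 x 68 meter pitch (the same assumption the angle calculation in this section makes), and the helper name `distance_to_goal_m` is hypothetical.

```python
import math

# Assumed real pitch size in meters (an assumption, not part of the Statsbomb data)
PITCH_LENGTH_M, PITCH_WIDTH_M = 105.0, 68.0

def distance_to_goal_m(x_sb, y_sb):
    """Distance in meters from a Statsbomb (x, y) location to the goal center."""
    x_m = x_sb * PITCH_LENGTH_M / 120
    y_m = y_sb * PITCH_WIDTH_M / 80
    return math.hypot(PITCH_LENGTH_M - x_m, PITCH_WIDTH_M / 2 - y_m)

# A central location 12 pitch units away from the goal line
print(distance_to_goal_m(108, 40))
```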

Next, I calculate the angle of a shot, i.e. the angle that the goal mouth subtends at the shot location. The task reduces to finding the angle between two sides of a triangle when all three side lengths (a, b, c) are known, which the law of cosines gives directly.

Below you can see my implementation:

# Angle Feature calculation

# transform (x, y) from Statsbomb pitch units (120 x 80) to meters (105 x 68)
x = shots['x_start'] * 105/120
y = shots['y_start'] * 68/80 

# Use trigonometric formula to find an angle between two sides (a,b) of a triangle where the third side (c) 
# is a goal line of length 7.32 meters.
a = np.sqrt((x - 105)**2 + (y - 30.34)**2) # length between right post and (x, y) shot coordinate
b = np.sqrt((x - 105)**2 + (y - 37.66)**2) # length between left post and (x, y) shot coordinate
c = 7.32 # goal line length in meters
cos_alpha = (a**2 + b**2 - c**2)/(2*a*b)
cos_alpha = np.round(cos_alpha, decimals = 4)

# the angle is left in radians (to convert to degrees, multiply the angle, not cos_alpha, by 180/pi)
shots['angle'] = np.arccos(cos_alpha)
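
The law-of-cosines step can be sanity-checked on a single shot. The sketch below is a standalone check, not the author's code: the helper `shot_angle` is hypothetical, and it compares the result for a central shot against the equivalent closed form 2 * atan(half goal width / distance).

```python
import math

GOAL_WIDTH_M = 7.32
LEFT_POST_Y, RIGHT_POST_Y = 37.66, 30.34  # goal posts on a 68 m wide pitch

def shot_angle(x_m, y_m, goal_x=105.0):
    """Angle (radians) subtended by the goal at (x_m, y_m), via the law of cosines."""
    a = math.hypot(goal_x - x_m, RIGHT_POST_Y - y_m)
    b = math.hypot(goal_x - x_m, LEFT_POST_Y - y_m)
    cos_alpha = (a**2 + b**2 - GOAL_WIDTH_M**2) / (2 * a * b)
    # Clamp to [-1, 1] to guard against floating-point drift before acos
    return math.acos(max(-1.0, min(1.0, cos_alpha)))

# Central shot 10.5 m from goal: cross-check against 2 * atan(half_width / distance)
angle = shot_angle(94.5, 34.0)
reference = 2 * math.atan((GOAL_WIDTH_M / 2) / 10.5)
print(angle, reference)
print(math.degrees(angle))
```

Both formulas agree for central shots; the law of cosines has the advantage of also covering shots taken from off-center positions.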

Now, I would like to demonstrate how both of these features impact the probability of scoring.

I fit a simple logistic regression that includes only these two features (distance, angle) and obtain a probabilistic prediction for each shot. Then, I plot each feature against the predicted probabilities to visualize the relationship.

Code

# Prepare features and labels from available data
X = shots.loc[:, ['distance', 'angle']]
y = shots.loc[:, 'outcome']

# Fit Logistic Regression Model
classifier = LogisticRegression()
classifier.fit(X, y)

# make predictions
predictions = classifier.predict_proba(X)[:, 1]


# Canvas
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 4))

# Distance plot design

# grid
ax[0].grid(color = 'black', ls = '-.', lw = 0.25, axis = "both")

# plot
ax[0].scatter(X['distance'], predictions, color = 'gray', s = 0.5, alpha = 0.4)
ax[0].set_xlabel('Distance')
ax[0].set_ylabel('Probability of scoring')

# axis adjustments
ax[0].set_ylim(0, 0.8)
ax[0].set_xlim(0, 90)
ax[0].yaxis.get_major_ticks()[0].label1.set_visible(False)
ax[0].tick_params(length = 0)

############################################

# Angle plot design

# grid
ax[1].grid(color = 'black', ls = '-.', lw = 0.25, axis = "both")

# plot
ax[1].scatter(X['angle'], predictions, color = 'orange', s = 0.5, alpha = 0.4)
ax[1].set_xlabel('Angle')
ax[1].set_ylabel('Probability of scoring')

# axis adjustments
ax[1].set_ylim(0, 0.8)
ax[1].set_xlim(0, 3.5)
ax[1].yaxis.get_major_ticks()[0].label1.set_visible(False)
ax[1].tick_params(length = 0)

ax[0].text(x = 44, y = -0.2, s = 'a)', fontsize = 12)
ax[1].text(x = 1.72, y = -0.2, s = 'b)', fontsize = 12)


plt.show()

Figure 4: Probability of scoring decreases with increasing distance (a) and increases with increasing angle (b).

As can be seen from part a) of Figure 4, the probability of scoring (also called the xG value) decreases roughly exponentially with increasing distance. In contrast, the probability of scoring increases approximately linearly with angle.
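
The shape of these curves follows from the logistic function itself. The sketch below evaluates it for a few distances; the coefficients are hypothetical values chosen for illustration, not the ones fitted above.

```python
import math

def logistic_probability(distance, angle, b0=-1.0, b_dist=-0.1, b_angle=1.5):
    """Logistic model p = 1 / (1 + exp(-(b0 + b_dist*distance + b_angle*angle))).

    The default coefficients are hypothetical, not the fitted values.
    """
    z = b0 + b_dist * distance + b_angle * angle
    return 1 / (1 + math.exp(-z))

# Probability falls as distance grows (angle held fixed at 0.5 rad)
for d in (5, 15, 30):
    print(d, round(logistic_probability(d, 0.5), 3))
```

When the linear term z is strongly negative, p is approximately exp(z), which is why the tail of the distance curve looks exponential.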

In both plots of Figure 4, there are densely populated regions that deserve a closer look, since most shots are taken from these areas. Figure 5 shows that most shots are taken at distances between 5 and 30 (in Statsbomb pitch units, since the distance was computed in those coordinates) and at angles between roughly 0 and 1 radians (about 0 to 60 degrees).

Code

# Canvas
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 4))

# Distance density plot design
ax[0].grid(color = 'black', ls = '-.', lw = 0.25, axis = "both")
sns.kdeplot(x = 'distance', data = shots, ax = ax[0], color = 'gray')
ax[0].set_xlabel('Distance')
ax[0].set_ylim(0, 0.045)
ax[0].set_yticks(np.arange(0, 0.045, 0.01))
ax[0].set_xlim(-10, 100)


# Angle density plot design
ax[1].grid(color = 'black', ls = '-.', lw = 0.25, axis = "both")
sns.kdeplot(x = 'angle', data = shots, ax = ax[1], color = 'orange')
ax[1].set_xlabel('Angle')
ax[1].set_ylim(0, 2.8)

ax[0].text(x = 44, y = -0.01, s = 'a)', fontsize = 12)
ax[1].text(x = 1.5, y = -0.63, s = 'b)', fontsize = 12)


plt.show()

Figure 5: Distribution of distances (a) and angles (b) for all executed shots.

Statistical Analysis

The features listed in the formula below are included in a logistic regression model to analyze their statistical significance. The objective is to inspect the p-value of each term and determine which features are only weakly associated with the response.

Code

# Data Preparation
X = shots.loc[:, ['play_pattern_name','under_pressure', 'distance', 'angle', 'gk_loc_x', 'gk_loc_y',
                   'follows_dribble', 'first_time', 'open_goal', 'technique', 'body_part']]
y = shots.loc[:, 'outcome']

df_train = pd.concat([X, y], axis = 1).reset_index(drop = True)

# run model
logreg_model = smf.logit(
    formula = "outcome ~ distance + angle + under_pressure + gk_loc_y + gk_loc_x + \
    body_part + open_goal + play_pattern_name",
                         data = df_train).fit()

# Extract p-values
pd.DataFrame(logreg_model.pvalues, columns = ['p-value']).round(decimals = 3)

Optimization terminated successfully.
         Current function value: 0.284721
         Iterations 8

	p-value
Intercept	0.000
body_part[T.Left Foot]	0.000
body_part[T.Other]	0.642
body_part[T.Right Foot]	0.000
play_pattern_name[T.From Counter]	0.001
play_pattern_name[T.From Free Kick]	0.000
play_pattern_name[T.From Goal Kick]	0.000
play_pattern_name[T.From Keeper]	0.203
play_pattern_name[T.From Kick Off]	0.239
play_pattern_name[T.From Throw In]	0.001
play_pattern_name[T.Other]	0.252
play_pattern_name[T.Regular Play]	0.000
distance	0.000
angle	0.000
under_pressure	0.056
gk_loc_y	0.431
gk_loc_x	0.000
open_goal	0.000

Several terms have very high p-values: the categorical levels body_part[T.Other], play_pattern_name[T.From Keeper], play_pattern_name[T.From Kick Off] and play_pattern_name[T.Other], and the numeric feature gk_loc_y. Let us analyze these features and decide whether we can drop them from the model.
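
The same terms can be picked out programmatically. The sketch below copies the p-values from the table above into a plain dictionary and filters them; the 0.1 cut-off is an assumption made for illustration, since the text does not fix a threshold.

```python
# p-values copied from the regression output above
p_values = {
    "Intercept": 0.000,
    "body_part[T.Left Foot]": 0.000,
    "body_part[T.Other]": 0.642,
    "body_part[T.Right Foot]": 0.000,
    "play_pattern_name[T.From Counter]": 0.001,
    "play_pattern_name[T.From Free Kick]": 0.000,
    "play_pattern_name[T.From Goal Kick]": 0.000,
    "play_pattern_name[T.From Keeper]": 0.203,
    "play_pattern_name[T.From Kick Off]": 0.239,
    "play_pattern_name[T.From Throw In]": 0.001,
    "play_pattern_name[T.Other]": 0.252,
    "play_pattern_name[T.Regular Play]": 0.000,
    "distance": 0.000,
    "angle": 0.000,
    "under_pressure": 0.056,
    "gk_loc_y": 0.431,
    "gk_loc_x": 0.000,
    "open_goal": 0.000,
}

threshold = 0.1  # assumed cut-off for illustration only
weak_terms = [name for name, p in p_values.items() if p > threshold]
print(weak_terms)
```

With the more conventional 0.05 threshold, under_pressure (p = 0.056) would be flagged as well.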

First, the play_pattern_name column describes the type of play during which a shot was taken. The code snippet below shows that there are 9 types of play in total.

Code

shots.loc[:, 'play_pattern_name'].unique()

array(['Regular Play', 'From Free Kick', 'From Throw In', 'From Counter',
       'From Corner', 'From Keeper', 'From Goal Kick', 'From Kick Off',
       'Other'], dtype=object)

Code

# Data preparation for barplot 1
play_types = pd.DataFrame(shots.loc[:, 'play_pattern_name'].value_counts()).reset_index()
play_types.columns = ['play_pattern_name', 'n']
play_types = play_types.sort_values(by = 'n', ascending = True)

# Data preparation for barplot 2
body_part = pd.DataFrame(shots.loc[:, 'body_part'].value_counts()).reset_index()
body_part.columns = ['body_part', 'n']
body_part = body_part.sort_values(by = 'n', ascending = True)

# Canvas
fig, ax = plt.subplots(nrows = 1, ncols = 2, figsize = (4, 4))

# Grid specs
ax[0].grid(color = 'black', ls = '-.', lw = 0.25, axis = "x")

# Main plot
pl = ax[0].barh(play_types["play_pattern_name"], play_types["n"], height = 0.6, label = 'n',
       color = 'skyblue', edgecolor = "black", zorder = 2, alpha = 0.7) 

# Barplot labels
ax[0].bar_label(pl, padding = 5, label_type='edge', fontsize = 8)

# Labels and titles
ax[0].set_ylabel("Type of Play", fontsize = 10)
ax[0].set_xlabel("# of instances", fontsize = 10)
ax[0].tick_params(axis = 'both', which = 'major', labelsize = 8)
ax[0].spines[['top', 'right']].set_visible(False)

#############################################################################

# Grid specs
ax[1].grid(color = 'black', ls = '-.', lw = 0.25, axis = "x")

# Main plot
pl2 = ax[1].barh(body_part["body_part"], body_part["n"], height = 0.4, label = 'n',
       color = 'red', edgecolor = "black", zorder = 2, alpha = 0.7) 

# Barplot labels
ax[1].bar_label(pl2, padding = 5, label_type='edge', fontsize = 8)

# Labels and titles
ax[1].set_ylabel("Body part", fontsize = 10)
ax[1].set_xlabel("# of instances", fontsize = 10)
ax[1].tick_params(axis = 'both', which = 'major', labelsize = 8)
ax[1].spines[['top', 'right']].set_visible(False)


# Set the spacing parameters between subplots
plt.subplots_adjust(left = 0.1,
                    bottom = 1.3, 
                    right = 2, 
                    top = 2, 
                    wspace = 0.65, 
                    hspace = 0.5)

plt.show()

Figure 6: Distribution of shots in different types of play (left) and implemented with different parts of the body (right).

The terms play_pattern_name[T.From Keeper], play_pattern_name[T.From Kick Off] and play_pattern_name[T.Other] that received high p-values correspond to rare events and can be treated as outliers in the model. Attacks that start from the goalkeeper or from a kick-off rarely end in a goal. Figure 6 shows that, in total, 300 shots were taken in the play types ‘From Keeper’, ‘From Kick Off’ and ‘Other’.

Code

print(np.sum(shots.loc[shots['play_pattern_name'] == 'From Keeper', 'outcome']), 
     'goals were scored when attack was initiated by a goalkeeper.',)

13 goals were scored when attack was initiated by a goalkeeper.

Code

print(np.sum(shots.loc[shots['play_pattern_name'] == 'From Kick Off', 'outcome']),
      'goals were scored when attack was initiated from a kick-off.')

10 goals were scored when attack was initiated from a kick-off.

Code

print(np.sum(shots.loc[shots['play_pattern_name'] == 'Other', 'outcome']),
      'goals were scored when attack was initiated in other scenarios.')

7 goals were scored when attack was initiated in other scenarios.

In total, only 30 of these 300 shots resulted in a goal.

A similar pattern appears in the body_part column, whose body_part[T.Other] term also received a high p-value. Out of the 11043 shots in the dataset, only 30 were taken with a body part other than the foot or head.

Code

print('Only', np.sum(shots.loc[shots['body_part'] == 'Other', 'outcome']),
      'goals out of 30 shots were scored with a body part other than foot or head.')

Only 6 goals out of 30 shots were scored with a body part other than foot or head.

Taken together, about 330 shots and 36 goals fall into these outlying conditions (slightly fewer if some shots meet both conditions). This is roughly 3% of the available data; thus, I exclude these data points from the analysis.

Code

# drop rare play patterns and shots taken with an 'Other' body part
rare_patterns = ['Other', 'From Keeper', 'From Kick Off']
shots = shots.loc[~(shots['play_pattern_name'].isin(rare_patterns) | (shots['body_part'] == 'Other')), :]
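
Before dropping rows, it can help to count what the filter removes. The sketch below rehearses the exclusion on a toy dataframe with made-up values; only the column names match the shots dataframe.

```python
import pandas as pd

# Toy data with the same column names as the shots dataframe (values are made up)
toy = pd.DataFrame({
    "play_pattern_name": ["Regular Play", "From Keeper", "Other", "From Corner", "Regular Play"],
    "body_part": ["Right Foot", "Head", "Other", "Left Foot", "Other"],
    "outcome": [0, 1, 0, 1, 0],
})

rare_patterns = ["Other", "From Keeper", "From Kick Off"]
drop_mask = toy["play_pattern_name"].isin(rare_patterns) | (toy["body_part"] == "Other")

# A row that meets both conditions is counted only once by the combined mask
print("Rows dropped:", int(drop_mask.sum()))
print("Goals dropped:", int(toy.loc[drop_mask, "outcome"].sum()))

filtered = toy.loc[~drop_mask].reset_index(drop=True)
print(filtered)
```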

Finally, the column gk_loc_y also has a high p-value. Together with gk_loc_x, it describes the location of the opposing team’s goalkeeper at the moment of the shot. A goalkeeper mostly moves along the goal line, so the y coordinate should vary much more than the x coordinate. Nevertheless, the y coordinate is less significant in the model than the x coordinate. For now, I leave both features in the dataset.

Transforming and Splitting Data

One-hot encoding is applied to all categorical variables in the dataset: body_part, technique and play_pattern_name. Note that the variables under_pressure, follows_dribble, first_time and open_goal are already stored as 0/1 boolean variables (False/True); thus, they do not need any additional preprocessing. In addition, the features distance, angle, gk_loc_x and gk_loc_y are standardized.

Code

# Prepare features and labels from available data
X = shots.loc[:, ['play_pattern_name','under_pressure', 'distance', 'angle', 'gk_loc_x', 'gk_loc_y',
                   'follows_dribble', 'first_time', 'open_goal', 'technique', 'body_part']]
y = shots.loc[:, 'outcome']

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In the next section, model selection is carried out with RandomizedSearchCV. The dataset is therefore split into X_train, y_train, X_test and y_test: k-fold cross-validation is applied to X_train, y_train to find the best hyperparameters for each type of model, and each tuned model is then evaluated on X_test, y_test to measure its performance on unseen data.

Code

# Build a column transformer
column_trans = ColumnTransformer(
    [('encode_bodyparts', OneHotEncoder(dtype='int'), ['play_pattern_name', 'technique', 'body_part']),
    ('std_coords', StandardScaler(), ['distance', 'angle', 'gk_loc_x', 'gk_loc_y'])],
    remainder = 'passthrough', verbose_feature_names_out = True)

# Transform feature columns
X_train = column_trans.fit_transform(X_train)
X_test = column_trans.transform(X_test)

Model Selection

Three classifiers are trained and evaluated on the dataset: Logistic Regression, Gradient Boosting and Random Forest. Since I am interested in probabilistic outputs, the aim is to make the predicted probabilities as accurate as possible. Thus, I use the Brier score (lower is better) as the evaluation metric in RandomizedSearchCV, through the neg_brier_score scorer. In addition, I report the ROC-AUC score of each model with its best parameters. ROC-AUC serves mainly as a supporting metric that shows the performance relative to a random guess.
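
For intuition, the Brier score is the mean squared difference between the predicted probabilities and the 0/1 outcomes. The sketch below computes it by hand on hypothetical predictions for four shots; the function `brier_score` is a plain re-implementation for illustration, not the scikit-learn routine used in the code.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical outcomes and predicted probabilities for four shots
y_true = [0, 0, 1, 0]
confident_and_right = [0.1, 0.1, 0.9, 0.1]
uninformative = [0.5, 0.5, 0.5, 0.5]

print(brier_score(y_true, confident_and_right))
print(brier_score(y_true, uninformative))
```

Well-calibrated, confident predictions score close to 0, while always predicting 0.5 scores 0.25.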

Logistic Regression

Code

# Model
model = LogisticRegression(solver = 'saga', max_iter = 200, random_state = 42)

# Hyperparameters
parameters = dict(C = uniform(loc = 0, scale = 4), 
                  penalty = ['l2', 'l1'])

# Classifier
classifier = RandomizedSearchCV(model, parameters, random_state = 42, 
                                cv = 10, scoring = 'neg_brier_score')
classifier.fit(X_train, y_train)
print('Optimal parameters are:\n', classifier.best_params_)

# Evaluate on test data
predictions = classifier.predict_proba(X_test)[:, 1]
print('Brier score = ', brier_score_loss(y_test, predictions))
print('ROC-AUC = ', roc_auc_score(y_test, predictions))

Optimal parameters are:
 {'C': 1.49816047538945, 'penalty': 'l2'}
Brier score =  0.0778472702630083
ROC-AUC =  0.7810976028573392

Gradient Boosting

Code

# Model
model = GradientBoostingClassifier(random_state = 42)

# Hyperparameters
parameters = dict(learning_rate = uniform(loc = 0.03, scale = 0.035),
                  n_estimators = randint(100, 800),
                  # note: 'deviance' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
                  loss = ['exponential', 'deviance'])
# Classifier
classifier = RandomizedSearchCV(model, parameters, random_state = 42, 
                                cv = 10, scoring = 'neg_brier_score')
classifier.fit(X_train, y_train)
print('Optimal parameters are:\n', classifier.best_params_)

# Evaluate on test data
predictions = classifier.predict_proba(X_test)[:, 1]
print('Brier score = ', brier_score_loss(y_test, predictions))
print('ROC-AUC = ', roc_auc_score(y_test, predictions))

Optimal parameters are:
 {'learning_rate': 0.055619787963399184, 'loss': 'exponential', 'n_estimators': 120}
Brier score =  0.07818749793751041
ROC-AUC =  0.7805726452954971
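
The search distributions passed to RandomizedSearchCV are scipy.stats objects, and their parameterization is easy to misread. The sketch below samples from the two distributions used for Gradient Boosting to show the ranges they cover; the sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

learning_rate_dist = uniform(loc=0.03, scale=0.035)  # samples lie in [0.03, 0.065]
n_estimators_dist = randint(100, 800)                # integers from 100 to 799

lr_samples = learning_rate_dist.rvs(size=1000, random_state=0)
n_samples = n_estimators_dist.rvs(size=1000, random_state=0)

print(lr_samples.min(), lr_samples.max())
print(n_samples.min(), n_samples.max())
```

Note that uniform takes a start (loc) and a width (scale) rather than a lower and an upper bound, and that the upper bound of randint is exclusive.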

Random Forest

Code

# Model
model = RandomForestClassifier(random_state = 42)

# Hyperparameters
parameters = dict(max_depth = randint(5, 50),
                  criterion = ['entropy', 'gini'],
                  min_samples_split = randint(2, 50))

# Classifier
classifier = RandomizedSearchCV(model, parameters, random_state = 42, 
                                cv = 10, scoring = 'neg_brier_score')
classifier.fit(X_train, y_train)
print('Optimal parameters are:\n', classifier.best_params_)

# Evaluate on test data
predictions = classifier.predict_proba(X_test)[:, 1]
print('Brier score = ', brier_score_loss(y_test, predictions))
print('ROC-AUC = ', roc_auc_score(y_test, predictions))

Optimal parameters are:
 {'criterion': 'entropy', 'max_depth': 12, 'min_samples_split': 22}
Brier score =  0.07892679615459904
ROC-AUC =  0.7761693797650938

To sum up, comparing the Brier scores shows that Logistic Regression (0.0778) outperforms Gradient Boosting (0.0782) and Random Forest (0.0789) by a small margin. Logistic Regression also achieves the highest ROC-AUC value (0.781). This value is well above 0.5, which confirms that the classifier performs much better than a random guess.

Future Improvements

The model performs well. However, there are always areas for improvement, which I briefly outline:

Add more extensive analysis in model selection and evaluation phase.
Perform more thorough analysis of opposing team’s goalkeeper coordinates.
Analyze how discarded outliers may affect model performance.
Analyze the correlation between features using different techniques such as point biserial correlation and chi square test of association.
Use the location of opposing team’s defenders to construct additional features.