Part 1 | Data Preprocessing
Preprocessing raw data with pandas
In this section, I demonstrate the steps required to preprocess the data that will be used to build an Expected Goals (xG) model. Most of the preprocessing steps are implemented using pandas.
In addition to pandas, the following packages need to be imported:
import os
import ast
import numpy as np
import pandas as pd
Several soccer match event datasets are publicly available. For this model, I decided to work with female soccer match event data because of the high granularity of event descriptions provided by the vendor, StatsBomb. This granularity can help me build a sophisticated model and design features that increase its accuracy. To learn more about other available datasets released by StatsBomb, feel free to visit this link.
There are different ways to extract data from the StatsBomb API. I prefer to work with the socceraction library, which allows me to extract data in a convenient pandas.DataFrame format.
# import StatsBomb public match event data loader from socceraction library
from socceraction.data.statsbomb import StatsBombLoader
# remove credentials warning from statsbomb api since we work with public data
import warnings
warnings.filterwarnings("ignore", message="credentials were not supplied. open data access only")

# load public StatsBomb data
stbm_data = StatsBombLoader()

# read available competitions and filter out only female related ones
competitions = stbm_data.competitions()
female_comps = competitions.loc[competitions['competition_gender'] == 'female', :].reset_index(drop = True)
female_comps
| | season_id | competition_id | competition_name | country_name | competition_gender | season_name |
---|---|---|---|---|---|---|
0 | 90 | 37 | FA Women's Super League | England | female | 2020/2021 |
1 | 42 | 37 | FA Women's Super League | England | female | 2019/2020 |
2 | 4 | 37 | FA Women's Super League | England | female | 2018/2019 |
3 | 3 | 49 | NWSL | United States of America | female | 2018 |
4 | 106 | 53 | UEFA Women's Euro | Europe | female | 2022 |
5 | 30 | 72 | Women's World Cup | International | female | 2019 |
As can be seen, data is available for four different female soccer competitions: three seasons of the FA Women’s Super League, one season of the NWSL, and two national-team tournaments, UEFA Women’s Euro 2022 and Women’s World Cup 2019.
The code below illustrates the steps required to read event data for each game from the aforementioned competitions and save it as a .csv file. I also combine all .csv files into a single all_events dataframe. Later, this will allow me to extract an event of interest from all games at once.
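One practical detail: DataFrame.to_csv does not create missing parent folders, so the per-competition folders have to exist before any game file is written. The sketch below shows one way to prepare them. The helper name make_game_dirs and the temporary base folder are assumptions made for illustration and are not part of the original workflow.

```python
import os
import tempfile

# Folder names mirror dir_names from this section; the base directory is a
# temporary folder used only for illustration (an assumption).
dir_names = ['FAWSL_2021', 'FAWSL_1920', 'FAWSL_1819', 'NWSL', 'EURO_2022', 'WC_2019']

def make_game_dirs(base_dir, names):
    """Create <base_dir>/<name>/games for every competition folder name."""
    created = []
    for name in names:
        path = os.path.join(base_dir, name, 'games')
        # exist_ok=True makes the call safe to repeat on later runs
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created

base_dir = tempfile.mkdtemp()
created_dirs = make_game_dirs(base_dir, dir_names)
print(len(created_dirs))
```

Running the helper a second time is harmless, which is convenient when the download loop is re-run after an interruption.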
# names of folders to save files
dir_names = ['FAWSL_2021', 'FAWSL_1920', 'FAWSL_1819', 'NWSL', 'EURO_2022', 'WC_2019']

# for each competition save all games as .csv files
for i, j in female_comps.loc[:, ['season_id', 'competition_id']].iterrows():
    # games() expects competition_id first, then season_id
    games = stbm_data.games(j['competition_id'], j['season_id']).loc[:, 'game_id']
    for k in games:
        events = stbm_data.events(k)
        events.to_csv(f'.data/{dir_names[i]}/games/{k}.csv', index = False)

# concatenate all events into a single data frame
all_events = pd.DataFrame()
for i in dir_names:
    games = os.listdir(f'.data/{i}/games')
    for j in games:
        df = pd.read_csv(f'.data/{i}/games/{j}')
        all_events = pd.concat([all_events, df])

# reset index and save as .csv file
all_events = all_events.reset_index(drop = True)
all_events.to_csv('.data/all_events.csv', index = False)
all_events.head(3)
| | game_id | event_id | period_id | team_id | player_id | type_id | type_name | index | timestamp | minute | ... | team_name | duration | extra | related_events | player_name | position_id | position_name | location | under_pressure | counterpress |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3764230 | 3f5dde74-d91b-44ea-9a1f-88e84da555ab | 1 | 749 | NaN | 35 | Starting XI | 1 | 1900-01-01 00:00:00.000 | 0 | ... | Tottenham Hotspur Women | 0.0 | {'tactics': {'formation': 4231, 'lineup': [{'p... | [] | NaN | NaN | NaN | NaN | False | False |
1 | 3764230 | e4fefe61-4e08-47e0-be4d-2276388e6eb4 | 1 | 972 | NaN | 35 | Starting XI | 2 | 1900-01-01 00:00:00.000 | 0 | ... | West Ham United LFC | 0.0 | {'tactics': {'formation': 433, 'lineup': [{'pl... | [] | NaN | NaN | NaN | NaN | False | False |
2 | 3764230 | ff9a99d3-3efd-45c2-8736-a8a93dd02638 | 1 | 972 | NaN | 18 | Half Start | 3 | 1900-01-01 00:00:00.000 | 0 | ... | West Ham United LFC | 0.0 | {} | ['5fb7026c-83aa-4490-96b1-a55825c4dcb8'] | NaN | NaN | NaN | NaN | False | False |
3 rows × 26 columns
There is a wide range of data describing each event. Since an xG model estimates the probability that a shot results in a goal, I keep only shot events, extract the columns relevant to them, and test these columns in the model building phase after preprocessing.
# list all features to select ones required for xG model
all_events.columns
Index(['game_id', 'event_id', 'period_id', 'team_id', 'player_id', 'type_id',
'type_name', 'index', 'timestamp', 'minute', 'second', 'possession',
'possession_team_id', 'possession_team_name', 'play_pattern_id',
'play_pattern_name', 'team_name', 'duration', 'extra', 'related_events',
'player_name', 'position_id', 'position_name', 'location',
'under_pressure', 'counterpress'],
dtype='object')
# filter event type_name = 'Shot' and leave only required columns
shots = all_events.loc[all_events['type_name'] == 'Shot',
                       ['minute', 'player_name', 'team_name', 'type_name', 'play_pattern_name',
                        'position_name', 'location', 'under_pressure', 'extra']].reset_index(drop = True)
The following columns are dropped because they are irrelevant to the model: ‘game_id’, ‘event_id’, ‘period_id’, ‘team_id’, ‘player_id’, ‘type_id’, ‘index’, ‘timestamp’, ‘second’, ‘possession’, ‘possession_team_id’, ‘possession_team_name’, ‘play_pattern_id’, ‘duration’, ‘related_events’, ‘position_id’, ‘counterpress’.
As you can see, the majority of these columns are id identifiers. For example, play_pattern_id is omitted while play_pattern_name is left in the dataframe. The rest of the columns contain time- or possession-related information that is of no use in our case.
One of the most important columns, as we will see later, is the location of the shot. I extract the required (x, y) coordinates from the given list and save them as separate columns, which simplifies their use during the feature engineering phase.
It is important to note that when dataframes are saved as .csv files, list and dictionary values are written out as raw strings. Thus, when reading these dataframes back, one needs to convert such columns into Python objects. For that, I use the ast package and, specifically, the ast.literal_eval() function, which converts the string representation of a list into a Python list object.
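To make the round trip concrete, here is a minimal sketch that uses only the standard library csv module in place of pandas (a swap made for illustration, so that the example is self-contained). It writes one row containing a list, reads it back as a string, and restores the list with ast.literal_eval. The sample values are the first shot's coordinates from this dataset.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list, as happens with 'location'
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['minute', 'location'])
writer.writerow([7, [95.9, 58.9]])

# Read the row back: every field is now a plain string
buffer.seek(0)
rows = list(csv.reader(buffer))
raw_location = rows[1][1]
print(type(raw_location).__name__, raw_location)

# ast.literal_eval parses the string back into a list without executing code
location = ast.literal_eval(raw_location)
x_start, y_start = location[0], location[1]
print(x_start, y_start)
```

Unlike eval, ast.literal_eval only accepts Python literals such as lists, dictionaries, numbers and strings, which makes it a safer choice for data read from files.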
# unlist location column into (x, y) and remove it
shots.loc[:, 'location'] = shots.loc[:, 'location'].apply(ast.literal_eval)
shots.loc[:, 'x_start'] = shots.loc[:, 'location'].apply(lambda x: x[0])
shots.loc[:, 'y_start'] = shots.loc[:, 'location'].apply(lambda x: x[1])
shots = shots.drop(columns = 'location')
shots.head()
| | minute | player_name | team_name | type_name | play_pattern_name | position_name | under_pressure | extra | x_start | y_start |
---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | Lucy Quinn | Tottenham Hotspur Women | Shot | Regular Play | Right Wing | False | {'shot': {'statsbomb_xg': 0.013642391, 'end_lo... | 95.9 | 58.9 |
1 | 10 | Rianna Dean | Tottenham Hotspur Women | Shot | From Free Kick | Center Forward | False | {'shot': {'statsbomb_xg': 0.04084396, 'end_loc... | 106.1 | 54.3 |
2 | 11 | Angela Addison | Tottenham Hotspur Women | Shot | From Free Kick | Left Wing | True | {'shot': {'statsbomb_xg': 0.13687119, 'end_loc... | 110.0 | 28.2 |
3 | 13 | Kit Graham | Tottenham Hotspur Women | Shot | From Throw In | Center Attacking Midfield | False | {'shot': {'statsbomb_xg': 0.12462413, 'end_loc... | 113.2 | 40.4 |
4 | 16 | Kit Graham | Tottenham Hotspur Women | Shot | From Counter | Center Attacking Midfield | False | {'shot': {'statsbomb_xg': 0.02380701, 'end_loc... | 95.2 | 39.8 |
The column named extra contains additional information describing the shot event. This is where we can observe that StatsBomb provides a high level of event data granularity. For example, below you can see that for each shot, the location of all players within the visible video frame, including the opposing team’s goalkeeper, is recorded. In addition, there is data about body_part, the body part with which the shot was taken; technique (which, as per the event data description guide, is the “name of the technique used for this shot”); open_goal, a boolean variable that describes whether the shot was taken with an open goal; follows_dribble, a boolean variable that describes whether the shot was taken following a dribble; and first_time, a boolean variable that describes whether the shot was taken with the first touch. Due to vendor specifications, boolean variables appear in the extra column only when their state is True; thus, I have to specify the False state for all other cases explicitly.
As you can see, the variable follows_dribble is missing from the instance of the extra column below because its state is False, while open_goal and first_time are present because their state is True.
shots.loc[:, 'extra'][148]
{'shot': {'open_goal': True,
'statsbomb_xg': 0.84770715,
'end_location': [120.0, 39.0, 0.9],
'body_part': {'id': 40, 'name': 'Right Foot'},
'type': {'id': 87, 'name': 'Open Play'},
'outcome': {'id': 97, 'name': 'Goal'},
'first_time': True,
'technique': {'id': 91, 'name': 'Half Volley'},
'freeze_frame': [{'location': [111.4, 38.3],
'player': {'id': 4647, 'name': 'So-Yun Ji'},
'position': {'id': 13, 'name': 'Right Center Midfield'},
'teammate': True},
{'location': [108.8, 42.9],
'player': {'id': 4636, 'name': 'Maria Thorisdottir'},
'position': {'id': 2, 'name': 'Right Back'},
'teammate': True},
{'location': [110.1, 53.6],
'player': {'id': 4961, 'name': 'Samantha May Kerr'},
'position': {'id': 21, 'name': 'Left Wing'},
'teammate': True},
{'location': [106.3, 51.4],
'player': {'id': 10108, 'name': 'Pernille Mosegaard Harder'},
'position': {'id': 17, 'name': 'Right Wing'},
'teammate': True},
{'location': [110.8, 35.7],
'player': {'id': 46738, 'name': 'Emma Bissell'},
'position': {'id': 16, 'name': 'Left Midfield'},
'teammate': False},
{'location': [110.1, 39.8],
'player': {'id': 36801, 'name': 'Aimee Palmer'},
'position': {'id': 13, 'name': 'Right Center Midfield'},
'teammate': False},
{'location': [116.1, 42.8],
'player': {'id': 16376, 'name': 'Sophie Baggaley'},
'position': {'id': 1, 'name': 'Goalkeeper'},
'teammate': False},
{'location': [116.1, 45.7],
'player': {'id': 16381, 'name': 'Gemma Evans'},
'position': {'id': 5, 'name': 'Left Center Back'},
'teammate': False},
{'location': [110.5, 46.1],
'player': {'id': 15618, 'name': 'Jasmine Matthews'},
'position': {'id': 3, 'name': 'Right Center Back'},
'teammate': False},
{'location': [111.2, 50.0],
'player': {'id': 24922, 'name': 'Florence Allen'},
'position': {'id': 2, 'name': 'Right Back'},
'teammate': False},
{'location': [114.5, 49.3],
'player': {'id': 24239, 'name': 'Jemma Elizabeth Purfield'},
'position': {'id': 6, 'name': 'Left Back'},
'teammate': False}]}}
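Because absent flags mean False, a dictionary lookup with a default value is one simple way to recover them. The sketch below is illustrative only: the helper name read_flags is an assumption, and the two dictionaries are trimmed-down versions of the 'shot' entries in this dataset.

```python
# Trimmed-down 'shot' dictionaries; values are taken from samples in this section
shot_with_flags = {'open_goal': True, 'first_time': True, 'statsbomb_xg': 0.84770715}
shot_without_flags = {'statsbomb_xg': 0.013642391}

BOOLEAN_FLAGS = ['follows_dribble', 'first_time', 'open_goal']

def read_flags(shot):
    """Return each flag as 0/1, treating a missing key as False."""
    return {flag: int(shot.get(flag, False)) for flag in BOOLEAN_FLAGS}

print(read_flags(shot_with_flags))
print(read_flags(shot_without_flags))
```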
In addition to the above-mentioned data, I also extract contextual information from the extra column: the type, statsbomb_xg and outcome variables. The last one is needed to know whether a shot results in a goal. The variable type will help me keep only open play situations and discard outlying conditions where a shot is taken directly from a corner, free kick, penalty or kick-off. These situations can largely skew the performance of the proposed xG model, and it is better to build a separate model that focuses only on them.
I unpack the extra column, which consists of dictionaries, extract the required data and save it as separate columns in the dataframe.
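As a small illustration of how type and outcome can be used once unpacked, the sketch below flattens the nested 'name' fields, derives a goal label and keeps only open play shots. The helper flatten_shot, the is_goal field and the second sample record (including its xG value) are hypothetical and used only for this example. The 'id' fields are left out for brevity.

```python
# Sample 'shot' dictionaries in the shape used by this dataset.
# The first follows the instance shown above; the second is made up.
shots_raw = [
    {'type': {'name': 'Open Play'}, 'outcome': {'name': 'Goal'}, 'statsbomb_xg': 0.84770715},
    {'type': {'name': 'Penalty'}, 'outcome': {'name': 'Saved'}, 'statsbomb_xg': 0.76},
]

def flatten_shot(shot):
    """Pull the nested 'name' values out and add a 0/1 goal label."""
    outcome = shot['outcome']['name']
    return {
        'type': shot['type']['name'],
        'outcome': outcome,
        'statsbomb_xg': shot['statsbomb_xg'],
        'is_goal': int(outcome == 'Goal'),  # hypothetical label name
    }

flat_shots = [flatten_shot(s) for s in shots_raw]

# Keep only open play shots, as the model targets those situations
open_play_shots = [s for s in flat_shots if s['type'] == 'Open Play']
print(open_play_shots)
```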
# convert 'extra' column to dict readable format using ast.literal_eval
shots_extra = shots.loc[:, 'extra'].apply(ast.literal_eval).reset_index(drop = True)

# specify which features to extract from 'extra' column
keys = ['follows_dribble', 'first_time', 'open_goal', 'statsbomb_xg', \
        'type', 'technique', 'body_part', 'outcome']

# save selected features in a dataframe
extra_features = pd.DataFrame(np.nan, columns = keys, index = range(shots.shape[0]))
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for i, j in shots_extra.items():
    for k in list(j['shot'].keys()):
        if k in ['type', 'technique', 'body_part', 'outcome']:
            extra_features.loc[i, k] = j['shot'][k]['name']
        elif k in keys:
            extra_features.loc[i, k] = j['shot'][k]
        elif k == 'freeze_frame':
            extra_features.loc[i, k] = [{'freeze_frame':j['shot'][k]}]
        elif (k == 'end_location'):
            extra_features.loc[i, 'end_loc'] = [{'end_loc':j['shot'][k]}]

# fill NAs with boolean = False (technically, these are not NAs but just undeclared False values)
extra_features = extra_features.fillna(value = False)
# transform columns with boolean values into integers
extra_features.loc[:, ['follows_dribble', 'first_time', 'open_goal']] = \
    extra_features.loc[:, ['follows_dribble', 'first_time', 'open_goal']].astype(int)
shots.loc[:, 'under_pressure'] = shots.loc[:, 'under_pressure'].astype(int)
shots = pd.concat([shots.drop(columns = ['extra', 'type_name']), extra_features], axis = 1)
shots.head()
| | minute | player_name | team_name | play_pattern_name | position_name | under_pressure | x_start | y_start | follows_dribble | first_time | open_goal | statsbomb_xg | type | technique | body_part | outcome | end_loc | freeze_frame |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7 | Lucy Quinn | Tottenham Hotspur Women | Regular Play | Right Wing | 0 | 95.9 | 58.9 | 0 | 0 | 0 | 0.013642 | Open Play | Normal | Left Foot | Saved | [{'end_loc': [116.7, 44.9, 1.2]}] | [{'freeze_frame': [{'location': [119.6, 42.3],... |
1 | 10 | Rianna Dean | Tottenham Hotspur Women | From Free Kick | Center Forward | 0 | 106.1 | 54.3 | 0 | 0 | 0 | 0.040844 | Open Play | Normal | Right Foot | Off T | [{'end_loc': [120.0, 41.6, 4.2]}] | [{'freeze_frame': [{'location': [118.8, 43.2],... |
2 | 11 | Angela Addison | Tottenham Hotspur Women | From Free Kick | Left Wing | 1 | 110.0 | 28.2 | 0 | 0 | 0 | 0.136871 | Open Play | Normal | Left Foot | Saved | [{'end_loc': [117.6, 36.7, 0.4]}] | [{'freeze_frame': [{'location': [111.3, 39.8],... |
3 | 13 | Kit Graham | Tottenham Hotspur Women | From Throw In | Center Attacking Midfield | 0 | 113.2 | 40.4 | 0 | 0 | 0 | 0.124624 | Open Play | Normal | Head | Post | [{'end_loc': [120.0, 37.9, 2.9]}] | [{'freeze_frame': [{'location': [105.8, 46.6],... |
4 | 16 | Kit Graham | Tottenham Hotspur Women | From Counter | Center Attacking Midfield | 0 | 95.2 | 39.8 | 0 | 0 | 0 | 0.023807 | Open Play | Normal | Left Foot | Post | [{'end_loc': [120.0, 37.3, 2.9]}] | [{'freeze_frame': [{'location': [97.8, 49.4], ... |
I also want to extract the location of the opposing team’s goalkeeper at the moment of each shot. These coordinates are contained in the freeze_frame column.
# write a custom function to unpack dictionaries within freeze_frame column
def ff_unpacking(players):
    players = players[0]['freeze_frame']
    for i in players:
        if i['position']['name'] == 'Goalkeeper' and i['teammate'] == False:
            gk_loc = i['location']
            return gk_loc

shots.loc[:, 'gk_loc'] = shots.loc[:, 'freeze_frame'].apply(lambda x: ff_unpacking(x))

# Note that there are 42 None instances where goalkeeper location was incorrectly labeled.
shots = shots.loc[~shots.loc[:, 'gk_loc'].isnull(), :].reset_index(drop = True)

# save (x, y) coordinates of a goalkeeper as separate columns
shots.loc[:, 'gk_loc_x'] = shots.loc[:, 'gk_loc'].apply(lambda x: x[0])
shots.loc[:, 'gk_loc_y'] = shots.loc[:, 'gk_loc'].apply(lambda x: x[1])
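The goalkeeper lookup can also be checked in isolation on a small freeze frame. The sketch below is a standalone variant, not the function used above: the name find_goalkeeper_location is an assumption, it takes the bare list of players rather than the wrapped [{'freeze_frame': [...]}] form, and it returns None explicitly when no opposing goalkeeper is recorded.

```python
def find_goalkeeper_location(freeze_frame):
    """Return [x, y] of the opposing goalkeeper, or None if none is recorded."""
    for player in freeze_frame:
        is_keeper = player['position']['name'] == 'Goalkeeper'
        if is_keeper and not player['teammate']:
            return player['location']
    return None

# Two entries copied from the freeze frame shown earlier in this section
freeze_frame = [
    {'location': [111.4, 38.3],
     'position': {'id': 13, 'name': 'Right Center Midfield'},
     'teammate': True},
    {'location': [116.1, 42.8],
     'position': {'id': 1, 'name': 'Goalkeeper'},
     'teammate': False},
]

gk_loc = find_goalkeeper_location(freeze_frame)
missing_gk = find_goalkeeper_location(freeze_frame[:1])
print(gk_loc, missing_gk)
```

Returning None for frames without an opposing goalkeeper is what allows such rows to be dropped afterwards with an isnull() filter.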
Finally, I save my shots dataframe as a .csv file.
shots.to_csv('.data/shots.csv', index = False)
This concludes the preprocessing stage for the data that will be used in the proposed xG model. We can now move on to the model building phase, which focuses on exploratory data analysis, feature engineering and model selection.