Part 1 | Data Preprocessing

Preprocessing raw data with pandas

Author

Farid Musayev

Published

September 6, 2022

In this section, I will demonstrate several steps that are required to preprocess data that will be used for building Expected Goals (xG) Model. Most of preprocessing steps will be implemented using pandas.

In addition to pandas, the following packages need to be imported:

import os
import ast
import numpy as np 
import pandas as pd

There are different publicly available soccer match event datasets. For this model, I decided to work with female soccer match event data due to a high granularity of event descriptions provided by the vendor, Statsbomb. This granularity can help me in building a sophisticated model and design features that can increase its accuracy. To learn more about other available datasets released by Statsbomb, feel free to visit this link.

To extract data from Statsbomb API, different methodologies are available. I prefer to work with socceraction library that allows me to extract data in a convenient pandas.DataFrame format.

# import wyscout public match event data loader from socceraction library
from socceraction.data.statsbomb import StatsBombLoader 

# remove credentials warning from statsbomb api since we work with public data 
import warnings
warnings.filterwarnings("ignore", message="credentials were not supplied. open data access only")

# load public wyscout data
stbm_data = StatsBombLoader()

# read available competitions and filter out only female related ones
competitions = stbm_data.competitions()
female_comps = competitions.loc[competitions['competition_gender'] == 'female', :].reset_index(drop = True)
female_comps

	season_id	competition_id	competition_name	country_name	competition_gender	season_name
0	90	37	FA Women's Super League	England	female	2020/2021
1	42	37	FA Women's Super League	England	female	2019/2020
2	4	37	FA Women's Super League	England	female	2018/2019
3	3	49	NWSL	United States of America	female	2018
4	106	53	UEFA Women's Euro	Europe	female	2022
5	30	72	Women's World Cup	International	female	2019

As can be seen, data is available for four different female soccer competitions. Three seasons of FA Women’s Super League, one season of NWSL and two competitions involving national teams, UEFA Women’s Euro 2022 and Women’s World Cup 2019.

Below code illustrates steps required to read event data on each game from the aforementioned competitions and save it as .csv file. I also save all .csv files into a single all_events dataframe. Later, this will allow me to extract an event of interest from all games at once.

# names of folders to save files
dir_names = ['FAWSL_2021', 'FAWSL_1920', 'FAWSL_1819', 'NWSL', 'EURO_2022', 'WC_2019']

# for each competition save all games as .csv files
for i, j in female_comps.loc[:, ['season_id', 'competition_id']].iterrows():
    # j[0] = season_id, j[1] = competition_id
    games = stbm_data.games(j[1], j[0]).loc[:, 'game_id']
    for k in games:
        events = stbm_data.events(k)
        events.to_csv(f'.data/{dir_names[i]}/games/{k}.csv', index = False)

# concatenate all events into a single data frame
all_events = pd.DataFrame()
for i in dir_names:
    games = os.listdir(f'.data/{i}/games')
    for j in games:
        df = pd.read_csv(f'.data/{i}/games/{j}')
        all_events = pd.concat([all_events, df])

# rest index and save as .csv file
all_events = all_events.reset_index(drop = True)
all_events.to_csv('.data/all_events.csv', index = False)

all_events.head(3)

	game_id	event_id	period_id	team_id	player_id	type_id	type_name	index	timestamp	...	team_name	extra	related_events	player_name	position_id	position_name	location	under_pressure	counterpress
0	3764230	3f5dde74-d91b-44ea-9a1f-88e84da555ab	1	749	NaN	35	Starting XI	1	1900-01-01 00:00:00.000	...	Tottenham Hotspur Women	{'tactics': {'formation': 4231, 'lineup': [{'p...	[]	NaN	NaN	NaN	NaN	False	False
1	3764230	e4fefe61-4e08-47e0-be4d-2276388e6eb4	1	972	NaN	35	Starting XI	2	1900-01-01 00:00:00.000	...	West Ham United LFC	{'tactics': {'formation': 433, 'lineup': [{'pl...	[]	NaN	NaN	NaN	NaN	False	False
2	3764230	ff9a99d3-3efd-45c2-8736-a8a93dd02638	1	972	NaN	18	Half Start	3	1900-01-01 00:00:00.000	...	West Ham United LFC	{}	['5fb7026c-83aa-4490-96b1-a55825c4dcb8']	NaN	NaN	NaN	NaN	False	False

3 rows × 26 columns

There is a wide range of data describing each event. Since xG model evaluates the probability of a shot to result in a goal, I can filter only shot events, extract columns of interest to this event and test these columns after preprocessing in the model building phase.

# list all features to select ones required for xG model
all_events.columns

Index(['game_id', 'event_id', 'period_id', 'team_id', 'player_id', 'type_id',
       'type_name', 'index', 'timestamp', 'minute', 'second', 'possession',
       'possession_team_id', 'possession_team_name', 'play_pattern_id',
       'play_pattern_name', 'team_name', 'duration', 'extra', 'related_events',
       'player_name', 'position_id', 'position_name', 'location',
       'under_pressure', 'counterpress'],
      dtype='object')

# filter event type_name = 'Shot' and leave only required columns 
shots = all_events.loc[all_events['type_name'] == 'Shot', 
                       ['minute', 'player_name', 'team_name', 'type_name', 'play_pattern_name', 
                        'position_name', 'location', 'under_pressure', 'extra']].reset_index(drop = True)

The following columns are dropped due to their irrelevance to the context of the model: ‘game_id’, ‘event_id’, ‘period_id’, ‘team_id’, ‘player_id’, ‘type_id’, ‘index’, ‘timestamp’, ‘minute’, ‘second’, ‘possession’, ‘possession_team_id’, ‘possession_team_name’, ‘play_pattern_id’, ‘duration’, ‘related_events’, ‘position_id’.

As you can see, a majority of these events are id identifiers. For example, play_pattern_id is ommited while play_pattern is left in the dataframe. The rest of the columns include time- or possession-related information which will not make any use in our case.

One of the most important columns, as we will see later, is location of the shot. I extract required (x, y) coordinates from a given list and save them as separate columns for a simpler use case during feature engineering phase.

It is important to note that when dataframes are saved as .csv files, all of them are converted into a raw string format. Thus, when reading these dataframes, one needs to convert columns containing specific datatypes into a python readable format. For that, I use ast package and, specifically, ast.literal_eval() function. This allows me to convert a string of a list into a python readable list object.

# unlist location column into (x, y) and remove it
shots.loc[:, 'location'] = shots.loc[:, 'location'].apply(ast.literal_eval)
shots.loc[:, 'x_start'] = shots.loc[:, 'location'].apply(lambda x: x[0])
shots.loc[:, 'y_start'] = shots.loc[:, 'location'].apply(lambda x: x[1])
shots = shots.drop(columns = 'location')

shots.head()

	minute	player_name	team_name	type_name	play_pattern_name	position_name	under_pressure	extra	x_start	y_start
0	7	Lucy Quinn	Tottenham Hotspur Women	Shot	Regular Play	Right Wing	False	{'shot': {'statsbomb_xg': 0.013642391, 'end_lo...	95.9	58.9
1	10	Rianna Dean	Tottenham Hotspur Women	Shot	From Free Kick	Center Forward	False	{'shot': {'statsbomb_xg': 0.04084396, 'end_loc...	106.1	54.3
2	11	Angela Addison	Tottenham Hotspur Women	Shot	From Free Kick	Left Wing	True	{'shot': {'statsbomb_xg': 0.13687119, 'end_loc...	110.0	28.2
3	13	Kit Graham	Tottenham Hotspur Women	Shot	From Throw In	Center Attacking Midfield	False	{'shot': {'statsbomb_xg': 0.12462413, 'end_loc...	113.2	40.4
4	16	Kit Graham	Tottenham Hotspur Women	Shot	From Counter	Center Attacking Midfield	False	{'shot': {'statsbomb_xg': 0.02380701, 'end_loc...	95.2	39.8

The column named extra contains additional information describing shot event. This is where we can observe that Statsbomb provides a high level of event data granularity. For example, below you can see that for each shot, the location of all players, specifically opposing team’s goalkeeper, within a visible video frame is recorded. In addition, there is data about body_part with which a shot was implemented, technique (which as per event data description guide is “name of the technique used for this shot”), open_goal which is a boolean variable that describes if a shot was taken with an open goal, follows_dribble which is a boolean variable that describes if a taken shot was followed by dribble or not and first_time which is a boolean variable that describes if a shot was taken with the first touch or not. Due to vendor specifications, only boolean variables with True state appear in extra column; thus, I have to specify False state for all other cases explicitly.

As you can see, variables follows_dribble and open_goal are missing from below instance of extra column due to False state.

shots.loc[:, 'extra'][148]

{'shot': {'open_goal': True,
  'statsbomb_xg': 0.84770715,
  'end_location': [120.0, 39.0, 0.9],
  'body_part': {'id': 40, 'name': 'Right Foot'},
  'type': {'id': 87, 'name': 'Open Play'},
  'outcome': {'id': 97, 'name': 'Goal'},
  'first_time': True,
  'technique': {'id': 91, 'name': 'Half Volley'},
  'freeze_frame': [{'location': [111.4, 38.3],
    'player': {'id': 4647, 'name': 'So-Yun Ji'},
    'position': {'id': 13, 'name': 'Right Center Midfield'},
    'teammate': True},
   {'location': [108.8, 42.9],
    'player': {'id': 4636, 'name': 'Maria Thorisdottir'},
    'position': {'id': 2, 'name': 'Right Back'},
    'teammate': True},
   {'location': [110.1, 53.6],
    'player': {'id': 4961, 'name': 'Samantha May Kerr'},
    'position': {'id': 21, 'name': 'Left Wing'},
    'teammate': True},
   {'location': [106.3, 51.4],
    'player': {'id': 10108, 'name': 'Pernille Mosegaard Harder'},
    'position': {'id': 17, 'name': 'Right Wing'},
    'teammate': True},
   {'location': [110.8, 35.7],
    'player': {'id': 46738, 'name': 'Emma Bissell'},
    'position': {'id': 16, 'name': 'Left Midfield'},
    'teammate': False},
   {'location': [110.1, 39.8],
    'player': {'id': 36801, 'name': 'Aimee Palmer'},
    'position': {'id': 13, 'name': 'Right Center Midfield'},
    'teammate': False},
   {'location': [116.1, 42.8],
    'player': {'id': 16376, 'name': 'Sophie Baggaley'},
    'position': {'id': 1, 'name': 'Goalkeeper'},
    'teammate': False},
   {'location': [116.1, 45.7],
    'player': {'id': 16381, 'name': 'Gemma Evans'},
    'position': {'id': 5, 'name': 'Left Center Back'},
    'teammate': False},
   {'location': [110.5, 46.1],
    'player': {'id': 15618, 'name': 'Jasmine Matthews'},
    'position': {'id': 3, 'name': 'Right Center Back'},
    'teammate': False},
   {'location': [111.2, 50.0],
    'player': {'id': 24922, 'name': 'Florence Allen'},
    'position': {'id': 2, 'name': 'Right Back'},
    'teammate': False},
   {'location': [114.5, 49.3],
    'player': {'id': 24239, 'name': 'Jemma Elizabeth Purfield'},
    'position': {'id': 6, 'name': 'Left Back'},
    'teammate': False}]}}

In addition to above-mentioned data, I also extract contenxtual information from extra column. These are type, statsbomb_xg and outcome variables. The last one is important for knowing if a taken shot results in a goal or not. The variable type will help me to filter out only open play situations and discard outlying conditions where a shot is taken directly from corner, free-kick, penalty or kick-off. These are situations that can largely skew performance of the proposed xG model, and it is better to build a separate model that focuses only on them.

I unpack extra column that consists of dictionaries, extract required data and save it as separate columns in the dataframe.

# convert 'extra' column to dict readable format using ast.literal_eval
shots_extra = shots.loc[:, 'extra'].apply(ast.literal_eval).reset_index(drop = True)

# specify which features to extract from 'extra' column
keys = ['follows_dribble', 'first_time', 'open_goal', 'statsbomb_xg', \
        'type', 'technique', 'body_part', 'outcome']
# save selected features in a dataframe
extra_features = pd.DataFrame(np.nan, columns = keys, index = range(shots.shape[0]))
for i, j in shots_extra.iteritems():
    for k in list(j['shot'].keys()):
        if k in ['type', 'technique', 'body_part', 'outcome']:
            extra_features.loc[i, k] = j['shot'][k]['name']
        elif k in keys:
            extra_features.loc[i, k] = j['shot'][k]
        elif k == 'freeze_frame':
            extra_features.loc[i, k] = [{'freeze_frame':j['shot'][k]}]
        elif (k == 'end_location'):
            extra_features.loc[i, 'end_loc'] = [{'end_loc':j['shot'][k]}]
            
        
# fill NAs with boolean = False (technically, these are not NAs but just undeclared False values)
extra_features = extra_features.fillna(value = False)
# transform columns with boolean values into integers 
extra_features.loc[:, ['follows_dribble', 'first_time', 'open_goal']] = \
extra_features.loc[:, ['follows_dribble', 'first_time', 'open_goal']].astype(int)
shots.loc[:, 'under_pressure'] = shots.loc[:, 'under_pressure'].astype(int)

shots = pd.concat([shots.drop(columns = ['extra', 'type_name']), extra_features], axis = 1)
shots.head()

	minute	player_name	team_name	play_pattern_name	position_name	under_pressure	x_start	y_start	statsbomb_xg	type	technique	body_part	outcome	end_loc	freeze_frame
0	7	Lucy Quinn	Tottenham Hotspur Women	Regular Play	Right Wing	0	95.9	58.9	0.013642	Open Play	Normal	Left Foot	Saved	[{'end_loc': [116.7, 44.9, 1.2]}]	[{'freeze_frame': [{'location': [119.6, 42.3],...
1	10	Rianna Dean	Tottenham Hotspur Women	From Free Kick	Center Forward	0	106.1	54.3	0.040844	Open Play	Normal	Right Foot	Off T	[{'end_loc': [120.0, 41.6, 4.2]}]	[{'freeze_frame': [{'location': [118.8, 43.2],...
2	11	Angela Addison	Tottenham Hotspur Women	From Free Kick	Left Wing	1	110.0	28.2	0.136871	Open Play	Normal	Left Foot	Saved	[{'end_loc': [117.6, 36.7, 0.4]}]	[{'freeze_frame': [{'location': [111.3, 39.8],...
3	13	Kit Graham	Tottenham Hotspur Women	From Throw In	Center Attacking Midfield	0	113.2	40.4	0.124624	Open Play	Normal	Head	Post	[{'end_loc': [120.0, 37.9, 2.9]}]	[{'freeze_frame': [{'location': [105.8, 46.6],...
4	16	Kit Graham	Tottenham Hotspur Women	From Counter	Center Attacking Midfield	0	95.2	39.8	0.023807	Open Play	Normal	Left Foot	Post	[{'end_loc': [120.0, 37.3, 2.9]}]	[{'freeze_frame': [{'location': [97.8, 49.4], ...

Also, I would like to extract opposing team’s goalkeeper location during each executed shot. These coordinates are contained in freeze_frame column.

# write a custom function to unpack dictionaries within freeze_frame column
def ff_unpacking(players):
    players = players[0]['freeze_frame']
    for i in players:
        if i['position']['name'] == 'Goalkeeper' and i['teammate'] == False:
            gk_loc = i['location']
            return gk_loc

shots.loc[:, 'gk_loc'] = shots.loc[:, 'freeze_frame'].apply(lambda x: ff_unpacking(x))

# Note that there are 42 None instances where goalkeeper location was incorrectly labeled.
shots = shots.loc[~shots.loc[:, 'gk_loc'].isnull(), :].reset_index(drop = True)

# save (x, y) coordinates of a goalkeeper as separate columns
shots.loc[:, 'gk_loc_x'] = shots.loc[:, 'gk_loc'].apply(lambda x: x[0])
shots.loc[:, 'gk_loc_y'] = shots.loc[:, 'gk_loc'].apply(lambda x: x[1])

Finally, I save my shots dataframe as .csv file.

shots.to_csv('.data/shots.csv', index = False)

This is the end of preprocessing stage for data that will be used in the proposed xG model. Now, we can move on to the model building phase that will focus on exploratory data analysis, feature engineering and model selection.