Part 1 | Data Preprocessing

Preprocessing raw data with pandas


Farid Musayev


September 6, 2022

In this section, I will demonstrate several steps that are required to preprocess data that will be used for building Expected Goals (xG) Model. Most of preprocessing steps will be implemented using pandas.

In addition to pandas, the following packages need to be imported:

import os
import ast
import numpy as np 
import pandas as pd

There are different publicly available soccer match event datasets. For this model, I decided to work with female soccer match event data due to a high granularity of event descriptions provided by the vendor, Statsbomb. This granularity can help me in building a sophisticated model and design features that can increase its accuracy. To learn more about other available datasets released by Statsbomb, feel free to visit this link.

To extract data from Statsbomb API, different methodologies are available. I prefer to work with socceraction library that allows me to extract data in a convenient pandas.DataFrame format.

# import wyscout public match event data loader from socceraction library
from import StatsBombLoader 

# remove credentials warning from statsbomb api since we work with public data 
import warnings
warnings.filterwarnings("ignore", message="credentials were not supplied. open data access only")

# load public wyscout data
stbm_data = StatsBombLoader()

# read available competitions and filter out only female related ones
competitions = stbm_data.competitions()
female_comps = competitions.loc[competitions['competition_gender'] == 'female', :].reset_index(drop = True)
season_id competition_id competition_name country_name competition_gender season_name
0 90 37 FA Women's Super League England female 2020/2021
1 42 37 FA Women's Super League England female 2019/2020
2 4 37 FA Women's Super League England female 2018/2019
3 3 49 NWSL United States of America female 2018
4 106 53 UEFA Women's Euro Europe female 2022
5 30 72 Women's World Cup International female 2019

As can be seen, data is available for four different female soccer competitions. Three seasons of FA Women’s Super League, one season of NWSL and two competitions involving national teams, UEFA Women’s Euro 2022 and Women’s World Cup 2019.

Below code illustrates steps required to read event data on each game from the aforementioned competitions and save it as .csv file. I also save all .csv files into a single all_events dataframe. Later, this will allow me to extract an event of interest from all games at once.

# names of folders to save files
dir_names = ['FAWSL_2021', 'FAWSL_1920', 'FAWSL_1819', 'NWSL', 'EURO_2022', 'WC_2019']

# for each competition save all games as .csv files
for i, j in female_comps.loc[:, ['season_id', 'competition_id']].iterrows():
    # j[0] = season_id, j[1] = competition_id
    games =[1], j[0]).loc[:, 'game_id']
    for k in games:
        events =
        events.to_csv(f'.data/{dir_names[i]}/games/{k}.csv', index = False)

# concatenate all events into a single data frame
all_events = pd.DataFrame()
for i in dir_names:
    games = os.listdir(f'.data/{i}/games')
    for j in games:
        df = pd.read_csv(f'.data/{i}/games/{j}')
        all_events = pd.concat([all_events, df])

# rest index and save as .csv file
all_events = all_events.reset_index(drop = True)
all_events.to_csv('.data/all_events.csv', index = False)
game_id event_id period_id team_id player_id type_id type_name index timestamp minute ... team_name duration extra related_events player_name position_id position_name location under_pressure counterpress
0 3764230 3f5dde74-d91b-44ea-9a1f-88e84da555ab 1 749 NaN 35 Starting XI 1 1900-01-01 00:00:00.000 0 ... Tottenham Hotspur Women 0.0 {'tactics': {'formation': 4231, 'lineup': [{'p... [] NaN NaN NaN NaN False False
1 3764230 e4fefe61-4e08-47e0-be4d-2276388e6eb4 1 972 NaN 35 Starting XI 2 1900-01-01 00:00:00.000 0 ... West Ham United LFC 0.0 {'tactics': {'formation': 433, 'lineup': [{'pl... [] NaN NaN NaN NaN False False
2 3764230 ff9a99d3-3efd-45c2-8736-a8a93dd02638 1 972 NaN 18 Half Start 3 1900-01-01 00:00:00.000 0 ... West Ham United LFC 0.0 {} ['5fb7026c-83aa-4490-96b1-a55825c4dcb8'] NaN NaN NaN NaN False False

3 rows × 26 columns

There is a wide range of data describing each event. Since xG model evaluates the probability of a shot to result in a goal, I can filter only shot events, extract columns of interest to this event and test these columns after preprocessing in the model building phase.

# list all features to select ones required for xG model
Index(['game_id', 'event_id', 'period_id', 'team_id', 'player_id', 'type_id',
       'type_name', 'index', 'timestamp', 'minute', 'second', 'possession',
       'possession_team_id', 'possession_team_name', 'play_pattern_id',
       'play_pattern_name', 'team_name', 'duration', 'extra', 'related_events',
       'player_name', 'position_id', 'position_name', 'location',
       'under_pressure', 'counterpress'],
# filter event type_name = 'Shot' and leave only required columns 
shots = all_events.loc[all_events['type_name'] == 'Shot', 
                       ['minute', 'player_name', 'team_name', 'type_name', 'play_pattern_name', 
                        'position_name', 'location', 'under_pressure', 'extra']].reset_index(drop = True)

The following columns are dropped due to their irrelevance to the context of the model: ‘game_id’, ‘event_id’, ‘period_id’, ‘team_id’, ‘player_id’, ‘type_id’, ‘index’, ‘timestamp’, ‘minute’, ‘second’, ‘possession’, ‘possession_team_id’, ‘possession_team_name’, ‘play_pattern_id’, ‘duration’, ‘related_events’, ‘position_id’.

As you can see, a majority of these events are id identifiers. For example, play_pattern_id is ommited while play_pattern is left in the dataframe. The rest of the columns include time- or possession-related information which will not make any use in our case.

One of the most important columns, as we will see later, is location of the shot. I extract required (x, y) coordinates from a given list and save them as separate columns for a simpler use case during feature engineering phase.

It is important to note that when dataframes are saved as .csv files, all of them are converted into a raw string format. Thus, when reading these dataframes, one needs to convert columns containing specific datatypes into a python readable format. For that, I use ast package and, specifically, ast.literal_eval() function. This allows me to convert a string of a list into a python readable list object.

# unlist location column into (x, y) and remove it
shots.loc[:, 'location'] = shots.loc[:, 'location'].apply(ast.literal_eval)
shots.loc[:, 'x_start'] = shots.loc[:, 'location'].apply(lambda x: x[0])
shots.loc[:, 'y_start'] = shots.loc[:, 'location'].apply(lambda x: x[1])
shots = shots.drop(columns = 'location')
minute player_name team_name type_name play_pattern_name position_name under_pressure extra x_start y_start
0 7 Lucy Quinn Tottenham Hotspur Women Shot Regular Play Right Wing False {'shot': {'statsbomb_xg': 0.013642391, 'end_lo... 95.9 58.9
1 10 Rianna Dean Tottenham Hotspur Women Shot From Free Kick Center Forward False {'shot': {'statsbomb_xg': 0.04084396, 'end_loc... 106.1 54.3
2 11 Angela Addison Tottenham Hotspur Women Shot From Free Kick Left Wing True {'shot': {'statsbomb_xg': 0.13687119, 'end_loc... 110.0 28.2
3 13 Kit Graham Tottenham Hotspur Women Shot From Throw In Center Attacking Midfield False {'shot': {'statsbomb_xg': 0.12462413, 'end_loc... 113.2 40.4
4 16 Kit Graham Tottenham Hotspur Women Shot From Counter Center Attacking Midfield False {'shot': {'statsbomb_xg': 0.02380701, 'end_loc... 95.2 39.8

The column named extra contains additional information describing shot event. This is where we can observe that Statsbomb provides a high level of event data granularity. For example, below you can see that for each shot, the location of all players, specifically opposing team’s goalkeeper, within a visible video frame is recorded. In addition, there is data about body_part with which a shot was implemented, technique (which as per event data description guide is “name of the technique used for this shot”), open_goal which is a boolean variable that describes if a shot was taken with an open goal, follows_dribble which is a boolean variable that describes if a taken shot was followed by dribble or not and first_time which is a boolean variable that describes if a shot was taken with the first touch or not. Due to vendor specifications, only boolean variables with True state appear in extra column; thus, I have to specify False state for all other cases explicitly.

As you can see, variables follows_dribble and open_goal are missing from below instance of extra column due to False state.

shots.loc[:, 'extra'][148]
{'shot': {'open_goal': True,
  'statsbomb_xg': 0.84770715,
  'end_location': [120.0, 39.0, 0.9],
  'body_part': {'id': 40, 'name': 'Right Foot'},
  'type': {'id': 87, 'name': 'Open Play'},
  'outcome': {'id': 97, 'name': 'Goal'},
  'first_time': True,
  'technique': {'id': 91, 'name': 'Half Volley'},
  'freeze_frame': [{'location': [111.4, 38.3],
    'player': {'id': 4647, 'name': 'So-Yun Ji'},
    'position': {'id': 13, 'name': 'Right Center Midfield'},
    'teammate': True},
   {'location': [108.8, 42.9],
    'player': {'id': 4636, 'name': 'Maria Thorisdottir'},
    'position': {'id': 2, 'name': 'Right Back'},
    'teammate': True},
   {'location': [110.1, 53.6],
    'player': {'id': 4961, 'name': 'Samantha May Kerr'},
    'position': {'id': 21, 'name': 'Left Wing'},
    'teammate': True},
   {'location': [106.3, 51.4],
    'player': {'id': 10108, 'name': 'Pernille Mosegaard Harder'},
    'position': {'id': 17, 'name': 'Right Wing'},
    'teammate': True},
   {'location': [110.8, 35.7],
    'player': {'id': 46738, 'name': 'Emma Bissell'},
    'position': {'id': 16, 'name': 'Left Midfield'},
    'teammate': False},
   {'location': [110.1, 39.8],
    'player': {'id': 36801, 'name': 'Aimee Palmer'},
    'position': {'id': 13, 'name': 'Right Center Midfield'},
    'teammate': False},
   {'location': [116.1, 42.8],
    'player': {'id': 16376, 'name': 'Sophie Baggaley'},
    'position': {'id': 1, 'name': 'Goalkeeper'},
    'teammate': False},
   {'location': [116.1, 45.7],
    'player': {'id': 16381, 'name': 'Gemma Evans'},
    'position': {'id': 5, 'name': 'Left Center Back'},
    'teammate': False},
   {'location': [110.5, 46.1],
    'player': {'id': 15618, 'name': 'Jasmine Matthews'},
    'position': {'id': 3, 'name': 'Right Center Back'},
    'teammate': False},
   {'location': [111.2, 50.0],
    'player': {'id': 24922, 'name': 'Florence Allen'},
    'position': {'id': 2, 'name': 'Right Back'},
    'teammate': False},
   {'location': [114.5, 49.3],
    'player': {'id': 24239, 'name': 'Jemma Elizabeth Purfield'},
    'position': {'id': 6, 'name': 'Left Back'},
    'teammate': False}]}}

In addition to above-mentioned data, I also extract contenxtual information from extra column. These are type, statsbomb_xg and outcome variables. The last one is important for knowing if a taken shot results in a goal or not. The variable type will help me to filter out only open play situations and discard outlying conditions where a shot is taken directly from corner, free-kick, penalty or kick-off. These are situations that can largely skew performance of the proposed xG model, and it is better to build a separate model that focuses only on them.

I unpack extra column that consists of dictionaries, extract required data and save it as separate columns in the dataframe.

# convert 'extra' column to dict readable format using ast.literal_eval
shots_extra = shots.loc[:, 'extra'].apply(ast.literal_eval).reset_index(drop = True)

# specify which features to extract from 'extra' column
keys = ['follows_dribble', 'first_time', 'open_goal', 'statsbomb_xg', \
        'type', 'technique', 'body_part', 'outcome']
# save selected features in a dataframe
extra_features = pd.DataFrame(np.nan, columns = keys, index = range(shots.shape[0]))
for i, j in shots_extra.iteritems():
    for k in list(j['shot'].keys()):
        if k in ['type', 'technique', 'body_part', 'outcome']:
            extra_features.loc[i, k] = j['shot'][k]['name']
        elif k in keys:
            extra_features.loc[i, k] = j['shot'][k]
        elif k == 'freeze_frame':
            extra_features.loc[i, k] = [{'freeze_frame':j['shot'][k]}]
        elif (k == 'end_location'):
            extra_features.loc[i, 'end_loc'] = [{'end_loc':j['shot'][k]}]
# fill NAs with boolean = False (technically, these are not NAs but just undeclared False values)
extra_features = extra_features.fillna(value = False)
# transform columns with boolean values into integers 
extra_features.loc[:, ['follows_dribble', 'first_time', 'open_goal']] = \
extra_features.loc[:, ['follows_dribble', 'first_time', 'open_goal']].astype(int)
shots.loc[:, 'under_pressure'] = shots.loc[:, 'under_pressure'].astype(int)
shots = pd.concat([shots.drop(columns = ['extra', 'type_name']), extra_features], axis = 1)
minute player_name team_name play_pattern_name position_name under_pressure x_start y_start follows_dribble first_time open_goal statsbomb_xg type technique body_part outcome end_loc freeze_frame
0 7 Lucy Quinn Tottenham Hotspur Women Regular Play Right Wing 0 95.9 58.9 0 0 0 0.013642 Open Play Normal Left Foot Saved [{'end_loc': [116.7, 44.9, 1.2]}] [{'freeze_frame': [{'location': [119.6, 42.3],...
1 10 Rianna Dean Tottenham Hotspur Women From Free Kick Center Forward 0 106.1 54.3 0 0 0 0.040844 Open Play Normal Right Foot Off T [{'end_loc': [120.0, 41.6, 4.2]}] [{'freeze_frame': [{'location': [118.8, 43.2],...
2 11 Angela Addison Tottenham Hotspur Women From Free Kick Left Wing 1 110.0 28.2 0 0 0 0.136871 Open Play Normal Left Foot Saved [{'end_loc': [117.6, 36.7, 0.4]}] [{'freeze_frame': [{'location': [111.3, 39.8],...
3 13 Kit Graham Tottenham Hotspur Women From Throw In Center Attacking Midfield 0 113.2 40.4 0 0 0 0.124624 Open Play Normal Head Post [{'end_loc': [120.0, 37.9, 2.9]}] [{'freeze_frame': [{'location': [105.8, 46.6],...
4 16 Kit Graham Tottenham Hotspur Women From Counter Center Attacking Midfield 0 95.2 39.8 0 0 0 0.023807 Open Play Normal Left Foot Post [{'end_loc': [120.0, 37.3, 2.9]}] [{'freeze_frame': [{'location': [97.8, 49.4], ...

Also, I would like to extract opposing team’s goalkeeper location during each executed shot. These coordinates are contained in freeze_frame column.

# write a custom function to unpack dictionaries within freeze_frame column
def ff_unpacking(players):
    players = players[0]['freeze_frame']
    for i in players:
        if i['position']['name'] == 'Goalkeeper' and i['teammate'] == False:
            gk_loc = i['location']
            return gk_loc

shots.loc[:, 'gk_loc'] = shots.loc[:, 'freeze_frame'].apply(lambda x: ff_unpacking(x))

# Note that there are 42 None instances where goalkeeper location was incorrectly labeled.
shots = shots.loc[~shots.loc[:, 'gk_loc'].isnull(), :].reset_index(drop = True)

# save (x, y) coordinates of a goalkeeper as separate columns
shots.loc[:, 'gk_loc_x'] = shots.loc[:, 'gk_loc'].apply(lambda x: x[0])
shots.loc[:, 'gk_loc_y'] = shots.loc[:, 'gk_loc'].apply(lambda x: x[1])

Finally, I save my shots dataframe as .csv file.

shots.to_csv('.data/shots.csv', index = False)

This is the end of preprocessing stage for data that will be used in the proposed xG model. Now, we can move on to the model building phase that will focus on exploratory data analysis, feature engineering and model selection.