"If football were a formal language, it would be the most popular in the world..."

In this work, I combine two of my biggest passions, football and data science, to model the language we love and use every day: the language of football. Using state-of-the-art algorithms and StatsBomb data, I build a model that describes actions and players and locates them in a football semantic space. It is a first step in ongoing research into modeling this world. The whole process was pure fun, and I hope you’ll enjoy reading about it.

It starts with football events data
made openly available by StatsBomb

Each event is a row within a DataFrame

> 3M Events
900 Matches

Events are interpreted

& encoded into words

Action type

Location

Action attributes

< Pass >

(3, 4)

diagonal | on-ground medium length
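The encoding step above can be sketched in a few lines. This is only an illustration of the idea, not the project's actual encoder: the 120 x 80 pitch size follows StatsBomb's coordinate convention, while the 5 x 5 grid, the distance and angle thresholds, the token layout and the names `bin_location`, `pass_length`, `pass_direction` and `encode_pass` are all assumptions made for this example.

```python
import math

# Assumed pitch dimensions (StatsBomb coordinates) and a hypothetical 5x5 grid.
PITCH_LENGTH, PITCH_WIDTH = 120.0, 80.0
X_BINS, Y_BINS = 5, 5


def bin_location(x, y):
    """Map pitch coordinates to a 1-based (column, row) grid cell."""
    col = min(int(x / PITCH_LENGTH * X_BINS), X_BINS - 1) + 1
    row = min(int(y / PITCH_WIDTH * Y_BINS), Y_BINS - 1) + 1
    return col, row


def pass_length(start, end):
    """Bucket the pass distance; the thresholds are illustrative only."""
    distance = math.dist(start, end)
    if distance < 15:
        return "short"
    if distance < 30:
        return "medium"
    return "long"


def pass_direction(start, end):
    """Bucket the pass angle, where 0 degrees means straight forward."""
    dx, dy = end[0] - start[0], end[1] - start[1]
    angle = abs(math.degrees(math.atan2(dy, dx)))  # in the range 0..180
    if angle < 22.5:
        return "forward"
    if angle > 157.5:
        return "backward"
    if 67.5 <= angle <= 112.5:
        return "sideways"
    return "diagonal"


def encode_pass(start, end, height):
    """Build one 'word': action type, grid cell, then pass attributes."""
    col, row = bin_location(*start)
    direction = pass_direction(start, end)
    length = pass_length(start, end)
    return f"<pass>({col},{row})|{direction}|{height}|{length}"


word = encode_pass(start=(60.0, 50.0), end=(75.0, 65.0), height="on-ground")
print(word)
```

Discretizing location and attributes keeps the vocabulary finite, so that similar events collapse into the same word and the model sees each word often enough to learn from it.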

Words are encoded & fed into models

Powerful Natural Language Processing Models

Action2Vec

Player2Vec

powered by Gensim

Pass to back

Cross to box

Shot saved

Leo Messi

Erling Haaland

Virgil Van Dijk

Player-matches

aggregation

Transforming each action to a 32-dim vector, based on football semantics

Transforming each player to a 32-dim vector, based on his actions during games

 
 

Evaluate, interpret, explain

Understand players through player similarities & equations

Selected player: Andres Iniesta

Most similar player: Arthur Melo

Equations:

  • Iniesta + outbox scoring + inbox scoring ~ Kevin De Bruyne

  • Iniesta + inbox scoring ~ Eden Hazard

  • Iniesta + outbox scoring - dribbling ~ Toni Kroos
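One way to read such equations is as vector arithmetic followed by a nearest-neighbour lookup. The sketch below assumes that skill vectors live in the same space as player vectors and that similarity is measured by cosine; neither is stated here. The 3-dimensional vectors and the names `playmaker`, `scoring_playmaker`, `centre_back` and `inbox_scoring` are made up for illustration and are not taken from the real 32-dimensional model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def combine(base, add=(), subtract=()):
    """Return base + sum(add) - sum(subtract), element-wise."""
    result = list(base)
    for vec in add:
        result = [r + x for r, x in zip(result, vec)]
    for vec in subtract:
        result = [r - x for r, x in zip(result, vec)]
    return result


def most_similar(query, vectors, exclude=()):
    """Name of the vector closest to the query by cosine similarity."""
    candidates = {n: v for n, v in vectors.items() if n not in exclude}
    return max(candidates, key=lambda n: cosine(query, candidates[n]))


# Toy, hand-written vectors (hypothetical, for illustration only).
players = {
    "playmaker": [1.0, 0.0, 0.2],
    "scoring_playmaker": [1.0, 1.0, 0.2],
    "centre_back": [0.0, 0.0, 1.0],
}
skills = {"inbox_scoring": [0.0, 1.0, 0.0]}

# "playmaker + inbox scoring ~ ?"
query = combine(players["playmaker"], add=[skills["inbox_scoring"]])
match = most_similar(query, players, exclude={"playmaker"})
print(match)
```

Excluding the base player from the candidates avoids the trivial answer where the query's nearest neighbour is the player it started from.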

Use player variations to better understand the dimensions of the representation vector

Action analogies as a tool to investigate model semantics

Illustrative analogy plot for learning pass direction. B1, B2 and B3 are the actions that best fit the analogy equation A - A’ + B’ = ?. Solid lines represent A or B, while dashed lines represent A’ or B’. Green is used for A and A’, red for B and B’. Arrow length represents the pass distance (short/medium/long). Here, A’ is the same pass as A, but in the opposite direction (left). B’ is the same as A’, from one position behind. B1 and B2 are mirrored passes to B with variations in height and length. B3 is exactly the mirrored pass.
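The analogy query in the caption can be written out explicitly. The formulation below is the usual word-analogy ranking and is an assumption about how the candidates are scored; the symbols $\mathbf{v}_w$ (the embedding of action $w$) and $V$ (the action vocabulary) are introduced only for this sketch.

```latex
B^{*} = \operatorname*{arg\,max}_{w \in V \setminus \{A,\, A',\, B'\}}
        \cos\!\left(\mathbf{v}_{w},\; \mathbf{v}_{A} - \mathbf{v}_{A'} + \mathbf{v}_{B'}\right),
\qquad
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
```

Under this reading, B1, B2 and B3 would be the three highest-scoring actions $w$ rather than a single maximizer.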

Combine players with actions to generate endless local variations of players

Explore the Player2Vec embeddings space

Present &

Use out-of-the-box

Stunning UI

powered by

Streamlit & Plotly

Interact

 
 

Interactive Charts

A Gensim Word2Vec model which allows embedding the semantics of the football language in a 32-dimensional space.

Action2Vec

UMAP projections of the complete Action2Vec vocabulary.

UMAP projections of all players in all matches of the StatsBomb open dataset.

A Gensim Doc2Vec model that produces an embedding of a player within a single match in a 32-dimensional space, based on the actions performed by the player.

PlayerMatch2Vec

Player2Vec is the core of this project. It is, in fact, the average of a player's PlayerMatch2Vec representations.

Player2Vec

Plotly interactive UMAP projection of Player2Vec, where each player’s matches are averaged into a single vector. Players are colored by position.
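The averaging step that turns PlayerMatch2Vec vectors into Player2Vec vectors can be sketched as follows. This assumes a plain unweighted mean over a player's matches, which is one reading of "averaged" and may differ from the project's actual aggregation. The function name `average_player_vectors` and the 2-dimensional toy vectors are hypothetical.

```python
def average_player_vectors(player_match_vectors):
    """Average per-match vectors into one vector per player.

    player_match_vectors: iterable of (player_name, vector) pairs,
    one pair per player per match.
    """
    sums, counts = {}, {}
    for player, vec in player_match_vectors:
        if player not in sums:
            sums[player] = [0.0] * len(vec)
            counts[player] = 0
        sums[player] = [s + x for s, x in zip(sums[player], vec)]
        counts[player] += 1
    return {
        player: [s / counts[player] for s in total]
        for player, total in sums.items()
    }


# Toy input: two matches for one player, one match for another.
toy_vectors = [
    ("player_a", [1.0, 3.0]),
    ("player_a", [3.0, 5.0]),
    ("player_b", [0.0, 2.0]),
]
player2vec = average_player_vectors(toy_vectors)
print(player2vec)
```

A plain mean treats every match equally; weighting by minutes played or by number of actions would be a natural variation, though nothing here indicates which the project uses.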