"If football were a formal language, it would be the most popular in the world..."
In this work, I combine two of my biggest passions, football and data science, and try to model the language we love and use every day: the language of football. Leveraging state-of-the-art algorithms and StatsBomb data, I use the model to describe actions and players and to understand their place in the football semantic space. It is a first step in ongoing research into modeling this world. This whole process was pure fun, and I hope you’ll enjoy reading it.
It starts with football event data
made openly available by StatsBomb
Each event is a row within a DataFrame
> 3M Events
~ 900 Matches
Events are interpreted
& encoded into words
< Pass >
diagonal | on-ground | medium length
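To make the encoding step concrete, here is a minimal sketch of how a single pass could be turned into a word. It is not the project's actual encoder: the function name `encode_pass`, the token format, and every angle and length threshold are assumptions chosen for illustration. The only external facts it relies on are StatsBomb's 120 x 80 pitch coordinates and its pass height labels.

```python
import math

# Hypothetical mapping from StatsBomb pass-height labels to vocabulary words.
HEIGHT_WORDS = {"Ground Pass": "on-ground", "Low Pass": "low", "High Pass": "high"}


def encode_pass(start, end, height):
    """Encode one pass as a word, e.g. '<pass>|diagonal|on-ground|medium'.

    start, end: (x, y) pitch coordinates, attacking direction along +x.
    height: a pass-height label such as "Ground Pass".
    All thresholds below are illustrative assumptions, not the project's values.
    """
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    length = math.hypot(dx, dy)

    # Angle between the pass and the attacking direction, folded into 0..180.
    angle = abs(math.degrees(math.atan2(dy, dx)))
    if angle <= 20:
        direction = "forward"
    elif angle < 70:
        direction = "diagonal"
    elif angle <= 110:
        direction = "lateral"
    else:
        direction = "backward"

    if length < 15:
        length_word = "short"
    elif length < 30:
        length_word = "medium"
    else:
        length_word = "long"

    height_word = HEIGHT_WORDS.get(height, "unknown")
    return f"<pass>|{direction}|{height_word}|{length_word}"


print(encode_pass((40, 40), (55, 55), "Ground Pass"))
```

Bucketing continuous values such as angle and distance keeps the vocabulary small enough for each word to appear many times, which is what a word-embedding model needs to learn a useful vector.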
Words are encoded & fed into models
Powerful Natural Language Processing Models
Pass to back
Cross to box
Virgil Van Dijk
Transforming each action to a 32-dim vector, based on football semantics
Transforming each player to a 32-dim vector, based on his actions during games
Evaluate, interpret, explain
Understand players through player similarities & equations
Selected player: Andres Iniesta
Most similar player: Arthur Melo
Iniesta + outbox scoring + inbox scoring ~ Kevin De Bruyne
Iniesta + inbox scoring ~ Eden Hazard
Iniesta + outbox scoring ~ Toni Kroos
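A player equation of this kind can be sketched as plain vector arithmetic followed by a nearest-neighbor lookup. The snippet below is a toy illustration, not the project's code: the 3-dimensional vectors, the labels (`playmaker`, `box_forward`, `long_shooter`) and the helper names are all hypothetical, and it assumes player and action vectors live in a comparable space and that cosine similarity is the ranking metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def add(*vectors):
    """Element-wise sum of any number of vectors."""
    return [sum(components) for components in zip(*vectors)]


def most_similar(query, candidates, exclude=()):
    """Name of the candidate whose vector is closest to `query` by cosine."""
    scored = [
        (name, cosine(query, vec))
        for name, vec in candidates.items()
        if name not in exclude
    ]
    return max(scored, key=lambda pair: pair[1])[0]


# Toy 3-dim stand-ins for the real 32-dim embeddings (hypothetical values).
players = {
    "playmaker": [1.0, 0.0, 0.0],
    "box_forward": [1.0, 1.0, 0.0],
    "long_shooter": [1.0, 0.0, 1.0],
}
actions = {
    "inbox_scoring": [0.0, 1.0, 0.0],
    "outbox_scoring": [0.0, 0.0, 1.0],
}

# "playmaker + inbox scoring ~ ?"
query = add(players["playmaker"], actions["inbox_scoring"])
print(most_similar(query, players, exclude={"playmaker"}))
```

Excluding the selected player from the candidates avoids the trivial answer where the closest vector is the player the equation started from.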
Use player variations to better understand the dimensions of the representation vector
Action analogies as a tool to investigate model semantics
Illustrative analogy plot for learning pass direction. B1/2/3 are the best actions to fit the analogy equation: A - A’ + B’ = ?. Solid lines represent A or B, while dashed lines represent A’ or B’. Green is used for A and A’, red for B and B’. The pass distance (short/med/long) is represented by the arrow length. Here, A’ is the same pass as A, but with the opposite direction (left). B’ is the same as A’ from one position behind. B1/2 are mirrored passes to B with variations of height and length. B3 is exactly the mirrored pass.
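One common way to formalize such an analogy query, given here as an assumption rather than as the project's exact scoring rule, is to rank every action in the vocabulary V by its cosine similarity to the combined vector. Here e_x denotes the embedding of action x, and leaving the three input actions out of the candidate set is a convention, not something the text above states:

```latex
\hat{B} \;=\; \operatorname*{arg\,max}_{v \,\in\, V \setminus \{A,\, A',\, B'\}}
\cos\!\left(\mathbf{e}_v,\; \mathbf{e}_{A} - \mathbf{e}_{A'} + \mathbf{e}_{B'}\right)
```

Under this reading, B1, B2 and B3 are the three highest-scoring candidates.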
Combine players with actions to generate endless local variations of players
Explore the Player2Vec embeddings space
Streamlit & Plotly
A Gensim Word2Vec model which allows embedding the semantics of the football language in a 32-dimensional space.
UMAP projections of the complete Action2Vec vocabulary.
UMAP projections of all players across all matches in the StatsBomb open dataset.
A Gensim Doc2Vec model that produces an embedding of each player within a single match in a 32-dimensional space, based on the actions the player performed.
Player2Vec is the core of this project. It is, in fact, the average of a player's PlayerMatch2Vec representations.
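The averaging step can be sketched in a few lines. This is a minimal illustration that assumes a plain, unweighted mean over matches; the function name `average_player_vectors`, the `(player, match_id)` keying and the toy 2-dimensional vectors are hypothetical, and the real vectors have 32 dimensions.

```python
from collections import defaultdict


def average_player_vectors(player_match_vectors):
    """Collapse per-match player vectors into one vector per player.

    player_match_vectors: dict mapping (player, match_id) -> list of floats.
    Returns a dict mapping player -> element-wise mean over that player's matches.
    An unweighted mean is an assumption; weighting by minutes played is not modeled.
    """
    sums = {}
    counts = defaultdict(int)
    for (player, _match_id), vec in player_match_vectors.items():
        if player not in sums:
            sums[player] = list(vec)
        else:
            sums[player] = [s + x for s, x in zip(sums[player], vec)]
        counts[player] += 1
    return {
        player: [s / counts[player] for s in total]
        for player, total in sums.items()
    }


# Toy 2-dim vectors standing in for 32-dim PlayerMatch2Vec outputs.
toy_vectors = {
    ("player_a", 1): [1.0, 3.0],
    ("player_a", 2): [3.0, 5.0],
    ("player_b", 1): [0.0, 2.0],
}
player2vec = average_player_vectors(toy_vectors)
print(player2vec["player_a"])
```

Averaging smooths out match-specific noise, at the cost of hiding how a player's role changes from one match to another.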
Interactive Plotly UMAP projection of Player2Vec, where each player’s matches are averaged into a single vector. Players are colored by position.