The Beginner’s Trap

The stages of the analytics process are filled with moments of success and failure. There are moments of instant gratification along the way where a beginner like myself might construe a non-success as a success because of some kind of judgement error.

This project walks through one of those instances and discusses some of the things I need to keep in mind so that I (and others new to analytics) do not make these mistakes.

The dataset is from https://www.kaggle.com/sanjeetsinghnaik/most-expensive-footballers-2021

Since learning is a never-ending process, I will be updating the notebook (or creating a variation of it) as I discover more best practices. So, consider this an exercise in working towards a SOTA (state-of-the-art) POC (proof of concept).

Steps involved:

  1. Introduction
  2. Exploratory Data Analysis
  3. Feature Engineering
  4. Linear Regression
  5. Decision Tree Regression
  6. Validation
  7. Conclusion

1. Introduction

The description of the dataset from Kaggle:

This file consists of data of Top 500 Most Expensive Footballer In 2021. The data is according to the prices listed in transfer market along with data like goals, assists, matches, age, etc.

The question we are trying to answer is:

**Based on different predictors (goals, assists, matches, age, etc.), is it possible to predict the market value of footballers?**


Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(12, 6)})
from matplotlib.ticker import MaxNLocator
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

The dataset is in the file players_data.csv

df = pd.read_csv('players_data.csv')

Let us take a look at the columns. There are a few ways of doing this, but I have always found using list() the easiest.

list(df)
['Name',
 'Position',
 'Age',
 'Markey Value In Millions(£)',
 'Country',
 'Club',
 'Matches',
 'Goals',
 'Own Goals',
 'Assists',
 'Yellow Cards',
 'Second Yellow Cards',
 'Red Cards',
 'Number Of Substitute In',
 'Number Of Substitute Out']
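
As an aside, an equivalent way of getting the same list (a minimal sketch; df.columns.tolist() is standard pandas and returns the same column names):

df.columns.tolist()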

Since we know the dataset is a ranking of the 500 most expensive footballers, we expect 500 rows; from the column list above, we expect 15 columns.

df.shape
(500, 15)
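
If we want to make that expectation explicit, a one-line assertion is enough (a minimal sketch; the message text is just illustrative):

assert df.shape == (500, 15), 'unexpected number of rows or columns'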

Let’s look at the first few rows.

df.head()

Name Position Age Markey Value In Millions(£) Country Club Matches Goals Own Goals Assists Yellow Cards Second Yellow Cards Red Cards Number Of Substitute In Number Of Substitute Out
0 Kylian Mbappé Centre-Forward 22 144.0 France Paris Saint-Germain 16 7 0 11 3 0 0 0 8
1 Erling Haaland Centre-Forward 21 135.0 Norway Borussia Dortmund 10 13 0 4 1 0 0 0 1
2 Harry Kane Centre-Forward 28 108.0 England Tottenham Hotspur 16 7 0 2 2 0 0 2 2
3 Jack Grealish Left Winger 26 90.0 England Manchester City 15 2 0 3 1 0 0 2 8
4 Mohamed Salah Right Winger 29 90.0 Egypt Liverpool FC 15 15 0 6 1 0 0 0 3

Let’s look at the last few rows.

df.tail()

Name Position Age Markey Value In Millions(£) Country Club Matches Goals Own Goals Assists Yellow Cards Second Yellow Cards Red Cards Number Of Substitute In Number Of Substitute Out
495 Giorgian de Arrascaeta Attacking Midfield 27 16.2 Uruguay Clube de Regatas do Flamengo 0 0 0 0 0 0 0 0 0
496 Ayoze Pérez Second Striker 28 16.2 Spain Leicester City 8 1 0 3 0 0 1 2 5
497 Alex Meret Goalkeeper 24 16.2 Italy SSC Napoli 5 0 0 0 0 0 0 0 0
498 Duje Caleta-Car Centre-Back 25 16.2 Croatia Olympique Marseille 8 0 0 0 2 0 0 0 2
499 Aritz Elustondo Centre-Back 27 16.2 Spain Real Sociedad 15 3 0 1 4 0 0 1 1

2. Exploratory Data Analysis

Let’s use pandas’ built-in describe() to generate descriptive statistics.

df.describe()

Age Markey Value In Millions(£) Matches Goals Own Goals Assists Yellow Cards Second Yellow Cards Red Cards Number Of Substitute In Number Of Substitute Out
count 500.000000 500.000000 500.000000 500.000000 500.000000 500.00000 500.000000 500.000000 500.000000 500.000000 500.000000
mean 24.968000 31.537800 12.396000 2.160000 0.030000 1.51200 1.592000 0.036000 0.046000 2.394000 3.744000
std 3.165916 17.577697 4.342453 2.880102 0.170758 1.85276 1.445585 0.186477 0.209695 2.517825 3.293046
min 16.000000 16.200000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 23.000000 19.800000 10.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 1.000000
50% 25.000000 25.200000 13.000000 1.000000 0.000000 1.00000 1.000000 0.000000 0.000000 2.000000 3.000000
75% 27.000000 36.000000 16.000000 3.000000 0.000000 2.00000 2.000000 0.000000 0.000000 3.250000 6.000000
max 36.000000 144.000000 24.000000 23.000000 1.000000 12.00000 7.000000 1.000000 1.000000 13.000000 20.000000

Let us look at the data type of each column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 15 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Name                         500 non-null    object 
 1   Position                     500 non-null    object 
 2   Age                          500 non-null    int64  
 3   Markey Value In Millions(£)  500 non-null    float64
 4   Country                      500 non-null    object 
 5   Club                         500 non-null    object 
 6   Matches                      500 non-null    int64  
 7   Goals                        500 non-null    int64  
 8   Own Goals                    500 non-null    int64  
 9   Assists                      500 non-null    int64  
 10  Yellow Cards                 500 non-null    int64  
 11  Second Yellow Cards          500 non-null    int64  
 12  Red Cards                    500 non-null    int64  
 13  Number Of Substitute In      500 non-null    int64  
 14  Number Of Substitute Out     500 non-null    int64  
dtypes: float64(1), int64(10), object(4)
memory usage: 58.7+ KB

Count of missing values

df.isna().sum()
Name                           0
Position                       0
Age                            0
Markey Value In Millions(£)    0
Country                        0
Club                           0
Matches                        0
Goals                          0
Own Goals                      0
Assists                        0
Yellow Cards                   0
Second Yellow Cards            0
Red Cards                      0
Number Of Substitute In        0
Number Of Substitute Out       0
dtype: int64
df.isna().sum().sum()
0

We do not have any values missing from the dataset.

df.groupby('Club').size().sort_values(ascending=False).head(10)
Club
Manchester United      19
Manchester City        18
Chelsea FC             16
Tottenham Hotspur      16
Real Madrid            16
Paris Saint-Germain    16
RB Leipzig             15
Arsenal FC             15
Liverpool FC           15
Atlético de Madrid     15
dtype: int64
ax = sns.countplot(x='Club', data=df, order=df.Club.value_counts().iloc[:10].index) 
ax.tick_params(axis='x', rotation=45)
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.set(xlabel='Club', ylabel='Count of players')
plt.title("Top 10 Clubs based on player representation (count)")
plt.show()

[Figure: Top 10 Clubs based on player representation (count)]

club_grouped = df.groupby(['Club'])[['Markey Value In Millions(£)']].sum().sort_values(by='Markey Value In Millions(£)', ascending=False).head(10)
club_grouped

Markey Value In Millions(£)
Club
Manchester City 940.5
Paris Saint-Germain 775.8
Manchester United 760.5
Chelsea FC 709.2
Bayern Munich 685.8
Liverpool FC 681.3
Atlético de Madrid 616.5
Real Madrid 594.0
Tottenham Hotspur 536.4
Juventus FC 450.0
ax = sns.barplot(x=club_grouped.index, y=club_grouped['Markey Value In Millions(£)']) 
ax.tick_params(axis='x', rotation=45)
ax.set(xlabel='Club', ylabel='Markey Value In Millions(£)')
plt.title("Top 10 Clubs based on total market value")
plt.show()

[Figure: Top 10 Clubs based on total market value]

df.groupby('Country').size().sort_values(ascending=False).head(10)
Country
England        67
France         58
Spain          52
Brazil         41
Germany        29
Portugal       26
Italy          26
Argentina      22
Netherlands    17
Belgium        14
dtype: int64
ax = sns.countplot(x='Country', data=df, order=df.Country.value_counts().iloc[:10].index) 
ax.tick_params(axis='x', rotation=45)
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.set(xlabel='Country', ylabel='Count of players')
plt.title("Top 10 Country based on player representation (count)")
plt.show()

[Figure: Top 10 Countries based on player representation (count)]

country_grouped = df.groupby(['Country'])[['Markey Value In Millions(£)']].sum().sort_values(by='Markey Value In Millions(£)', ascending=False).head(10)
country_grouped

Markey Value In Millions(£)
Country
England 2248.2
France 1895.4
Spain 1565.1
Brazil 1275.3
Germany 1005.3
Portugal 890.1
Italy 854.1
Argentina 650.7
Netherlands 571.5
Belgium 522.9
ax = sns.barplot(x=country_grouped.index, y=country_grouped['Markey Value In Millions(£)']) 
ax.tick_params(axis='x', rotation=45)
ax.set(xlabel='Country', ylabel='Markey Value In Millions(£)')
plt.title("Top 10 Country based on total market value")
plt.show()

[Figure: Top 10 Countries based on total market value]

df.groupby('Position').size().sort_values(ascending=False)
Position
Centre-Back           87
Central Midfield      74
Centre-Forward        70
Right Winger          48
Left Winger           46
Attacking Midfield    41
Defensive Midfield    41
Right-Back            30
Left-Back             23
Goalkeeper            19
Left Midfield          8
Second Striker         8
Right Midfield         5
dtype: int64
ax = sns.countplot(x='Position', data=df, order=df.Position.value_counts().iloc[:10].index) 
ax.tick_params(axis='x', rotation=45)
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.set(xlabel='Position', ylabel='Count of players')
plt.title("Top 10 Position based on player representation (count)")
plt.show()

[Figure: Top 10 Positions based on player representation (count)]

position_grouped = df.groupby(['Position'])[['Markey Value In Millions(£)']].sum().sort_values(by='Markey Value In Millions(£)', ascending=False).head(10)
position_grouped

Markey Value In Millions(£)
Position
Centre-Back 2583.9
Central Midfield 2421.9
Centre-Forward 2369.7
Left Winger 1647.0
Right Winger 1461.6
Attacking Midfield 1332.0
Defensive Midfield 1275.3
Right-Back 784.8
Left-Back 704.7
Goalkeeper 585.9
ax = sns.barplot(x=position_grouped.index, y=position_grouped['Markey Value In Millions(£)']) 
ax.tick_params(axis='x', rotation=45)
ax.set(xlabel='Position', ylabel='Markey Value In Millions(£)')
plt.title("Top 10 Position based on total market value")
plt.show()

[Figure: Top 10 Positions based on total market value]

3. Feature Engineering

Since we only have 500 rows of data spread across 13 different positions, let us aggregate some of these positions together.

I created a simple position mapping and then inverted it so that it could be used with map().

position_mapping = {
"forward" : ['Centre-Forward', 'Second Striker'], 
"midfield" : ['Central Midfield', 'Attacking Midfield', 'Defensive Midfield', 'Left Midfield', 'Right Midfield', 'Right Winger', 'Left Winger'],
"back" : ['Centre-Back', 'Right-Back', 'Left-Back']
}

The only position we have not mapped here is Goalkeeper, which we will remove later on.

mapping_dict = {}
for key in position_mapping:
    for item in position_mapping[key]:
        mapping_dict[item] = key
mapping_dict
{'Centre-Forward': 'forward',
 'Second Striker': 'forward',
 'Central Midfield': 'midfield',
 'Attacking Midfield': 'midfield',
 'Defensive Midfield': 'midfield',
 'Left Midfield': 'midfield',
 'Right Midfield': 'midfield',
 'Right Winger': 'midfield',
 'Left Winger': 'midfield',
 'Centre-Back': 'back',
 'Right-Back': 'back',
 'Left-Back': 'back'}

I could have just typed in the mapping_dict directly but this felt more natural.
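
For reference, the same inversion can also be written as a single dict comprehension (an equivalent sketch, not used in the rest of the notebook):

mapping_dict = {position: group for group, positions in position_mapping.items() for position in positions}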

df['Position_Agg'] = df.Position.map(mapping_dict) 
df

Name Position Age Markey Value In Millions(£) Country Club Matches Goals Own Goals Assists Yellow Cards Second Yellow Cards Red Cards Number Of Substitute In Number Of Substitute Out Position_Agg
0 Kylian Mbappé Centre-Forward 22 144.0 France Paris Saint-Germain 16 7 0 11 3 0 0 0 8 forward
1 Erling Haaland Centre-Forward 21 135.0 Norway Borussia Dortmund 10 13 0 4 1 0 0 0 1 forward
2 Harry Kane Centre-Forward 28 108.0 England Tottenham Hotspur 16 7 0 2 2 0 0 2 2 forward
3 Jack Grealish Left Winger 26 90.0 England Manchester City 15 2 0 3 1 0 0 2 8 midfield
4 Mohamed Salah Right Winger 29 90.0 Egypt Liverpool FC 15 15 0 6 1 0 0 0 3 midfield
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
495 Giorgian de Arrascaeta Attacking Midfield 27 16.2 Uruguay Clube de Regatas do Flamengo 0 0 0 0 0 0 0 0 0 midfield
496 Ayoze Pérez Second Striker 28 16.2 Spain Leicester City 8 1 0 3 0 0 1 2 5 forward
497 Alex Meret Goalkeeper 24 16.2 Italy SSC Napoli 5 0 0 0 0 0 0 0 0 NaN
498 Duje Caleta-Car Centre-Back 25 16.2 Croatia Olympique Marseille 8 0 0 0 2 0 0 0 2 back
499 Aritz Elustondo Centre-Back 27 16.2 Spain Real Sociedad 15 3 0 1 4 0 0 1 1 back

500 rows × 16 columns

Now we can remove the goalkeepers (their Position_Agg is NaN because they have no mapping).

df.dropna(inplace=True)
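
As a quick sanity check (a sketch; the expected numbers follow from the position counts above: 500 players minus 19 goalkeepers, plus the new column):

df.shape  # expect (481, 16)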

Let’s create similar charts for the new column as well

df.groupby('Position_Agg').size().sort_values(ascending=False)
Position_Agg
midfield    263
back        140
forward      78
dtype: int64
ax = sns.countplot(x='Position_Agg', data=df, order=df.Position_Agg.value_counts().iloc[:10].index) 
ax.tick_params(axis='x', rotation=45)
ax.yaxis.set_major_locator(MaxNLocator(integer=True))
ax.set(xlabel='Position', ylabel='Count of players')
plt.title("Top 10 Position based on player representation")
plt.show()

[Figure: Aggregated positions based on player representation]

position_agg_grouped = df.groupby(['Position_Agg'])[['Markey Value In Millions(£)']].sum().sort_values(by='Markey Value In Millions(£)', ascending=False).head(10)
position_agg_grouped

Markey Value In Millions(£)
Position_Agg
midfield 8460.0
back 4073.4
forward 2649.6
ax = sns.barplot(x=position_agg_grouped.index, y=position_agg_grouped['Markey Value In Millions(£)']) 
ax.tick_params(axis='x', rotation=45)
ax.set(xlabel='Position', ylabel='Markey Value In Millions(£)')
plt.title("Top 10 Position based on total market value")
plt.show()

[Figure: Aggregated positions based on total market value]

4. Linear Regression

First of all, let us focus only on the midfield players. Midfield is the largest group of players we have, and it might make more sense to fit a regression within the same group of players.

df_mid = df[df.Position_Agg=='midfield']
df_mid.reset_index(drop=True, inplace=True)
list(df_mid)
['Name',
 'Position',
 'Age',
 'Markey Value In Millions(£)',
 'Country',
 'Club',
 'Matches',
 'Goals',
 'Own Goals',
 'Assists',
 'Yellow Cards',
 'Second Yellow Cards',
 'Red Cards',
 'Number Of Substitute In',
 'Number Of Substitute Out',
 'Position_Agg']
df_mid = df_mid[['Age', 'Goals', 'Assists', 'Markey Value In Millions(£)']]
df_mid = df_mid.apply(lambda x: x/x.max(), axis=0)
df_mid

Age Goals Assists Markey Value In Millions(£)
0 0.764706 0.133333 0.3 1.00
1 0.852941 1.000000 0.6 1.00
2 0.882353 0.200000 0.1 1.00
3 0.852941 0.200000 0.3 1.00
4 0.617647 0.000000 0.0 0.90
... ... ... ... ...
258 0.676471 0.066667 0.1 0.18
259 0.764706 0.000000 0.0 0.18
260 0.823529 0.000000 0.1 0.18
261 0.764706 0.200000 0.1 0.18
262 0.794118 0.000000 0.0 0.18

263 rows × 4 columns
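
The apply above simply divides each column by its own maximum, scaling the predictors and the target to the range [0, 1]. An equivalent, arguably more explicit form (a sketch, shown only for comparison and not meant to be run on the already-scaled frame) relies on pandas aligning the column-wise maxima:

df_mid / df_mid.max()  # produces the same scaled frame as the apply above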

X = df_mid.iloc[:, :-1]
y = df_mid.iloc[:, -1]
linear_regressor = LinearRegression().fit(X, y)
linear_regressor.score(X, y)
0.06775382996413992
linear_regressor.predict(X)
array([0.37728587, 0.61138884, 0.3870485 , 0.40721923, 0.28447775,
       0.326131  , 0.4879743 , 0.40387802, 0.36451519, 0.34205403,
       0.4380145 , 0.42209148, 0.26836585, 0.40466779, 0.3113309 ,
       0.39576034, 0.35685413, 0.39922502, 0.36962481, 0.31978173,
       0.3962891 , 0.36759539, 0.3706756 , 0.41769948, 0.37093661,
       0.47298534, 0.36451519, 0.34021348, 0.45451087, 0.33229141,
       0.30798969, 0.39103519, 0.34179302, 0.30058964, 0.34866431,
       0.5442388 , 0.41146694, 0.34021348, 0.31415009, 0.32823257,
       0.38036608, 0.34637388, 0.34663489, 0.3274428 , 0.37473444,
       0.34716365, 0.33871282, 0.49826569, 0.30904047, 0.32718179,
       0.30798969, 0.30058964, 0.38370729, 0.45300347, 0.54202051,
       0.32823257, 0.38573671, 0.408009  , 0.39110734, 0.45484402,
       0.39549933, 0.38167787, 0.39437641, 0.31415009, 0.42385989,
       0.35174451, 0.3144111 , 0.4689265 , 0.32921121, 0.39523832,
       0.33255242, 0.37296602, 0.31644052, 0.3422429 , 0.33589363,
       0.36451519, 0.35174451, 0.36680562, 0.32410158, 0.37525645,
       0.40616845, 0.39136835, 0.32181115, 0.36248577, 0.28447775,
       0.32076037, 0.42183047, 0.31106989, 0.31212067, 0.31952073,
       0.36222476, 0.38986769, 0.33995247, 0.30596027, 0.36346441,
       0.32718179, 0.30058964, 0.36019534, 0.35711514, 0.3484033 ,
       0.4241209 , 0.30032863, 0.41567006, 0.36969696, 0.32384057,
       0.36556597, 0.37525645, 0.36988582, 0.31670153, 0.40669721,
       0.31741916, 0.36444305, 0.34401131, 0.33890169, 0.40590744,
       0.31336032, 0.36654461, 0.32947222, 0.31336032, 0.34637388,
       0.36248577, 0.30596027, 0.35711514, 0.35711514, 0.32410158,
       0.36346441, 0.40079782, 0.38265651, 0.34761353, 0.37859767,
       0.35200552, 0.29829921, 0.39294789, 0.30261906, 0.30058964,
       0.3144111 , 0.38573671, 0.35050486, 0.40151545, 0.34100325,
       0.31336032, 0.37735802, 0.27373649, 0.3144111 , 0.330523  ,
       0.27576591, 0.26836585, 0.29521901, 0.41638769, 0.4781671 ,
       0.40571857, 0.43238286, 0.31873095, 0.39726774, 0.34637388,
       0.42288125, 0.3113309 , 0.42175832, 0.32286194, 0.35050486,
       0.32286194, 0.30596027, 0.43257173, 0.30058964, 0.45052417,
       0.35174451, 0.39313676, 0.38802714, 0.34126426, 0.40079782,
       0.3678564 , 0.40721923, 0.34558411, 0.34558411, 0.36222476,
       0.3366834 , 0.47095592, 0.38900577, 0.38960668, 0.3706756 ,
       0.4166487 , 0.34663489, 0.30569926, 0.29724843, 0.35482471,
       0.36962481, 0.48876408, 0.29829921, 0.33255242, 0.3422429 ,
       0.28984838, 0.30904047, 0.34408345, 0.40590744, 0.30904047,
       0.31336032, 0.34637388, 0.33589363, 0.31670153, 0.36248577,
       0.34558411, 0.3274428 , 0.33897383, 0.3854757 , 0.47167354,
       0.28650717, 0.28447775, 0.3484033 , 0.29521901, 0.30261906,
       0.30366984, 0.35685413, 0.32384057, 0.32718179, 0.40676935,
       0.38678749, 0.33995247, 0.28984838, 0.3144111 , 0.29521901,
       0.30596027, 0.31873095, 0.31336032, 0.3113309 , 0.30798969,
       0.31873095, 0.33484285, 0.33890169, 0.35174451, 0.39005656,
       0.37191524, 0.31670153, 0.41893913, 0.3113309 , 0.40827001,
       0.40086996, 0.34021348, 0.35940557, 0.37525645, 0.34126426,
       0.38494693, 0.39136835, 0.41488029, 0.35606436, 0.31415009,
       0.36943595, 0.31846994, 0.38141686, 0.36864618, 0.28447775,
       0.31538974, 0.29521901, 0.36969696, 0.30904047, 0.330523  ,
       0.39850739, 0.31336032, 0.30596027, 0.32181115, 0.3113309 ,
       0.33484285, 0.36556597, 0.31670153])
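
Worth pausing on what score returns: for scikit-learn regressors it is the coefficient of determination R², computed here on the very same data used for fitting. A minimal sketch of the equivalent manual calculation (assuming the X and y defined above):

y_pred = linear_regressor.predict(X)
ss_res = ((y - y_pred) ** 2).sum()    # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()  # total sum of squares
print(1 - ss_res / ss_tot)            # matches linear_regressor.score(X, y)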

5. Decision Tree Regressor

decision_tree_regressor = DecisionTreeRegressor(random_state=42)
decision_tree_regressor.fit(X, y)
DecisionTreeRegressor(random_state=42)
decision_tree_regressor.predict(X)
array([1.        , 1.        , 1.        , 0.625     , 0.4075    ,
       0.9       , 0.9       , 0.9       , 0.66666667, 0.85      ,
       0.85      , 0.85      , 0.525     , 0.8       , 0.326     ,
       0.8       , 0.475     , 0.7       , 0.46      , 0.7       ,
       0.7       , 0.7       , 0.465     , 0.7       , 0.7       ,
       0.7       , 0.66666667, 0.5       , 0.65      , 0.65      ,
       0.45      , 0.65      , 0.6       , 0.396     , 0.6       ,
       0.6       , 0.6       , 0.5       , 0.38333333, 0.5       ,
       0.55      , 0.33      , 0.39      , 0.385     , 0.5       ,
       0.5       , 0.5       , 0.5       , 0.28      , 0.35      ,
       0.45      , 0.396     , 0.5       , 0.48      , 0.47      ,
       0.5       , 0.365     , 0.45      , 0.45      , 0.45      ,
       0.45      , 0.45      , 0.45      , 0.38333333, 0.42      ,
       0.3175    , 0.29      , 0.42      , 0.4       , 0.4       ,
       0.31      , 0.4       , 0.4       , 0.31      , 0.31      ,
       0.66666667, 0.3175    , 0.4       , 0.35      , 0.30666667,
       0.4       , 0.3       , 0.28      , 0.3       , 0.4075    ,
       0.35      , 0.35      , 0.35      , 0.35      , 0.35      ,
       0.295     , 0.35      , 0.275     , 0.256     , 0.325     ,
       0.35      , 0.396     , 0.35      , 0.31666667, 0.275     ,
       0.35      , 0.33      , 0.33      , 0.25      , 0.26      ,
       0.25      , 0.30666667, 0.32      , 0.23      , 0.3       ,
       0.3       , 0.3       , 0.3       , 0.25      , 0.26      ,
       0.245     , 0.3       , 0.3       , 0.245     , 0.33      ,
       0.3       , 0.256     , 0.31666667, 0.31666667, 0.35      ,
       0.325     , 0.275     , 0.3       , 0.3       , 0.3       ,
       0.3       , 0.25      , 0.28      , 0.24      , 0.396     ,
       0.29      , 0.365     , 0.26      , 0.27      , 0.27      ,
       0.245     , 0.27      , 0.26      , 0.29      , 0.22      ,
       0.25      , 0.525     , 0.2075    , 0.25      , 0.25      ,
       0.25      , 0.25      , 0.21666667, 0.25      , 0.33      ,
       0.25      , 0.326     , 0.25      , 0.25      , 0.26      ,
       0.25      , 0.256     , 0.25      , 0.396     , 0.25      ,
       0.3175    , 0.25      , 0.25      , 0.225     , 0.275     ,
       0.25      , 0.625     , 0.24      , 0.24      , 0.295     ,
       0.24      , 0.24      , 0.24      , 0.23      , 0.465     ,
       0.23      , 0.39      , 0.22      , 0.22      , 0.22      ,
       0.46      , 0.22      , 0.25      , 0.31      , 0.31      ,
       0.21      , 0.28      , 0.22      , 0.26      , 0.28      ,
       0.245     , 0.33      , 0.31      , 0.23      , 0.3       ,
       0.24      , 0.385     , 0.21      , 0.2       , 0.2       ,
       0.2       , 0.4075    , 0.275     , 0.2075    , 0.24      ,
       0.2       , 0.475     , 0.26      , 0.35      , 0.2       ,
       0.2       , 0.275     , 0.21      , 0.29      , 0.2075    ,
       0.256     , 0.21666667, 0.245     , 0.326     , 0.45      ,
       0.21666667, 0.19      , 0.25      , 0.3175    , 0.2       ,
       0.2       , 0.23      , 0.2       , 0.326     , 0.2       ,
       0.2       , 0.5       , 0.2       , 0.30666667, 0.225     ,
       0.2       , 0.3       , 0.2       , 0.19      , 0.38333333,
       0.18      , 0.18      , 0.18      , 0.18      , 0.4075    ,
       0.18      , 0.2075    , 0.25      , 0.28      , 0.22      ,
       0.18      , 0.245     , 0.256     , 0.28      , 0.326     ,
       0.19      , 0.25      , 0.23      ])
decision_tree_regressor.score(X, y)
0.7410570792901311
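
The much higher training score is expected: an unconstrained decision tree can keep splitting until it nearly memorizes the training data. A hedged sketch of how the allowed depth drives the training fit (max_depth is a standard scikit-learn parameter; the exact numbers depend on the data):

for depth in (2, 4, None):  # None means unlimited depth
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X, y)
    print(depth, tree.score(X, y))  # training R² increases as the tree is allowed to grow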

6. Validation

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
decision_tree_regressor = DecisionTreeRegressor(random_state=42)
decision_tree_regressor.fit(X_train, y_train)
DecisionTreeRegressor(random_state=42)
decision_tree_regressor.score(X_train, y_train)
0.7325965837309238
decision_tree_regressor.score(X_test, y_test)
-1.0994712392955237
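
Since cross_val_score was imported at the top but never used, here is a sketch of how k-fold cross-validation would give a more robust estimate than a single split (the default scoring for regressors is R²; consistently low or negative fold scores would again flag the overfitting seen above):

cv_scores = cross_val_score(DecisionTreeRegressor(random_state=42), X, y, cv=5)
print(cv_scores)         # one score per fold
print(cv_scores.mean())  # average estimate of generalization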

7. Conclusion

The Trap

The trap here is using the same data for training and evaluation. As a beginner, it is common to look at a few models (here we looked at Linear Regression and a Decision Tree Regressor) and to conclude that one is better than the other because the latter had a good coefficient of determination (R²) score.

However, after validating on held-out data, we realize that the Decision Tree Regressor is up to no good.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.score

The score is negative, and based on the documentation above:

The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse)

This is just a very basic example of a misjudgement on my part. As a peer reviewer, you might feel this is something very trivial. I have demonstrated only one example here; there are plenty of these in analytics.