
I'm learning Python and have picked a small private football data project for it. I want to pull the data for the past 4 seasons, which works with the code below. Now I want to filter out, for each league, the teams that were not in that league for all 4 seasons (promoted and relegated teams should disappear). I have no idea how to do this, because the filter has to work per league: it must iterate over the seasons of each league separately, not over all seasons of all leagues together.

import pandas as pd
import numpy as np

# leagues for England. E0 is Premier League, E1 is Championship and so on...
leagues = ["E0", "E1", "E2", "E3", "EC"]
seasons = ["2223", "2122", "2021", "1920"]
baseUrl = "https://www.football-data.co.uk/mmz4281/"

urls = []

for league in leagues:
    for season in seasons:
        url = f"{baseUrl}{season}/{league}.csv"
        urls.append(url)

# load the data.

column_names = ["Div", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]

dfs = [pd.read_csv(url, encoding='cp1252', usecols=column_names)
       for url in urls]
df = pd.concat(dfs, ignore_index=True)
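
One thing worth noting: the selected columns do not say which season a row came from, and a per-season filter needs that. Below is a minimal sketch of one way to record it while loading. The `Season` column and the helper names `build_url` and `load_season` are assumptions for illustration, not part of the code above, and the demonstration reads an in-memory CSV rather than downloading anything.

```python
import io
import pandas as pd

BASE_URL = "https://www.football-data.co.uk/mmz4281/"
COLUMNS = ["Div", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]


def build_url(season, league):
    """Build the CSV URL for one season/league pair (hypothetical helper)."""
    return f"{BASE_URL}{season}/{league}.csv"


def load_season(source, season):
    """Read one CSV and tag every row with its season code.

    'Season' is an assumed extra column, added here so that later
    filtering can tell the seasons apart.
    """
    frame = pd.read_csv(source, encoding="cp1252", usecols=COLUMNS)
    return frame.assign(Season=season)


# Offline demonstration with an in-memory CSV instead of a download.
sample_csv = io.BytesIO(
    b"Div,HomeTeam,AwayTeam,FTHG,FTAG,FTR\n"
    b"E0,Arsenal,Chelsea,2,1,H\n"
)
sample = load_season(sample_csv, "2223")
print(sample)
```

With the real files, the combined frame could then be built as `pd.concat([load_season(build_url(s, l), s) for l in leagues for s in seasons], ignore_index=True)`.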

For example: if a team is relegated from E0 to E1 in season 2021, it will not show up in E0 in season 2122. In that case, every row in all 4 seasons of E0 in which this team appears should be deleted, because I want cleaned data without promotion/relegation.

How can I implement this?

  • Please explain better and post your data and expected output. – gtomer Nov 12 '22 at 20:08
  • The data is posted (code above) and the expected output is described below the code (example). – Robin Reiche Nov 12 '22 at 20:13
  • I can't see the expected output. – gtomer Nov 12 '22 at 20:14
  • Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Nov 12 '22 at 20:15

1 Answer


Your code is almost there. You only need to add a small for-loop that filters out the teams which played in more than one division:

print(df.shape)
# (8264, 6)
for team in df.HomeTeam.unique():
    # Divisions in which this team played at least one home match
    played_divs = df[df.HomeTeam == team].Div.unique()
    if len(played_divs) > 1:
        # Drop every match involving the team, home or away
        df = df[(df.HomeTeam != team) & (df.AwayTeam != team)]
print(df.shape)
# (2948, 6) (5316 rows were filtered for me)
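
Note that the loop above detects teams seen in more than one division. A team that drops out of the lowest league in the data, or enters it from below, only ever appears in one division and would likely not be caught. A possible alternative is to count seasons per league directly. This is only a sketch on toy data: it assumes every row carries a `Season` column (which the original frame does not have) and that every division covers the same set of seasons. The function name `keep_ever_present_teams` is made up for illustration.

```python
import pandas as pd


def keep_ever_present_teams(df):
    """Keep matches whose two teams appear in every season of that division.

    Assumes a 'Season' column and the same seasons for every division.
    """
    n_seasons = df["Season"].nunique()

    # One row per (division, season, team), whether home or away.
    appearances = pd.concat([
        df[["Div", "Season", "HomeTeam"]].rename(columns={"HomeTeam": "Team"}),
        df[["Div", "Season", "AwayTeam"]].rename(columns={"AwayTeam": "Team"}),
    ]).drop_duplicates()

    # Number of distinct seasons each team spent in each division.
    seasons_played = appearances.groupby(["Div", "Team"])["Season"].nunique()
    ever_present = seasons_played[seasons_played == n_seasons].reset_index()
    ever_present = ever_present[["Div", "Team"]]

    # Inner merges keep only rows where home AND away team are ever-present.
    kept = df.merge(ever_present.rename(columns={"Team": "HomeTeam"}),
                    on=["Div", "HomeTeam"])
    kept = kept.merge(ever_present.rename(columns={"Team": "AwayTeam"}),
                      on=["Div", "AwayTeam"])
    return kept


# Toy data: C is relegated from E0 after 2122, D is promoted from E1.
toy = pd.DataFrame(
    [
        ("E0", "2122", "A", "B"), ("E0", "2122", "B", "C"), ("E0", "2122", "C", "A"),
        ("E0", "2223", "A", "B"), ("E0", "2223", "B", "D"), ("E0", "2223", "D", "A"),
        ("E1", "2122", "X", "Y"), ("E1", "2122", "X", "D"), ("E1", "2122", "D", "Y"),
        ("E1", "2223", "X", "Y"), ("E1", "2223", "C", "X"), ("E1", "2223", "Y", "C"),
    ],
    columns=["Div", "Season", "HomeTeam", "AwayTeam"],
)

result = keep_ever_present_teams(toy)
print(result)
```

The function returns a new frame instead of reassigning `df`, so running it a second time on the original data gives the same result.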
C-3PO
  • Thanks for your effort! It goes in the right direction, but the code still does not work as it should. The loop filtered only 147 rows out of a total of 8264. There are 3 relegations every season. With 38 games per team in an E0 season, 114 rows should be filtered out per season; with 4 seasons of E0, that alone should be 456 rows. On top of that comes the same logic for the other 4 leagues. – Robin Reiche Nov 13 '22 at 21:17
  • I just ran the code. For me, it filtered from `8264` rows down to `2948` rows, so `5316` rows were deleted. Since the code modifies the `DataFrame` itself, maybe you ran the code twice and it appeared to have no effect, or we are using different versions of `pandas`, or something else. – C-3PO Nov 13 '22 at 21:47
  • I edited the code, did you run the last version? – C-3PO Nov 13 '22 at 21:49
  • You are right. I had made a mistake when running it! Thank you very much. – Robin Reiche Nov 14 '22 at 08:02