How to take specific columns in pandas dataframe only if they exist (different CSVs)

Question

I downloaded a bunch of football data from the internet in order to analyze it (around 30 CSV files). Each season's game data is saved as a CSV file with different data columns. Some data columns are common to all files e.g. Home team, Away team, Full time result, ref name, etc...

Earlier years CSV data columns picture - These column are common to all CSVs

However in more recent years the data became richer and has some new data columns e.g. Corners for home team, Corners for away team, yellow cards for each team, shots on goal for each side, etc...

Recent years CSV data columns picture - Contains the common columns as well as additional ones

I made a generic function that take each season's CSV gameweek data and turns it into a full table (how it looked at the end of the season) with different stats. Now when I try to build the "final-day" table of each season from the common data columns alone everything works out fine, However, when I try to throw in the uncommon columns (corners for example) I get an error. This is no surprise to me and i know how to check whether a CSV includes a certain column, but i'd like to know if there is a clever way to command the dataset to take a certain column if it exists (say 'Corners') and just skip this column if it does not exist.

I present part of the function that does rises the error. the last line is the problematic one. When I use leave only the common columns in (i.e. deleting every column after FTR) the function works fine. The code in general gets 1 season at the time and builds the table.

# create a pandas dataframe of a specific season before the season started
# returns a pandas dataframe with the year of the season and the teams involved with initialized stats
# path is the full path of the file retained by glob function, and raw_data is a pandas dataframe read directly from the CSV
def create_initial_table(path, raw_data):
# extracts the season's year
season_number = path[path.index("/") + 1:path.index(".")]
# reduce the information to the relevant columns
raw_data = raw_data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']]

with this constellation i'd like to continue. i.e. when the column name does not exist just skip to the next column and so on till the ones that do exist remains and the ones that arent wont rise an error.

In later functions i also update the values of these columns (corners, shots on goal, etc...), so the same skip functionallity is needed there too.

Thanks for the advices :>

Please do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors, https://idownvotedbecau.se/imageofcode, https://idownvotedbecau.se/imageofanexception/. — AMC, Oct 08 '20 at 21:58
Does this answer your question? [How to check if a column exists in Pandas](https://stackoverflow.com/questions/24870306/how-to-check-if-a-column-exists-in-pandas) — AMC, Oct 08 '20 at 21:58
Not quite @AMC. I dont to actively check if the column exist, i want to be able to take certain columns only if they exist, otherwise dont take it. What can happen is that at the end of current season a new state will be kept track (say goalkeeper's saves) and i'd like to easily push it in the code without it affecting previous work. — Aqua-, Oct 08 '20 at 23:29

score 0 · Accepted Answer · answered Oct 08 '20 at 21:57

0

You can use DataFrame.filter(items=...) see this example:

all_columns = ['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'FTR', 'HS', 'AS', 'HST', 'AST', 'HC', 'AC']
df = pd.DataFrame(np.random.rand(5, 5), columns=['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG', 'BAD COLUMN'])

print(df)
   HomeTeam  AwayTeam      FTHG      FTAG  BAD COLUMN
0  0.265389  0.523248  0.093941  0.575946    0.929296
1  0.318569  0.667410  0.131798  0.716327    0.289406
2  0.183191  0.586513  0.020108  0.828940    0.004695
3  0.677817  0.270008  0.735194  0.962189    0.248753
4  0.576157  0.592042  0.572252  0.223082    0.952749

Even though I feed it column names that don't exist in the dataframe, it will only pull out the columns that exist

new_df = df.filter(items=all_columns)

print(new_df)
   HomeTeam  AwayTeam      FTHG      FTAG
0  0.265389  0.523248  0.093941  0.575946
1  0.318569  0.667410  0.131798  0.716327
2  0.183191  0.586513  0.020108  0.828940
3  0.677817  0.270008  0.735194  0.962189
4  0.576157  0.592042  0.572252  0.223082

answered Oct 08 '20 at 21:57

Cameron Riddell

10,942
9
19

Great @Cameron! It seems like what I need. Will try and come back to accept the answer if im successful. Any idea about cell updating? Say you had 2 DataFrames like the ones you created - one like the last one you printed, and another with an additional column - say HS. id like my generic function to keep count of those shots made by each team, however only one of the DF contain it. I'd like my function to update the cell if the column exist and do nothing if it doesnt. An example update line: (fixture is a row in the CSV) `season_tbl.at[fixture['HomeTeam'], 'HShots'] += fixture['HS']` – Aqua- Oct 08 '20 at 23:34
To be clear, you're trying to update some larger dataframe `season_tbl` with values from a smaller frame `fixture`. However there are instances where `season_tbl` has columns that don't exist in `fixture`. If that's the case you want to only update the columns that `season_tbl` and `fixture` have in common? Is `season_tbl` always a superset of columns compared to `fixture`? – Cameron Riddell Oct 08 '20 at 23:46
You are **mostly** correct. I will describe exactly: Fixture is iterating over rows of the greater dataset match_list which contains every match in that specific season (the images in OP are from that CSV). season_tbl is a new DataSet i have created to for a final day table of the season (how many point collected, goal for, goals against, etc...). Each season_tbl will only present stats it can deduct from fixture, However, my functions are generic. That means that if, say, fixture contains 'HC' column in 14-15 season but not in 94-95 season the function should work for both of them. – Aqua- Oct 09 '20 at 10:10
This is the rest of previous comment: In this case i want it to take this new stat 'HC' whenever it exist (i will do it with filter) and update the cell if the column exist. I dont want to have to check for each column if it exist before i try to update. The filter method is just what i was looking for in that sense, and i feel like there should be an option to do so without actively checking if a specific column exist before updating the cell – Aqua- Oct 09 '20 at 10:10

How to take specific columns in pandas dataframe only if they exist (different CSVs)

1 Answers1