Subsetting a pandas dataframe based on column names stored in a vector

Question

I am scraping data from a website which builds a Pandas dataframe with different column names dependent on the data available on the site. I have a vector of column names, say:

colnames = ['column1', 'column2', 'column3', 'column5']

which are the columns of a postgres database for which I wish to store the scraped data in.

The problem I am having is that the way I have had to set up the scraping to get all the data I require, I end up grabbing some columns for which I have no use and which aren't in my postgres database. These columns will not have the same names each time, as some pages have extra data, so I can't simply exclude the column names I don't want, as I don't know what all of these will be. There will also be columns in my postgres database for which the data will not be scraped every time.

Hence, when I try and upload the resulting dataframe to postgres, I get the error:

psycopg2.errors.UndefinedColumn: column "column4" of relation "my_db" does not exist

This leads to my question:

How do I subset the resulting pandas dataframe using the column names I have stored in the vector, given some of the columns may not exist in the dataframe? I have tried my_dt = my_dt[colnames], which returns the error:

KeyError: ['column1', 'column2', 'column3'] not in index

Reproducible example:

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns = 
['column1', 'column2', 'column3', 'column4'])

subset_columns = ['column1', 'column2', 'column3', 'column5']

test = df[subset_columns]

Any help will be appreciated.

that should work. Can you please post a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)? You can follow [this question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) to create a reproducible pandas example — CAPSLOCK, May 02 '19 at 13:50
@Gio I forgot to mention part of the problem; all the column names I grab from the postgres database may not be in the pandas dataframe, which I now realise is what is probably causing the issue. — Michael Pyle, May 02 '19 at 14:04

score 2 · Accepted Answer · answered May 03 '19 at 07:59

2

You can simply do:

colnames = ['column1', 'column2', 'column3', 'column5']

df[df.columns & colnames]

answered May 03 '19 at 07:59

CAPSLOCK

6,243
3
33
56

Perfect, thanks. Still getting to grips with python and it's intricacies - am usually an R guy. – Michael Pyle May 03 '19 at 14:17

score 1 · Answer 2 · answered May 02 '19 at 14:26

I managed to find a fix, though I still don't understand what was causing the initial 'Key Error' to come out as a vector rather than just the elements which weren't columns of my dataframe:

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns = 
['column1', 'column2', 'column3', 'column4'])

subset_columns = ['column1', 'column2', 'column3', 'column5']

column_match = set(subset_columns) & set(df.columns)

df = df[column_match]

Out[69]: 
   column2  column1  column3
0        2        1        3
1        6        5        7

Because it's a KeyError defined in the pandas library: raise KeyError('%s not in index' % objarr[mask]) — micric, May 02 '19 at 14:45

Subsetting a pandas dataframe based on column names stored in a vector

2 Answers2