I am scraping data from a website, which produces a pandas dataframe whose column names vary depending on the data available on each page. I have a vector of column names, say:
colnames = ['column1', 'column2', 'column3', 'column5']
which are the columns of the Postgres table in which I wish to store the scraped data.
The problem is that, because of how I had to set up the scraping to get all the data I need, I end up grabbing some columns I have no use for and which aren't in my Postgres table. These extra columns vary from page to page (some pages have additional data), so I can't simply exclude a fixed set of unwanted names, as I don't know in advance what they will all be. Conversely, some columns in my Postgres table won't have data scraped every time.
Hence, when I try to upload the resulting dataframe to Postgres, I get the error:
psycopg2.errors.UndefinedColumn: column "column4" of relation "my_db" does not exist
This leads to my question: how do I subset the resulting pandas dataframe using the column names stored in my vector, given that some of those columns may not exist in the dataframe? I have tried my_dt = my_dt[colnames], which returns the error:
KeyError: ['column1', 'column2', 'column3'] not in index
Reproducible example:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
subset_columns = ['column1', 'column2', 'column3', 'column5']
test = df[subset_columns]
Any help will be appreciated.
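Edit: to make the situation fully concrete, here is a self-contained version of the example above, including one approach I have been experimenting with (intersecting my desired column list with the columns actually present); I'm not sure whether this is the idiomatic way to do it:

```python
import pandas as pd

# Reproducible example: the scraped frame has 'column4' (not in my
# Postgres table) and lacks 'column5' (which is in my table).
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]],
                  columns=['column1', 'column2', 'column3', 'column4'])
colnames = ['column1', 'column2', 'column3', 'column5']

# Keep only the desired columns that actually exist in the dataframe.
subset = df[df.columns.intersection(colnames)]
print(list(subset.columns))  # ['column1', 'column2', 'column3']
```

This at least avoids the KeyError, but I don't know whether it handles every case (e.g. the columns in my table that are missing from the scraped data).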