how to handle different spelling of column names when extracting data?

Question

For this example i have 2 dataframes, the genre column in df1 is column 3 but in df2 it is column 2, also the header is slightly different. in my actual script i have to search the column names because the column location varies in each sheet it reads.

how do i recognise different header names as the same thing?

df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
               'VENDOR ID': ['1234','4321','4132'],
               'GENRE(S)': ['Action', 'Adventure', 'Drama']})

df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
               'Genre': ['Animation', 'Adventure', 'Family'],
               'VENDOR ID': ['5678','8765','8576']})

column_names = ['TITLE','VENDOR ID','GENRE(S)']

appended_data = []

sheet1 = df1[df1.columns.intersection(column_names)]
appended_data.append(sheet1)
sheet2 = df2[df2.columns.intersection(column_names)]
appended_data.append(sheet2)

appended_data = pd.concat(appended_data, sort=False)

output:

        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678        NaN
1       Shrek      8765        NaN
2      Frozen      8576        NaN

desired output:

        TITLE VENDOR ID   GENRE(S)
0  The Matrix      1234     Action
1    Die Hard      4321  Adventure
2   Kill Bill      4132      Drama
0   Toy Story      5678  Animation
1       Shrek      8765  Adventure
2      Frozen      8576     Family

You need to provide us with the table you are getting as output and the table you want it to look like. [Read](https://stackoverflow.com/help/minimal-reproducible-example) about minimal and reproducible questions. — Ukrainian-serge, Feb 27 '20 at 19:57
i can't show the output as it contains data i can't share. however, the output i get is a spreadsheet with column_names as the headings. the issue is, for example, the "GENRE(S)" column will contain data extracted from the source sheets that have a column called "GENRES(S)", however when a sheet has a column called "Genre" instead - the cells are empty because that column name doesn't match. the other issue is the columns aren't in the same order in the source sheets which is why i search the column name rather than column index. — robhfs, Feb 27 '20 at 20:55
A picture is worth a thousand words. Instead of describing it you must create input and intended output DUMMY data. — Ukrainian-serge, Feb 27 '20 at 21:24
added examples to the post. if i can't specify the column names or their index numbers, is it still possible to extract the data? — robhfs, Feb 27 '20 at 23:55
That isn't reproducible. Have you read [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)? Please read that and then edit your question in a form that is reproducible. — Ukrainian-serge, Feb 28 '20 at 01:20
ok i've redone the above with a trimmed down reproducible version — robhfs, Feb 28 '20 at 17:03

Ukrainian-serge · Answer 1 · 2020-02-28T20:02:58.053

Thank you for taking the time to do that. Asking a good questions is very important and now that you have posed a coherent question I was able to find a simple solution rather quickly:

import pandas as pd

df1 = pd.DataFrame({'TITLE': ['The Matrix','Die Hard','Kill Bill'],
                'VENDOR ID': ['1234','4321','4132'],
                 'GENRE(S)': ['Action', 'Adventure', 'Drama']})

df2 = pd.DataFrame({'TITLE': ['Toy Story','Shrek','Frozen'],
                    'Genre': ['Animation', 'Adventure', 'Family'],
                'VENDOR ID': ['5678','8765','8576']})

Simple way:
We will use .append() below but for this to work, we need columns in df1 and df2 to match. In this case we'll simply replace df2's 'Genre' to 'GENRE(S)'

df2.columns = ['TITLE', 'GENRE(S)', 'VENDOR ID']

df3 = df1.append(df2)
print(df3)

    GENRE(S)       TITLE VENDOR ID
0     Action  The Matrix      1234
1  Adventure    Die Hard      4321
2      Drama   Kill Bill      4132
0  Animation   Toy Story      5678
1  Adventure       Shrek      8765
2     Family      Frozen      8576

More elaborate:
Now, for a single use case this works but there may be cases where you have many mismatched columns and/or have to do this repeatedly. Here is a solution using boolean indexing to find mismatched names, then zip() and .rename() to map the column names:

# RELOAD YOUR ORIGINAL DF'S 

df1_find = df1.columns[~df1.columns.isin(df2.columns)] # select col name that isnt in df2
df2_find = df2.columns[~df2.columns.isin(df1.columns)] # select col name that isnt in df1

zipped = dict(zip(df2_find, df1_find)) # df2_find as key, df1_find as value

df2.rename(columns=zipped, inplace=True) # map zipped dict to the column names

df3 = df1.append(df2)
print(df3)

    GENRE(S)       TITLE VENDOR ID
0     Action  The Matrix      1234
1  Adventure    Die Hard      4321
2      Drama   Kill Bill      4132
0  Animation   Toy Story      5678
1  Adventure       Shrek      8765
2     Family      Frozen      8576

Keep in mind:

this way of doing it assumes that both your df's have the same count of columns
this ALSO assumes that df1 has your ideal column names which you will use against other dfs to fix their column names

I hope this helps.

how to handle different spelling of column names when extracting data?

1 Answers1