0

I have a somewhat general question about combining multiple excel files together. Normally, I would use pd.read_excel to read the files then concat to join. However, I have some cases where the field names are not exactly the same but similar. For example,

one sheet would have fields like: Apple, Orange, Size, Id

another sheet would be: Apples, orange, Sizes, #

I have use the rename columns function but with this I have to check and compare every names in each files. I wonder if there's any way to combine them without going through all the field names. Any thought? THANKS!

TylerNG
  • 919
  • 2
  • 9
  • 23
  • 2
    There is hardly a good shortcut to knowing what is in your files. If you have thousands of fields you'd want to match very similar names (perhaps using Levenshtein Distance). But even here, what's your tolerance for false positive matches? Or the opposite? The question can only be answered knowing that. Otherwise, if the field count is short, loop through and open the files, and store their names in a dictionary with the aim of creating a global mapping of field names in existence. – ako Apr 13 '18 at 17:28
  • Thanks! that what I thought too but was just curious if there's some new feature I didn't know of :) – TylerNG Apr 13 '18 at 17:33
  • 1
    Is this of use? https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas – RCA Apr 13 '18 at 18:07
  • @RCA not quite because I just want to stack them on top of each other not join – TylerNG Apr 13 '18 at 18:25

1 Answers1

1

Define what it means for two strings to be the same, then you can do the renaming automatically (you'll also need to determine what the "canonical" form of the string is - the name that you'll actually use in the final dataframe). The problem is pretty general, so you'll have to decide based on the sort of column names that you're willing to consider the same, but one simple thing might be to use a function like this:

def compare_columns(col1: str, col2: str) -> bool:
    return col1.lower() == col2.lower()

Here you'd be saying that any two columns with the same name up to differences in case are considered equal. You'd probably want to define the canonical form for a column to be all lowercase letters.

Actually, now that I think about it, since you'll need a canonical form for a column name anyway, the easiest approach would probably be, instead of comparing names, to just convert all names to canonical form and then merge like usual. In the example here, you'd rename all columns of all your dataframes to be their lowercase versions, then they'll merge correctly.

The hard part will be deciding what transforms to apply to each name to get it into canonical form. Any transformation you do has a risk of combining data that wasn't mean to be (even just changing the case), so you'll need to decide for yourself what's reasonable to change based on what you expect from your column names.

As @ako said, you could also do this with something like Levenstein distance, but I think that will be trickier than just determining a set of transforms to use on each column name. With Levenstein or similar, you'll need to decide which name to rename to, but you'll also have to track all names that map to that name and compute the Levenstein distance between the closest member of that group when deciding if a new name maps to that canonical name (e.g. say that you have "Apple" and "Aple" and "Ale" and are merging names with edit distance of 1 or less. "Apple" and "Aple" should be merged, as should "Aple" and "Ale". "Apple" and "Ale" normally shouldn't be (as their distance is 2), but because they both merge with "Aple", they also merge with each other now).

You could also look into autocorrect to try to convert things like "Aple" to "Apple" without needing "Ale" to also merge in; I'm sure there's some library for doing autocorrect in Python. Additionally, there are NLP tools that will help you if you want to do stemming to try to merge things like "Apples" and "Apple".

But it'll all be tricky. Lowercasing things probably works, though =)

Nathan
  • 9,651
  • 4
  • 45
  • 65