0

I use pandas for this task. I imported two excel files:

  • File A contains company names and some other stuff.
  • File B contains company names, industries a company belongs to indicated by a number between 1 to 14, and some other stuff I don't need.

I want to compare these two files and find a match, and return the corresponding industry number, then make a new column in File A to show what industry a company belongs to by the number.

What I have done so far is to extract two columns I need from each file and make them into a list. Then to keep the association, I put them into a dictionary. Then I used for loop and nested for loop to find a match. But I don't know how to go further from here. And also one problem came up, which is the way companies are listed in the two files are somewhat same but not exactly. So I want to allow it to be a match if more than 4 sequences of characters in the names match.

com = A["comany"].tolist()
indu = A['industry'].tolist()

sponsor = B["sponsor"].tolist()
event = B["Event"].tolist()

dicA = dict(zip(com, indu))
dicB = dict(zip(sponsor, event))

import re
for spnsr in dicB:
    for company, industry in dicA.items():
        m = re.search(spnsr, company)
        if m:
            m = m.group()
            print(m, industry)
Tomerikoo
  • 18,379
  • 16
  • 47
  • 61
  • `df_a` and `df_b` have inconsistent names? Other than that, it would be a straightforward merge of the columns in `df_a` on those in `df_b`. But with inconsistent names, that becomes more difficult. You would have to do some sort of nearest merge where distance is defined between two keys by a string distance. But such a functionality isn't present in Pandas. You could, however, extract the merge to a function which would find your matching row and then return the value at that row, applying that row-wise across your data. – ifly6 Aug 09 '19 at 20:39
  • Please take a look at [How to create good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and update your question to include some sample input(s) and output – G. Anderson Aug 09 '19 at 21:25
  • Possible duplicate of [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) – help-ukraine-now Aug 09 '19 at 22:05

1 Answers1

0
import pandas as pd
import numpy as np

# Read both Excel files
file1 = pd.read_excel("file1.xlsx", na_values=['NA'])
file2 = pd.read_excel("file2.xlsx", na_values=['NA'])


df2 = file1
df1 = file2



res = df1[df1['samecolname'].isin(df2['samecolname'].unique())]
                   
res2 = df2[df2['samecolname'].isin(df1['samecolname'].unique())]               

res.to_excel('diff1-insecond-but-not-in-first.xlsx',index=False)
res2.to_excel('diff2-in-first-not-in-second.xlsx',index=False)
indera
  • 81
  • 1
  • 2