1

I have two csv file: list.csv company.csv

list.csv has four columns: company name,info1, info2, info3 where last three columns are empty and need to be updated based on data from company.csv

company.csv also has four columns: company name,info1, info2, info3

So what I want to do is go through company name column in list.csv and compare and find matching value in company.csv, then update info 1 to 3. I searched on Stackoverflow and it seems like I can use merge to solve this problem.

But another problem is that some of the company name in list.csv has misspelling.

For example, Stackoverflow is misspelled as Stackobarflow in list.csv. Therefore, instead of look for exactly the same company name in company.csv, I want to search for company name that has the same first four letters and regard it as same company.

How can I do this? I know I have to upload some codes when asking for a help but I have no idea how to do this in python


EDIT

list.csv

Company Name info1 info2 info3
Faceboook N/A N/A N/A
Stakoverflow N/A N/A N/A

company.csv

Company Name info1 info2 info3
Facebook 10 20 SNS
Stackoverflow 20 20 Coding

result.csv

Company Name info1 info2 info3
Faceboook 10 10 SNS
Stakoverflow 20 20 Coding
Dohun
  • 477
  • 2
  • 7
  • 13

2 Answers2

0

For handling misspelled companies, you would have to use something like a Levenstein distance, or a similar approach. Quick googling gave me this SO post with a comparison of different methods. There are packages available for Levenshtein distance, no need to implement it yourself.

It all depends on the size of your files. I would probably use merge first to handle the case where company names match exactly, and then check the Levenshtein distance for the remainder. If the files are small, then there is no need for the first step, just do Levenshtein for all vs all. One approach is to create a rectangular matrix of distances, with rows corresponding to company names from list.csv, and columns - to company.csv. Divide the distance by the maximum of string lengths, to normalise it. Pick a (normalised) distance threshold, below which the strings are considered similar.

Evgeny Tanhilevich
  • 1,119
  • 1
  • 8
  • 17
0

Supposing you have these dataframes:

df1:

   Company Name  info1  info2  info3
0  Non-Existent    NaN    NaN    NaN
1     Faceboook    NaN    NaN    NaN
2  Stakoverflow    NaN    NaN    NaN

df2:

    Company Name  info1  info2   info3
0       Facebook     10     20     SNS
1  Stackoverflow     20     20  Coding

Then:

df1["tmp"] = df1["Company Name"].str[:3]
df2["tmp"] = df2["Company Name"].str[:3]

df1 = df1.set_index("tmp")
df2 = df2.set_index("tmp")

df1 = df1.fillna(df2).reset_index(drop=True)
print(df1)

Prints:

   Company Name info1 info2   info3
0  Non-Existent   NaN   NaN     NaN
1     Faceboook    10    20     SNS
2  Stakoverflow    20    20  Coding
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thank you for the answer. As far as I understand, the purpose of the first two lines of the code is to get the first four letters of the company. Then would it be possible to give one more rule something like having first four letters and last two letters? – Dohun Apr 14 '21 at 04:46
  • And if the some data is already filled in (not NaN) and I want to update it based on the data from `df2`, what can I use instead of `df.fillna`? – Dohun Apr 14 '21 at 04:47