How to update a csv file based on data from another csv file?

Question

I have two csv file: list.csv company.csv

list.csv has four columns: company name,info1, info2, info3 where last three columns are empty and need to be updated based on data from company.csv

company.csv also has four columns: company name,info1, info2, info3

So what I want to do is go through company name column in list.csv and compare and find matching value in company.csv, then update info 1 to 3. I searched on Stackoverflow and it seems like I can use merge to solve this problem.

But another problem is that some of the company name in list.csv has misspelling.

For example, Stackoverflow is misspelled as Stackobarflow in list.csv. Therefore, instead of look for exactly the same company name in company.csv, I want to search for company name that has the same first four letters and regard it as same company.

How can I do this? I know I have to upload some codes when asking for a help but I have no idea how to do this in python

EDIT

list.csv

Company Name	info1	info2	info3
Faceboook	N/A	N/A	N/A
Stakoverflow	N/A	N/A	N/A

company.csv

Company Name	info1	info2	info3
Facebook	10	20	SNS
Stackoverflow	20	20	Coding

result.csv

Company Name	info1	info2	info3
Faceboook	10	10	SNS
Stakoverflow	20	20	Coding

Can you please edit your question and put there example input dataframes and expected output? — Andrej Kesely, Apr 13 '21 at 19:40
@AndrejKesely I edited question and put example dataframes and expected output — Dohun, Apr 13 '21 at 23:44

score 0 · Answer 1 · answered Apr 13 '21 at 20:47

For handling misspelled companies, you would have to use something like a Levenstein distance, or a similar approach. Quick googling gave me this SO post with a comparison of different methods. There are packages available for Levenshtein distance, no need to implement it yourself.

It all depends on the size of your files. I would probably use merge first to handle the case where company names match exactly, and then check the Levenshtein distance for the remainder. If the files are small, then there is no need for the first step, just do Levenshtein for all vs all. One approach is to create a rectangular matrix of distances, with rows corresponding to company names from list.csv, and columns - to company.csv. Divide the distance by the maximum of string lengths, to normalise it. Pick a (normalised) distance threshold, below which the strings are considered similar.

score 0 · Answer 2 · answered Apr 14 '21 at 00:00

0

Supposing you have these dataframes:

df1:

   Company Name  info1  info2  info3
0  Non-Existent    NaN    NaN    NaN
1     Faceboook    NaN    NaN    NaN
2  Stakoverflow    NaN    NaN    NaN

df2:

    Company Name  info1  info2   info3
0       Facebook     10     20     SNS
1  Stackoverflow     20     20  Coding

Then:

df1["tmp"] = df1["Company Name"].str[:3]
df2["tmp"] = df2["Company Name"].str[:3]

df1 = df1.set_index("tmp")
df2 = df2.set_index("tmp")

df1 = df1.fillna(df2).reset_index(drop=True)
print(df1)

Prints:

   Company Name info1 info2   info3
0  Non-Existent   NaN   NaN     NaN
1     Faceboook    10    20     SNS
2  Stakoverflow    20    20  Coding

answered Apr 14 '21 at 00:00

Andrej Kesely

168,389
15
48
91

Thank you for the answer. As far as I understand, the purpose of the first two lines of the code is to get the first four letters of the company. Then would it be possible to give one more rule something like having first four letters and last two letters? – Dohun Apr 14 '21 at 04:46
And if the some data is already filled in (not NaN) and I want to update it based on the data from `df2`, what can I use instead of `df.fillna`? – Dohun Apr 14 '21 at 04:47

How to update a csv file based on data from another csv file?

2 Answers2