Find string in one csv and replace with string in a different csv in a loop

Question

I have two csv files. csv1 looks like this:

Title,glide gscore,IFDScore
235,-9.01,-1020.18
235,-8.759,-1020.01
235,-7.301,-1019.28

while csv2 looks like this:

ID,smiles,number
28604361,NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3,102
14492699,COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C,235
16888863,COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO,108

Both are much larger than what I show here. I need some way to match each value in the Title column of csv1 to the corresponding value in the number column of csv2. When a match is found, I need the value in the Title column of csv1 to be replaced with the corresponding value in the ID column of csv2. Thus I would want my desired output to be:

Title,glide gscore,IFDScore
14492699,-9.01,-1020.18
14492699,-8.759,-1020.01
14492699,-7.301,-1019.28

I am looking for a way to do it through pandas, bash or python.

This answer is close but gives me an ambiguous truth value of a DataFrame.
I also tried update in pandas without luck.

I'm not pasting the exact code I've tried yet because it would be overwhelming to see faulty code in pandas, bash and python all at once.

TheFaultInOurStars · Answer 1 · 2022-03-17T19:02:46.020

1

You can use pandas module to load your dataframe, and then, using merge function, you can achieve what you are seeking for:

import pandas as pd
df1 = pd.read_csv("df1.csv")
df2 = pd.read_csv("df2.csv")

merged = df1.merge(df2, left_on="Title", right_on="number", how="right")
merged["Title"] = merged["ID"]
merged

Output

	Title	glide gscore	IFDScore	ID	smiles	number
0	28604361	nan	nan	28604361	NC(=O)CNC(=O)CC(c(cc1)cc(c12)OCO2)c3ccccc3	102
1	14492699	-9.01	-1020.18	14492699	COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C	235
2	14492699	-8.759	-1020.01	14492699	COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C	235
3	14492699	-7.301	-1019.28	14492699	COc1cccc(c1OC)C(=O)N2CCCC(C2)CCC(=O)Nc3ccc(F)cc3C	235
4	16888863	nan	nan	16888863	COc1cc(ccc1O)CN2CCN(CC=C(C)C)C(C2)CCO	108

Note that the Nan values are due to unavailable values. If your dataframe covers these parts too, it won't result in Nan.

edited Mar 17 '22 at 19:02

answered Mar 17 '22 at 18:45

TheFaultInOurStars

3,464
1
8
29

If you switch df1 for df2 when loading the csv files, and swap the `Title` in the merged file so it looks like `merged["number"] = merged["ID"]` and then use `merged.drop(columns=['smiles', 'number', 'Title'])` this would work – Christian Seitz Mar 17 '22 at 19:01
1

@ChristianSeitz Thanks for the comment, Christian. There is not much difference in changing the `number` or `Title` column. Both will work fine. And about the drop part, of course, dropping would make more precise, but since I was not sure whether it is of interest to you, I decided not to drop anything. – TheFaultInOurStars Mar 17 '22 at 19:04

score 1 · Accepted Answer · answered Mar 17 '22 at 18:54

You could map it; then use fillna in case there were any "Titles" that did not have a matching "number":

csv1 = pd.read_csv('first_csv.csv')
csv2 = pd.read_csv('second_csv.csv')
csv1['Title'] = csv1['Title'].map(csv2.set_index('number')['ID']).fillna(csv1['Title']).astype(int)

Output:

      Title  glide gscore  IFDScore
0  14492699        -9.010  -1020.18
1  14492699        -8.759  -1020.01
2  14492699        -7.301  -1019.28

Find string in one csv and replace with string in a different csv in a loop

2 Answers2

Output