Python Pandas : compare the rows of two csv(dataframe) for similar values along one column and return content of rows (columns) with similarites

Question

I am working with two csv files and imported as dataframe. Let's say df1 and df2 as in picture, df1 and df2 which are of different length. df1 has 50000 rows and df2 has 20000 rows.

df1

df2

I want to compare (iterate through rows) the 'time' of df2 with df1, find the difference in time and return the values of all column corresponding to similar row.

For example, 66 (of 'time' in df1) is nearest to 67 (of 'time' in df2), So I would like to return the contents to df1 (15'vel' and 25'yaw') to df2 and save as a new csv

Welcome to SO. Please provide a **[mcve]**. Also see: **[How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)**. — jpp, Apr 20 '18 at 11:26
Please take the time to read those linked pages. Images are *not* good for providing a [good pandas example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — jpp, Apr 20 '18 at 11:36

score 0 · Answer 1 · answered Apr 20 '18 at 11:36

Create cartesian product and then do filtering -

df1 = pd.DataFrame({'time': [11, 22, 33,44, 55, 66,77,88,99], 'vel':[10, 11,12,13,14,15,16,17,18],
                   'yaw' : [20, 21, 22,23,24,25,26,27,28]})
df2 = pd.DataFrame({'time' : [67, 75, 87, 99]})

df1['key'] = 1
df2['key'] = 1
df1.rename(index=str, columns ={'time' : 'time_x'}, inplace=True)

df = df2.merge(df1, on='key', how ='left').reset_index()
df['diff'] = df.apply(lambda x: abs(x['time']  - x['time_x']), axis=1)
df.sort_values(by=['time', 'diff'], inplace=True)

df=df.groupby(['time']).first().reset_index()[['time', 'vel', 'yaw']]

Thanks for the solution, I run out of memory and kernel dies. Since i have almost df1-8000 rows values and df2-4000 values in real problem., Do you have any solution regrading memory issues @Aritesh — ankushbraj, Apr 20 '18 at 12:43

nisahbhtrap · Accepted Answer · 2018-04-20T13:09:54.783

It can be done with the help of iterrows() function.

Here is the code :

first table creation :

value=[(0,11,10,20),(1,22,11,21),(2,33,12,22),(3,44,13,23),(4,55,14,24), 
(5,66,15,25),(6,77,16,26),(7,88,17,27),(8,99,18,28)]
header=["index","time","vel","yaw"]
df1 = pd.DataFrame.from_records(value, columns=header)

second table creation:

value=[(0,67,"nan","nan"),(1,75,"nan" ,"nan" ),(2,87,"nan" ,"nan" ) 
(3,99,"nan" ,"nan" )]
header=["index","time","vel","yaw"]
df2 = pd.DataFrame.from_records(value, columns=header)

storing the result in df2 itself:

for index, row in df2.iterrows():
    min=10000000
    for indexer, rows in df1.iterrows():
        if abs(row['time']-rows['time'])<min:
            min = abs(row['time']-rows['time'])
            #storing the position 
            pos = indexer
    df2.loc[index,'vel'] = df1['vel'][pos]
    df2.loc[index,'yaw'] = df1['yaw'][pos]

Python Pandas : compare the rows of two csv(dataframe) for similar values along one column and return content of rows (columns) with similarites

df1

df2

2 Answers2