2

I have two large dataframes, both have datetime column. One of them the datetime format is %Y-%m-%dT%H:%M:%S and another one is in millisecond format: %Y-%m-%dT%H:%M:%S.%f. How can I join these two dataframes efficiently based on a part of datetime column (join based on seconds)?

I know that we can use pandas merge function or use something like:

df1.merge(df2, left_on=['df1_column'], right_on=['df2_column'])

But is there a way to customize this merge, so that we can join two dataframe based on one column from left side and part of another column from right side (without adding any additional column to the dataframes)?

    import pandas as pd

    df1 = pd.DataFrame({'datetime_s': ['2018-09-09T12:25:26', '2018-09-09T12:25:27','2018-09-09T12:25:28']\
                    ,'value': [1, 2, 3]})
    df2 = pd.DataFrame({'datetime_m': ['2018-09-09T12:25:26.780', '2018-09-09T12:25:26.650', '2018-09-09T12:25:27.320']\
                    ,'value': [5, 6, 7]})
    # add something to the following line to merge them based on the second
    df1.merge(df2, left_on=['datetime_s'], right_on=['datetime_m'])

the output should be:

datetime_s             datetime_m                 df1.value   df2.value
2018-09-09T12:25:26    2018-09-09T12:25:26.780      1           5
2018-09-09T12:25:26    2018-09-09T12:25:26.650      1           6
2018-09-09T12:25:27    2018-09-09T12:25:27.320      2           7
Enayat
  • 3,904
  • 1
  • 33
  • 47

1 Answers1

0

How about :

df1.merge(df2, left_on=df1.df1_column.apply(lambda x :x.second), 
              right_on=df2.df2_column.apply(lambda x : x.second))
Ayoub ZAROU
  • 2,387
  • 6
  • 20
  • It might work for small dataframes, but for large dataframes (even with ~ 1 million records each), Python gives memory error. – Enayat Jul 17 '19 at 15:32