1

I have two DataFrames, I need to merge both, and I need to add a column that specifies if it is accepted or not.

I have this:

dfa[dfa.CONTROL.isin([334030860978638])]

Out[107]:
            CONTROL             A               B               DATE_HOUR
1629136     334030860978638     525562414612    52447860015000  2015-08-02 16:32:00
1629137     334030860978638     525562414612    52447860015000  2015-08-02 16:42:32
1629138     334030860978638     525562414612    52447860015000  2015-08-02 18:33:12
1629139     334030860978638     525562414612    52447860015000  2015-08-03 19:40:19


dfb[dfb.control.isin([334030860978638])]

Out[108]:
            control             a               b               date_hour
id              
299366338   334030860978638     525562414612    447860015000    2015-08-02 16:33:08
299392621   334030860978638     525562414612    447860015000    2015-08-02 16:43:40
299665465   334030860978638     525562414612    447860015000    2015-08-02 18:34:21

view = dfa.merge(dfb, left_on=['CONTROL', 'A', 'B'],
                right_on=['control', 'a', 'b'], how='outer')

I need to compare DATE_HOUR, with date_hour and if records are in time range for example 3600 seconds, also I need to determine if multiple records exist in time then I'll get the nearest one and to mark it, in a new column accepted I'll set as True, otherwise False.

my expected output:

 CONTROL            A               B               DATE_HOUR           control             a               b               date_hour           accepted
 334030860978638    525562414612    52447860015000  2015-08-02 16:32:00 334030860978638     525562414612    52447860015000  2015-08-02 16:32:08 True
 334030860978638    525562414612    52447860015000  2015-08-02 16:42:32 334030860978638     525562414612    52447860015000  2015-08-02 16:43:40 True
 334030860978638    525562414612    52447860015000  2015-08-02 18:33:12 334030860978638     525562414612    52447860015000  2015-08-02 18:34:21 True
 334030860978638    525562414612    52447860015000  2015-08-03 19:40:19 NaN                 NaN             Nan             NaT                 False

Can I use the apply method to this task?, can someone help me to do in the right way using pandas.

paridin
  • 429
  • 7
  • 14
  • This is a very interesting problem. Image that your `dfa` already has the columns from `dfb`, only the values are missing. Then it becomes a missing data problem, where you basically want to solve for nearest neighbor for each row of `dfa`. Start with grouping on `CONTROL` and `control`, then sorting on `DATE_HOUR` and `date_hour`. Next, you'll have to look up and adapt a nearest neighbor algorithm for your cause. – Kartik Nov 24 '15 at 21:45
  • I'll take your suggestion and thanks for that, really I feel confused. – paridin Nov 25 '15 at 05:39

1 Answers1

0

This similar problem helps me to solve my issue.

def nearest(group, match, groupname, lname, rname, name_field_diff='diff_minutes'):
    match = match[match[groupname] == group.name]
    try:
        nbrs = NearestNeighbors(1).fit(match[rname].values[:, None])
        dist, ind = nbrs.kneighbors(group[lname].values[:, None])
        group[lname] = group[lname]
        group[rname] = match[rname].values[ind.ravel()]
        time_diff = (group[rname] - group[lname]) / np.timedelta64(1, 'm')        
        group[name_field_diff] = time_diff.abs()
    except:
        pass
    return group

d1 = [{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:32:00'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:42:32'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 18:33:12'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 19:40:19'}]

d2 = [{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:33:08'},
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:43:40'},
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 18:34:21'}]

df1 = pd.DataFrame(d1)
df1.DATE_HOUR = pd.to_datetime(df1.DATE_HOUR, format='%Y-%m-%d %H:%M:%S')

df2 = pd.DataFrame(d2)
df2.date_hour = pd.to_datetime(df2.date_hour, format='%Y-%m-%d %H:%M:%S')

df1.groupby('CONTROL').apply(nearest, df2, 'control', 'DATE_HOUR', 'date_hour')

    A               B               CONTROL             DATE_HOUR               date_hour               diff_minutes
0   525562414612    52447860015000  334030860978638     2015-08-02 16:32:00     2015-08-02 16:33:08     1.133333
1   525562414612    52447860015000  334030860978638     2015-08-02 16:42:32     2015-08-02 16:43:40     1.133333
2   525562414612    52447860015000  334030860978638     2015-08-02 18:33:12     2015-08-02 18:34:21     1.150000
3   525562414612    52447860015000  334030860978638     2015-08-02 19:40:19     2015-08-02 18:34:21     65.966667

And now I filter using my gap to determine which records doesn't fit.

df1[df1.index.isin(view[(view.diff_minutes >= 60)].index)]

    A               B               CONTROL             DATE_HOUR
3   525562414612    52447860015000  334030860978638     2015-08-02 19:40:19
Community
  • 1
  • 1
paridin
  • 429
  • 7
  • 14