3

I have two pandas dataframes with following format:

df_ts = pd.DataFrame([
        [10, 20, 1,  'id1'],
        [11, 22, 5,  'id1'],
        [20, 54, 5,  'id2'],
        [22, 53, 7,  'id2'],
        [15, 24, 8,  'id1'],
        [16, 25, 10, 'id1']
    ], columns = ['x', 'y', 'ts', 'id'])


df_statechange = pd.DataFrame([
        ['id1', 2, 'ok'],
        ['id2', 4, 'not ok'],
        ['id1', 9, 'not ok']
    ], columns = ['id', 'ts', 'state'])

I am trying to get it to the format, such as:

df_out = pd.DataFrame([
        [10, 20, 1,  'id1', None    ],
        [11, 22, 5,  'id1', 'ok'    ],
        [20, 54, 5,  'id2', 'not ok'],
        [22, 53, 7,  'id2', 'not ok'],
        [15, 24, 8,  'id1', 'ok'    ],
        [16, 25, 10, 'id1', 'not ok']
    ], columns = ['x', 'y', 'ts', 'id', 'state'])

I understand how to accomplish it iteratively by grouping by id and then iterating through each row and changing status when it appears. Is there a pandas build-in more scalable way of doing this?

ymoiseev
  • 416
  • 5
  • 18
  • I cant figure out the relating between `df_statechange` and `df_ts` to `df_out`, could you explain a little bit more – Phung Duy Phong May 22 '17 at 07:01
  • I think you also need `cut` aside from `merge`. See this thread for more details http://stackoverflow.com/questions/42649224/pandas-alternate-way-to-add-new-column-with-lot-of-conditions-other-than-apply – arnold May 22 '17 at 10:43
  • I believe you would have to create a function to join the frames as you desire. Basically take a row of the data frame and compare it over the rows of the other dataframe. apply the function over the first dataframe. – user2685079 Jun 11 '17 at 14:04

2 Answers2

2

Unfortunately pandas merge support only equality joins. See more details at the following thread: merge pandas dataframes where one value is between two others if you want to merge by interval you'll need to overcome the issue, for example by adding another filter after the merge:

joined = a.merge(b,on='id')
joined = joined[joined.ts.between(joined.ts1,joined.ts2)]
Community
  • 1
  • 1
Dimgold
  • 2,748
  • 5
  • 26
  • 49
0

You can merge pandas data frames on two columns:

pd.merge(df_ts,df_statechange, how='left',on=['id','ts'])

in df_statechange that you shared here there is no common values on ts in both dataframes. Apparently you just copied not complete data frame here. So i got this output:

    x   y  ts   id state
0  10  20   1  id1   NaN
1  11  22   5  id1   NaN
2  20  54   5  id2   NaN
3  22  53   7  id2   NaN
4  15  24   8  id1   NaN
5  16  25  10  id1   NaN

But indeed if you have common ts in the data frames it will have your desired output. For example:

df_statechange = pd.DataFrame([
        ['id1', 5, 'ok'],
        ['id1', 8, 'ok'],
        ['id2', 5, 'not ok'],
        ['id2',7, 'not ok'],
        ['id1', 9, 'not ok']
    ], columns = ['id', 'ts', 'state'])

the output:

  x   y  ts   id   state
0  10  20   1  id1     NaN
1  11  22   5  id1      ok
2  20  54   5  id2  not ok
3  22  53   7  id2  not ok
4  15  24   8  id1      ok
5  16  25  10  id1     NaN
SillyPerson
  • 589
  • 2
  • 7
  • 30
  • 2
    I am trying to merge dataframes on intervals, so often there are no common values. Basically idea behind df_statechange is that it carries over state of this id for all timestamps in df_ts, until this state is changed. – ymoiseev May 23 '17 at 01:38