
I have two lists of timestamps, list_a and list_b. What is the best way to use np.searchsorted to find the most recent entry in list_a for each entry in list_b? The result would be a list_a_updated in which each x lines up with its corresponding (and later) entry in list_b. This question is very similar to this one:

pandas.merge: match the nearest time stamp >= the series of timestamps

but a little bit different.

It embarrasses me that I cannot figure out how to reverse this so that it grabs the <= timestamp instead of the >= timestamp, but I have been working on this for a while and it is less obvious than it seems. My example code is:

#in this code tradelist is list_b, balist is list_a

tradelist=np.array(list(filtereddflist[x][filtereddflist[x].columns[1]]))
df_filt=df_filter(filtereddflist2[x], 2, "BEST_BID" )
balist=np.array(list(df_filt[df_filt.columns[1]]))

idx=np.searchsorted(tradelist,balist)-1
mask= idx <=0

df=pd.DataFrame({"tradelist":tradelist[idx][mask],"balist":balist[mask]})

And the solution is not as simple as just switching the inequality.

If it helps at all I am dealing with trade and bid stock data and am trying to find the most recent bid (list_a) for each trade (list_b) without having to resort to a for loop.

sfortney
  • Look at the `side` keyword argument of [`np.searchsorted`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html), I think you only need to set `side='right'` and you will be 80% there. – Jaime Mar 27 '15 at 19:36
  • Thanks! I'm not quite sure how that would be any different though from just swapping the argument order. Are the two equivalent? – sfortney Mar 27 '15 at 19:58
  • They have nothing to do with each other... I have posted a complete answer, see if that makes sense. – Jaime Mar 27 '15 at 21:35
  • Ah yes. It does something totally different. You are right. And thank you for the complete answer. I have accepted it. I just tested it out with my code and it works. – sfortney Mar 28 '15 at 16:51

1 Answer

To make our lives easier, let's use numbers instead of timestamps:

>>> a = np.arange(0, 10, 2)
>>> b = np.arange(1, 8, 3)
>>> a
array([0, 2, 4, 6, 8])
>>> b
array([1, 4, 7])

For each item in b, the last entry in a that is smaller than or equal to it is [0, 4, 6], which corresponds to indices [0, 2, 3]. That is exactly what we get if we do:

>>> np.searchsorted(a, b, side='right') - 1
array([0, 2, 3])
>>> a[np.searchsorted(a, b, side='right') - 1]
array([0, 4, 6])

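If you don't use side='right', you get a wrong value for the second item, where the same timestamp appears in both arrays. The default side='left' returns the position of the first entry in a that is greater than or equal to the query, so subtracting 1 gives the last entry strictly smaller than it: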

>>> np.searchsorted(a, b) - 1
array([0, 1, 3])
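
One edge case is worth guarding against: if an item in b comes before every entry in a, the index comes out as -1, which NumPy silently reads as "last element". A rough sketch of one way to handle that is below. It assumes a is sorted, adds an extra value of -1 to b purely for illustration, and uses a mask much like the one in the question, keeping only non-negative indices:

```python
import numpy as np

a = np.arange(0, 10, 2)        # array([0, 2, 4, 6, 8]), must be sorted
b = np.array([-1, 1, 4, 7])    # -1 precedes every entry in a (added for illustration)

# Index of the last entry in a that is <= each item in b; -1 means "no such entry".
idx = np.searchsorted(a, b, side="right") - 1

# Keep only the items in b that have an earlier-or-equal entry in a.
mask = idx >= 0

matched_a = a[idx[mask]]       # entries from a, one per surviving item in b
matched_b = b[mask]            # the items in b that found a match
```

With this data, idx is [-1, 0, 2, 3], so the first item of b is dropped and the remaining pairs are (0, 1), (4, 4) and (6, 7).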
Jaime