10

If I have two dataframes (or series) that are already sorted on compatible keys, I'd like to be able to cheaply merge them together and maintain sortedness. I can't see a way to do that other than via concat() and explicit sort()

a = pd.DataFrame([0,1,2,3], index=[1,2,3,5], columns=['x'])
b = pd.DataFrame([4,5,6,7], index=[0,1,4,6], columns=['x'])
print pd.concat([a,b])
print pd.concat([a,b]).sort()

   x
1  0
2  1
3  2
5  3
0  4
1  5
4  6
6  7

   x
0  4
1  0
1  5
2  1
3  2
4  6
5  3
6  7

It looks like there has been a bit of related discussion with numpy arrays, suggesting an 'interleave' method, but I haven't found a good answer.

patricksurry
  • 5,508
  • 2
  • 27
  • 38
  • Good question! Interweave is different to sorting two already sorted arrays, but I'm *sure* I've seen a question about sorting two already sorted arrays in numpy (I can't find it)... it must be part of the mergesort implementation... :s – Andy Hayden May 01 '13 at 13:25
  • 1
    http://stackoverflow.com/questions/12427146/combine-two-arrays-and-sort – Andy Hayden May 01 '13 at 13:38
  • I like the heapq.merge() suggestion in the commments there, maybe I can use that but it doesn't seem like a native numpy thing? I want to take advantage of sortedness since with very large series merging when we know it's sorted should be linear in total length of the arrays, whereas any sort will be non-linear. (Ironically when I started using pandas I assumed the "merge" operation did just this, instead of being a form of join.) – patricksurry May 01 '13 at 14:35
  • I think what you want to do is a mergesort with `sort_index` (if I'm understanding the accepted answer), but that doesn't seem to be an option... yet. – Andy Hayden May 01 '13 at 16:42
  • @patricksurry "any sort will be non-linear", that's absolutely *false*. When we say that sort is `Ω(nlogn)` we refer to the average and worst cases. In your case calling the native `list.sort` method would take linear-time, since Timsort is *really good* at dealing with partially sorted data. – Bakuriu May 01 '13 at 17:16
  • That's a good point, I guess I mean "any naive sort". I'm essentially looking for a specialized sort that combines two already sorted arrays of different lengths. – patricksurry May 01 '13 at 17:44
  • @Bakuriu I don't *think* timsort is implemented in numpy? – Andy Hayden May 01 '13 at 21:28
  • @AndyHayden I never stated that. I just provided an example of sorting in python that takes linear time with the OP's data. – Bakuriu May 02 '13 at 05:20

1 Answers1

1

If we limit the problem to a and b having only one column, then I would go through this path:

s = a.merge(b, how='outer', left_index=True, right_index=True)
s.stack().reset_index(level=1, drop=True)
Zeugma
  • 31,231
  • 9
  • 69
  • 81