44

I have two DataFrames in pandas, trying to merge them. But pandas keeps changing the order. I've tried setting indexes, resetting them, no matter what I do, I can't get the returned output to have the rows in the same order. Is there a trick? Note we start out with the loans order 'a,b,c' but after the merge, it's "a,c,b".

import pandas
loans = [  'a',  'b', 'c' ]
states = [  'OR',  'CA', 'OR' ]
x = pandas.DataFrame({ 'loan' : loans, 'state' : states })
y = pandas.DataFrame({ 'state' : [ 'CA', 'OR' ], 'value' : [ 1, 2]})
z = x.merge(y, how='left', on='state')

But now the order is no longer the original 'a,b,c'. Any ideas? I'm using pandas version 11.

user2543623
  • 1,452
  • 2
  • 15
  • 24
  • 2
    For one thing, you need to pass `sort=False` or it will sort on the join key, which you don't want. But that isn't sufficient to solve the problem; the unsorted order is still going to group together all rows that came from the same source row. One easy workaround is to do `x.merge(x.merge(y, how='left', on='state', sort=False))`, which will merge each row in `x` with the corresponding for in the merge, which restores the original order of `x`. But hopefully there's a better solution that's escaping my brain at the moment. – abarnert Nov 26 '13 at 01:15
  • @abarnert, I think that we could use `.join()` or `.update()` which (for some reason) they do preserve the order. – Outcast Jul 18 '19 at 09:39
  • 1
    The example in the question no longer holds, now testing with pandas==1.3.5 – mona-mk Aug 25 '22 at 10:41

6 Answers6

30

Hopefully someone will provide a better answer, but in case no one does, this will definitely work, so…

Zeroth, I'm assuming you don't want to just end up sorted on loan, but to preserve whatever original order was in x, which may or may not have anything to do with the order of the loan column. (Otherwise, the problem is easier, and less interesting.)

First, you're asking it to sort based on the join keys. As the docs explain, that's the default when you don't pass a sort argument.


Second, if you don't sort based on the join keys, the rows will end up grouped together, such that two rows that merged from the same source row end up next to each other, which means you're still going to get a, c, b.

You can work around this by getting the rows grouped together in the order they appear in the original x by just merging again with x (on either side, it doesn't really matter), or by reindexing based on x if you prefer. Like this:

x.merge(x.merge(y, how='left', on='state', sort=False))

Alternatively, you can cram an x-index in there with reset_index, then just sort on that, like this:

x.reset_index().merge(y, how='left', on='state', sort=False).sort('index')

Either way obviously seems a bit wasteful, and clumsy… so, as I said, hopefully there's a better answer that I'm just not seeing at the moment. But if not, that works.

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • Yes, that seems to work. Seems wasteful, but it does work. Thanks. – user2543623 Nov 26 '13 at 17:27
  • @Wes Mckinney: I haven't been able to find an explanation for why the indices get re-ordered. It's quite frustrating as I've been working with this all afternoon thinking it' a bug in my code. Any explanation would be greatly appreciated! – slaw Oct 02 '14 at 22:29
  • @abarnert Thanks. Should this work with non-`left` merges too? It produces working code, but hard to see if the result is correct (I'm guessing it works, since nb of rows with and without using this sorting are the same afterwards...) – The Unfun Cat Jun 18 '15 at 20:43
  • For `pd.merge(x,y)` default is `sort=True`, however for `x.merge(y)`, default is `sort=False` – user3226167 Jan 27 '17 at 02:09
  • 2
    I needed .sort_index() instead of .sort('index') in the latest version of Pandas – Josiah Yoder Jul 25 '18 at 17:38
  • @JosiahYoder If you're using a newer version of Pandas, doesn't `ordered_merge` (from [Claygirl's answer](https://stackoverflow.com/a/36947209/908494) remove the need for double-sorting like this in the first place? (In fact, I think it could have been done more simply even in 2013, but if I wanted to revisit this question, I'd want to do it with 0.24, not 0.7…) – abarnert Jul 25 '18 at 17:52
  • @abarnert ordered_merge is still undocumented, and I could not figure out from examples online if it assumed the data was already in order or not. – Josiah Yoder Jul 25 '18 at 17:55
  • @JosiahYoder IIRC, it had some flaw, so they deprecated it and replaced it with a renamed function. – abarnert Jul 25 '18 at 18:00
  • @JosiahYoder At any rate, I haven't thought about this problem in half a decade, and, as you can see from my answer, I wasn't too confident in my answer even for 0.7, so… take my guesses with a large grain of salt here. – abarnert Jul 25 '18 at 18:01
  • @abarnert I confirmed that `(years.UniqueCol == joined.UniqueCol).all()` was true, and I would recommend that others do that, too! – Josiah Yoder Jul 25 '18 at 18:03
  • Or perhaps you could use `.join()` or `.update()` which (for some reason) they do preserve the order. – Outcast Jul 18 '19 at 09:38
  • sorting on a dataframe with `.sort('index')` has not worked for me, because it is not available anymore in latest versions of pandas. I had to use `.sort_values(by='index')`. – Bernhard Dec 25 '21 at 12:07
8

I might have a much more simple solution:

df_z = df_x.join(df_y.set_index('state'), on = 'state')

Hope it helps

Laurent T
  • 275
  • 3
  • 7
  • 1
    If `state` in the `y` DataFrame is sort of an index (or primary key equivalent of a database table) and what you are trying to use this index to add `value` to each `loan` with a state from `y`, then this solution is hands down the most proper one. – Mitali Cyrus Jan 20 '20 at 17:06
  • 1
    Yes, State is unique and present (KEY) for df_y in my example. The order will be the same than df_x without taking into account the order of df_y – Laurent T Jan 22 '20 at 01:35
4

The fastest way I've found to merge and restore order - if you are merging "left" - is to include the original order as a column in the left dataframe before merging, then use that to restore the order after merging:

import pandas
loans = [  'a',  'b', 'c' ]
states = [  'OR',  'CA', 'OR' ]
x = pandas.DataFrame({ 'loan' : loans, 'state' : states })
y = pandas.DataFrame({ 'state' : [ 'CA', 'OR' ], 'value' : [ 1, 2]})

import numpy as np
x["Order"] = np.arange(len(x))

z = x.merge(y, how='left', on='state').set_index("Order").ix[np.arange(len(x)), :]

This method is faster than sorting. Here it is as a function:

def mergeLeftInOrder(x, y, on=None):
    x = x.copy()
    x["Order"] = np.arange(len(x))
    z = x.merge(y, how='left', on=on).set_index("Order").ix[np.arange(len(x)), :]
    return z
KCzar
  • 1,024
  • 1
  • 9
  • 11
  • 1
    even (marginally) faster and more readable version: ``` def mergeLeftInOrder2(x, y, on=None): x = x.copy() x["Order"] = np.arange(len(x)) z = x.merge(y, how='left', on=on).sort("Order") return z.drop("Order", 1) ``` – Boris Gorelik Jul 14 '15 at 09:00
  • I actually can't even recreate the OP's problem anymore - doesn't seem to be a problem on pandas 0.15.2, so all timing issues would seem to be moot. I think I also would've had to make it .loc instead of .ix - oversight. When I originally came up with this solution (long time ago) I had a different version of pandas, so I figure I'll leave it as it is. – KCzar Jul 14 '15 at 21:06
  • I was able to recreate the issue by changing the input a bit. See below. The function here (mergeLeftInOrder) fixed it. import pandas loans = [ 'a', 'b', 'c', 'd' ] states = [ 'CA', 'IL', 'CA', 'OR' ] x = pandas.DataFrame({ 'loan' : loans, 'state' : states }) y = pandas.DataFrame({ 'state' : [ 'OR', 'CA' ], 'value' : [ 1, 2]}) z = x.merge(y, how='left', on='state') – roundsquare Oct 05 '18 at 14:30
3

Pandas has a merge_ordered function, so your solution is now as simple as:

z = pd.merge_ordered(x, y, on='state')
Asclepius
  • 57,944
  • 17
  • 167
  • 143
Claygirl
  • 339
  • 2
  • 7
0

I tried the following and it does preserve the original order of loans:

z = pandas.merge(x, y, on='state', how='left')

I hope it helps! Please do let me know if there are any drawbacks of my method. Thanks.

Ken Wee
  • 1
  • 2
-3

Use pd.merge_ordered(), documentation here.

For your example,

z = pd.merge_ordered(x, y, how='left', on='state')

EDIT: Just wanted to point out that default behavior for this function is an outer merge, different from the default behavior of the more common .merge()

filup
  • 195
  • 10
  • Does not preserve the order of the original dataframes but it simply sorts the output dataframe. – Outcast Jul 18 '19 at 09:40