16

Say I have two dataframes x and y in Pandas, I would like to fill in a column in x with the result of sorting a column in y. I tried this:

x['foo']  = y['bar'].order(ascending=False)

but it didn't work, I suspect because Pandas aligns indices between x and y (which have the same set of indices) during the assignment

How can I have Pandas fill in the x['foo'] with another column from another dataframe ignoring the alignment of indices?

Amelio Vazquez-Reina
  • 91,494
  • 132
  • 359
  • 564

3 Answers3

11

The simplest way I can think of to get pandas to ignore the indices is to give it something without indices to ignore. Starting from

>>> x = pd.DataFrame({"foo": [10,20,30]},index=[1,2,0])
>>> y = pd.DataFrame({"bar": [33,11,22]},index=[0,1,2])
>>> x
   foo
1   10
2   20
0   30
>>> y
   bar
0   33
1   11
2   22

We have the usual aligned approach:

>>> x["foo"] = y["bar"].order(ascending=False)
>>> x
   foo
1   11
2   22
0   33

Or an unaligned one, by setting x["foo"] to a list:

>>> x["foo"] = y["bar"].order(ascending=False).tolist()
>>> x
   foo
1   33
2   22
0   11
DSM
  • 342,061
  • 65
  • 592
  • 494
  • 2
    I see, thanks. Would `my_series.tolist()` be better/worse than `np.array(my_series)` or `my_series.values()` ? – Amelio Vazquez-Reina Apr 12 '13 at 20:17
  • 5
    Except in some boundary cases I don't know if it would make much difference. Come to think of it, `my_series.values` might be better at that, as it shouldn't have to go via a Python list, and so could be faster. [Checks.. yeah, at least sometimes it's faster.] – DSM Apr 12 '13 at 20:24
  • What's the difference between `to_list` and `tolist`? I suspect they're the same, but how can I make sure? – rishai Sep 29 '19 at 08:09
  • @AmelioVazquez-Reina Note that it's just `.values` without parentheses – BallpointBen Mar 28 '20 at 15:14
1

I tried the code but it seems that the order() method has been deprecated, which is no surprise since the initial question is quite old. So now we are left with sort_values() to achieve the same result. On top of that there's a refinement which consist in using to_numpy() since it is slightly faster and could be useful in case of big DataFrames (.values is even faster but it is recommended to use to_numpy() whenever working for production as explained here: https://stackoverflow.com/a/54324513/4909087)

>>> x["foo"] = y["bar"].sort_values(ascending=False)
>>> x
   foo
1   33
2   22
0   11

>>> %timeit x["foo"] = y["bar"].sort_values(ascending=False).to_list()
165 µs ± 965 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit x["foo"] = y["bar"].sort_values(ascending=False).to_numpy()
136 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit x["foo"] = y["bar"].sort_values(ascending=False).values
129 µs ± 826 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Iqigai
  • 326
  • 3
  • 10
1

Simplest:

x['foo']  = y['bar'].sort_values(ascending=False).values

Adding .values at the end allows you to ignore indices

PJ_
  • 473
  • 5
  • 9