
I simply want to reorder the rows of my pandas dataframe so that col1 matches the order of the elements in the external list my_order.

import pandas as pd

d = {'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]}
df = pd.DataFrame(data=d)
my_order = ['B', 'C', 'A']

This post, sorting by a custom list in pandas, covers the ordering, and applying its approach to my data produces

d = {'col1': ['A', 'B', 'C'], 'col2': [1,2,3]}
df = pd.DataFrame(data=d)
my_order = ['B', 'C', 'A']
df.col1 = df.col1.astype("category")
df.col1 = df.col1.cat.set_categories(my_order)
df.sort_values(["col1"])

However, this seems like a wasteful amount of code relative to the equivalent R, which would simply be

df = data.frame(col1 = c('A','B','C'), col2 = c(1,2,3))
my_order = c('B', 'C', 'A')
df[match(my_order, df$col1),]

Ordering is expensive, and the Python version above takes three steps where R takes only one using the match function. Can Python not rival R in this case?

If this were done only once in my real-world example I wouldn't care much. But this is a process that will be repeated millions of times in a web server application, so a truly minimal, inexpensive path is the best approach.
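
The closest one-step analogue I have found is the sketch below, using pd.Index.get_indexer, which behaves like R's match (this assumes the values in col1 are unique, as in my example); I'm not sure whether it is actually the cheapest route:

import pandas as pd

d = {'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]}
df = pd.DataFrame(data=d)
my_order = ['B', 'C', 'A']

# get_indexer plays the role of R's match(): find the position of each
# element of my_order within col1, then take the rows in that order
positions = pd.Index(df['col1']).get_indexer(my_order)
print(df.iloc[positions])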

user350540
  • What do you mean by inexpensive? The number of lines of code is no indication of efficiency. What makes you think the R code is more efficient? Also note that setting the variable to be categorical generally has good performance implications, will lead to lower memory use, and is just a good idea regardless of whether you're sorting or not. – Dan Jul 07 '20 at 13:10
  • It would help if you explained how you are creating your dataframe in the first place. You may well be able to specify col1 as categorical then. – Dan Jul 07 '20 at 13:15
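
Following up on the second comment, here is a sketch of what building col1 as an ordered categorical from the start might look like (assuming my_order is known when the frame is created):

import pandas as pd

my_order = ['B', 'C', 'A']
# col1 is constructed as an ordered categorical up front,
# so no conversion step is needed before sorting
df = pd.DataFrame({
    'col1': pd.Categorical(['A', 'B', 'C'], categories=my_order, ordered=True),
    'col2': [1, 2, 3],
})
print(df.sort_values('col1'))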

2 Answers


We do have the same thing in pandas: pd.Categorical + argsort

df.iloc[pd.Categorical(df.col1,my_order).argsort()]
  col1  col2
1    B     2
2    C     3
0    A     1

Update

df.iloc[df.col1.map(dict(zip(my_order,range(len(my_order))))).argsort()]
  col1  col2
1    B     2
2    C     3
0    A     1
BENY
  • Why would you do this over setting col1 to have type `Categorical`? It seems very wasteful since you're doing all the work to create it anyway. – Dan Jul 07 '20 at 13:14
  • @Dan do you feel better? – BENY Jul 07 '20 at 13:17
  • I mean why not just make col1 have a Categorical dtype. I see no advantage of either of these methods over that. – Dan Jul 07 '20 at 13:22
  • @Dan I see the advantage, since it does not change the original data type. – BENY Jul 07 '20 at 13:23
  • Yes but why wouldn't you want to change the original datatype? Categorical is almost certainly better than object. – Dan Jul 07 '20 at 13:25

I don't understand why you don't like the Python version. Just because you decided to write it in more lines than in R? You didn't have to:

import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
df["col1"] = df["col1"].astype(CategoricalDtype(['B', 'C', 'A'], ordered=True))
df.sort_values(["col1"])

This is the same solution you posted; I just don't see what about it you consider worse than R. Using the categorical datatype will also use less memory, so I'm not sure why you wouldn't want to do it this way.
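
If you want to check the memory point on your own data, here is a quick sketch (exact numbers depend on the data; deep=True counts the Python string objects themselves):

import pandas as pd

# a column with many repeated strings: object dtype vs categorical
s = pd.Series(['A', 'B', 'C'] * 100_000)
c = s.astype('category')

print(s.memory_usage(deep=True))
print(c.memory_usage(deep=True))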

Dan
  • Looks to be a very good solution. Generally there is some overhead cost with type conversion, and in the R example I gave there is no need for it. In the Python case it *seems* to be needed, at least in the way I have written the code. The cost might be small in this one instance, but over millions of operations it could add up; am I correct on that? – user350540 Jul 07 '20 at 19:55
  • Unless you profile both this code and the R code, it's really difficult to say if you are correct or not. I doubt this will be slower than R. I would also warn you against premature optimization, write the code and stress test it with a realistic load before you decide it might be too slow. – Dan Jul 08 '20 at 12:40
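
Along the lines of that last comment, a minimal timing sketch with timeit, comparing two of the one-liners from this thread (the helper names here are just placeholders; run it against data shaped like your real workload):

import timeit

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
my_order = ['B', 'C', 'A']

# temporary Categorical + argsort (leaves col1 as object dtype)
def by_categorical():
    return df.iloc[pd.Categorical(df.col1, my_order).argsort()]

# positional lookup, analogous to R's match()
def by_match_style():
    return df.iloc[pd.Index(df['col1']).get_indexer(my_order)]

print(timeit.timeit(by_categorical, number=10_000))
print(timeit.timeit(by_match_style, number=10_000))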