1

My pandas DataFrame looks something like this:

---------------------------------------
 Name   | Stats                     
---------------------------------------
 Bob    | { age : 42, profession: IT }
 Jill   | { age : 35, profession: Engineer }
 Patric | { age : 37, profession: Student }
---------------------------------------

Where Stats is a class and age and profession are both properties on that class.

I'd like to sort that table by one of the properties on Stats. For example, sort it by the person's age so that the table looks like:

---------------------------------------
 Name   | Stats                     
---------------------------------------
 Jill   | { age : 35, profession: Engineer }
 Patric | { age : 37, profession: Student }
 Bob    | { age : 42, profession: IT }
---------------------------------------

Is there a way to do that with Pandas? I've only found ways to sort by the entire column

Thanks

Zain Rizvi
  • 23,586
  • 22
  • 91
  • 133

2 Answers2

2

A major point of using a dataframe is to type your columns for efficient storage and computational speed (e.g. int64, float64, object, etc). Your data is not well structured; you should likely have a separate column for each field in Stats. See Tidy Data for more info.

df2 = df[['Name']].assign(age=[d.get('age') for d in df['Stats']],
                          profession=[d.get('profession') for d in df['Stats']])

Then it is easy to work with your data.

>>> df2.sort_values('age')
     Name  age profession
1    Jill   35   Engineer
2  Patric   37    Student
0     Bob   42         IT
Alexander
  • 105,104
  • 32
  • 201
  • 196
2

Solution
You can use argsort to find the appropriate ordering and pass that to iloc. However, you need to create a dataframe in order to run argsort on the age column.

df.iloc[pd.DataFrame(df.Stats.values.tolist()).age.argsort()]

     Name                                  Stats
1    Jill  {'age': 35, 'profession': 'Engineer'}
2  Patric   {'age': 37, 'profession': 'Student'}
0     Bob        {'age': 42, 'profession': 'IT'}

After reading @Alexander's answer... I came up with something that combines his idea and mine. If anyone finds this part useful, please don't forget to upvote his answer.

df.iloc[np.argsort([x.get('age') for x in df.Stats])]

Timing
Small sample data

%timeit df.iloc[pd.DataFrame(df.Stats.values.tolist()).age.argsort()]
%timeit df.iloc[np.argsort([x.get('age') for x in df.Stats])]
%timeit df.iloc[np.argsort([x.get('age') for x in df.Stats.values.tolist()])]

1000 loops, best of 3: 756 µs per loop
1000 loops, best of 3: 225 µs per loop
1000 loops, best of 3: 207 µs per loop

Setup

df = pd.DataFrame(dict(
    Name='Bob Jill Patric'.split(),
    Stats=[
        dict(age=42, profession='IT'),
        dict(age=35, profession='Engineer'),
        dict(age=37, profession='Student')
    ]
))
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks. I'm curious, what made you think to try df.Stats.values.tolist()? If anything, I would have expected that to add to the overhead, not reduce it – Zain Rizvi Aug 02 '17 at 21:36
  • 1
    @ZainRizvi experience with this stuff is what made me think of it. If I want to use a list comprehension, its quicker to iterate through a list than through either a pandas series or numpy array. [**See this question**](https://stackoverflow.com/q/40593444/2336654) – piRSquared Aug 02 '17 at 21:38