3

I want to sort a nested dict in Python via pandas.

import pandas as pd 

# Data structure (nested list):
# {
#   category_name: [[rank, id], ...],
#   ...
# }

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]]
}

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank'])
df.sort_values(['Rank'], ascending=True, inplace=True) # this only sorts the list of lists

Can anyone tell me how to reach my goal? With pandas I can call sort_values() on the second column, but I can't figure out how to sort the nested lists inside each dict entry.

I want to sort ascending by the rank, not the id.

Patrick
  • 97
  • 8

4 Answers

5

The fastest option is to apply sort() (note that the sorting occurs in place, so don't assign back to df.Rank in this case):

df.Rank.apply(list.sort)

Or apply sorted() with a custom key and assign back to df.Rank:

df.Rank = df.Rank.apply(lambda row: sorted(row, key=lambda x: x[0]))

Output in either case:

>>> df
         Category                                  Rank
0  category_name1  [[1, 32512], [2, 12345], [3, 32382]]
1  category_name2  [[1, 24623], [3, 12345], [9, 25318]]
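
For reference, here is a minimal self-contained sketch of both options using the sample data from the question. The names `df_sorted` and `df_inplace` are only illustrative, and the deep copy is an assumption made so that the two variants do not share list objects.

```python
import copy

import pandas as pd

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]],
}

# Option 1: build new sorted lists, keyed on the rank (first element of each pair).
df_sorted = pd.DataFrame(list(all_categories.items()), columns=["Category", "Rank"])
df_sorted["Rank"] = df_sorted["Rank"].apply(
    lambda pairs: sorted(pairs, key=lambda pair: pair[0])
)

# Option 2: sort each list in place. list.sort returns None, so the result of
# apply() is discarded on purpose. It compares whole [rank, id] pairs, so any
# ties on rank would be broken by id.
df_inplace = pd.DataFrame(
    list(copy.deepcopy(all_categories).items()), columns=["Category", "Rank"]
)
df_inplace["Rank"].apply(list.sort)

print(df_sorted["Rank"].tolist())
print(df_inplace["Rank"].tolist())
```

With this sample data both variants should give the same ordering, because no rank is repeated within a category.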

This is the perfplot of sort() vs sorted() vs explode():

[Figure: timing results]

import perfplot

def explode(df):
    df = df.explode('Rank')
    df['rank_num'] = df.Rank.str[0]
    df = df.sort_values(['Category', 'rank_num']).groupby('Category', as_index=False).agg(list)
    return df

def apply_sort(df):
    df.Rank.apply(list.sort)
    return df

def apply_sorted(df):
    df.Rank = df.Rank.apply(lambda row: sorted(row, key=lambda x: x[0]))
    return df

perfplot.show(
    setup=lambda n: pd.concat([df] * n),
    n_range=[2 ** k for k in range(25)],
    kernels=[explode, apply_sort, apply_sorted],
    equality_check=None,
)

To sort only the rows whose lists have at least 10 entries, mask the rows with str.len() and loc[] (the other rows are kept, unsorted):

mask = df.Rank.str.len().ge(10)
df.loc[mask, 'Rank'].apply(list.sort)
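
A small sketch of the masking step, assuming a hypothetical threshold `min_len = 3` so that the effect is visible on a tiny sample (the sample data and the names `data`, `min_len` and `filtered` are made up for illustration). It shows both keeping all rows while sorting only the masked ones, and dropping the short rows entirely.

```python
import pandas as pd

# Made-up sample: the second category has fewer entries than the threshold.
data = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [1, 24623]],
}
df = pd.DataFrame(list(data.items()), columns=["Category", "Rank"])

min_len = 3  # hypothetical threshold; the answer above uses 10
mask = df["Rank"].str.len().ge(min_len)

# Sort only the masked rows in place; every row stays in df.
df.loc[mask, "Rank"].apply(list.sort)

# To also drop the rows with short lists, select with the same mask.
filtered = df.loc[mask].reset_index(drop=True)

print(df)
print(filtered)
```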
tdy
  • 36,675
  • 19
  • 86
  • 83
  • 1
    Thank you very much, also for the insights and the perfplot. I'm just curious: do you know how to ignore all `Rank` entries with fewer than `N=10` items? – Patrick Jun 14 '21 at 10:38
  • 1
    @Patrick you're welcome. to filter lists by length, you can mask those rows with `str.len()` and `loc[]` (answer updated) – tdy Jun 14 '21 at 16:16
  • Thank you again, but the code doesn't count or show only entries with `n>10`, it's the same output from before. I mean the `len(all_categories['Rank']) > 10` – Patrick Jun 16 '21 at 15:06
  • @Patrick correct, the current code just limits the sorting to the masked rows but still retains all the rows. if you want those other rows to be removed, you can do something like this instead: `df = df.loc[mask]; df.Rank.apply(list.sort)` – tdy Jun 16 '21 at 19:02
1

Try

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank']).explode('Rank')
# sort the exploded rows by rank (first element of each pair); key= needs pandas >= 1.1
df = df.sort_values('Rank', key=lambda s: s.str[0])

df = df.groupby('Category').agg(list).reset_index()

To convert the result back to a dict:

dict(df.agg(list, axis=1).values)
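
As an alternative sketch for the conversion back to a dict, `dict(zip(...))` over the two columns avoids the row-wise `agg`. This assumes the frame has one row per category with the sorted list in `Rank`. The name `sorted_categories` is illustrative, and the lists are sorted by rank here with `sorted()` rather than with `explode`.

```python
import pandas as pd

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]],
}

df = pd.DataFrame(list(all_categories.items()), columns=["Category", "Rank"])

# Sort each nested list by rank (the first element of every [rank, id] pair).
df["Rank"] = df["Rank"].apply(lambda pairs: sorted(pairs, key=lambda pair: pair[0]))

# Pair each category name with its sorted list to rebuild the dict.
sorted_categories = dict(zip(df["Category"], df["Rank"]))

print(sorted_categories)
```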
he xiao
  • 11
  • 1
0

Try:

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank'])
df.set_index('Rank', inplace=True)
df.sort_index(inplace=True)
df.reset_index(inplace=True)

Or:

df = pd.DataFrame(all_categories.items(), columns=['Category', 'Rank'])
df = df.set_index('Rank').sort_index().reset_index()
irc1209
  • 51
  • 6
  • It's not working, same result as above, if I sort the list of lists. It's not even sorted by `id` – Patrick Jun 13 '21 at 16:03
0

It is much more efficient to use df.explode and then sort the values. It will be vectorized.

df = df.explode('Rank')
df['rank_num'] = df.Rank.str[0]

# wrap the chain in parentheses so it can span several lines
(
    df.sort_values(['Category', 'rank_num'])
      .groupby('Category', as_index=False)
      .agg(list)
)

Output

         Category                                  Rank   rank_num
0  category_name1  [[1, 32512], [2, 12345], [3, 32382]]  [1, 2, 3]
1  category_name2  [[1, 24623], [3, 12345], [9, 25318]]  [1, 3, 9]
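
A self-contained sketch of the same explode-based idea, starting from the question's dict. The `astype(int)` cast, the `ignore_index=True` argument and the name `out` are additions for illustration. Selecting only the `Rank` column before aggregating leaves the helper column out of the result.

```python
import pandas as pd

all_categories = {
    "category_name1": [[2, 12345], [1, 32512], [3, 32382]],
    "category_name2": [[3, 12345], [9, 25318], [1, 24623]],
}

df = pd.DataFrame(list(all_categories.items()), columns=["Category", "Rank"])

# One row per [rank, id] pair; ignore_index gives the rows a fresh unique index.
long_df = df.explode("Rank", ignore_index=True)

# Pull the rank out of each pair into a numeric helper column.
long_df["rank_num"] = long_df["Rank"].str[0].astype(int)

# Sort by category and rank, then collect the pairs back into one list per category.
out = (
    long_df.sort_values(["Category", "rank_num"])
    .groupby("Category")["Rank"]
    .agg(list)
    .reset_index()
)

print(out)
```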
Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
  • i did some timings and `explode` seems to be slower than `apply` in this case (i guess because `explode` still requires `groupby`+`agg`) – tdy Jun 14 '21 at 05:40