I have a DataFrame like this:

id  asn      orgs
0   3320    {'Deutsche Telekom AG': 2288}
1   47886   {'Joyent': 16, 'Equinix (Netherlands) B.V.': 7}
2   47601   {'fusion services': 1024, 'GCE Global Maritime':16859}  
3   33438   {'Highwinds Network Group': 893}

I would like to sort the 'orgs' column, whose entries are dictionaries, and then extract the (key, value) pair with the highest value into two different columns. Like this:

id  asn      org                      value
0   3320    'Deutsche Telekom AG'     2288
1   47886   'Joyent'                  16
2   47601   'GCE Global Maritime'     16859 
3   33438   'Highwinds Network Group' 893

Currently I am running this code, but it does not sort properly, and I am not sure how to extract the pair with the highest value.

df.orgs.apply(lambda x : sorted(x.items(),key=operator.itemgetter(1),reverse=True))

which gave me a list like this:

id  asn      orgs
0   3320    [('Deutsche Telekom AG', 2288)]
1   47886   [('Joyent', 16),( 'Equinix (Netherlands) B.V.', 7)]
2   47601   [('GCE Global Maritime',16859),('fusion services', 1024)]   
3   33438   [('Highwinds Network Group', 893)]

Now how can I put the key and the value of the highest pair into two separate columns? Can anybody help?

EdChum
UserYmY
  • Well what you're asking for is just the max value, the sorting is a bit irrelevant no? – EdChum Apr 20 '15 at 08:49
  • @EdChum no because I would like to have both the key and the value in separate columns of the pair with maximum value. – UserYmY Apr 20 '15 at 08:50

2 Answers

Another approach: define a function that calls max on the dict and returns a Series, so you can assign to multiple columns (function body adapted from @Alex Martelli's answer):

In [17]:

def func(x):
    # key whose value is the largest in the dict
    k = max(x, key=x.get)
    return pd.Series([k, x[k]])
df[['orgs', 'value']] = df['orgs'].apply(func)
df

Out[17]:
     asn  id                     orgs  value
0   3320   0      Deutsche Telekom AG   2288
1  47886   1                   Joyent     16
2  47601   2      GCE Global Maritime  16859
3  33438   3  Highwinds Network Group    893
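As a self-contained variant, here is a minimal sketch that picks the highest-valued pair with max over the dict items and builds both columns from a list of tuples. The helper name top_item and the output column name 'org' are assumptions chosen for illustration, not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'asn': [3320, 47886, 47601, 33438],
    'orgs': [{'Deutsche Telekom AG': 2288},
             {'Joyent': 16, 'Equinix (Netherlands) B.V.': 7},
             {'fusion services': 1024, 'GCE Global Maritime': 16859},
             {'Highwinds Network Group': 893}],
})

def top_item(d):
    # hypothetical helper: (key, value) pair with the largest value
    return max(d.items(), key=lambda kv: kv[1])

# one (key, value) tuple per row, expanded into two named columns
top = pd.DataFrame(df['orgs'].map(top_item).tolist(),
                   index=df.index, columns=['org', 'value'])
result = df[['id', 'asn']].join(top)
```

This assumes every dict in the column is non-empty; max raises a ValueError on an empty sequence.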

EDIT

If your data has empty dicts, then you can just test the len:

In [34]:

df = pd.DataFrame({'id':[0,1,2,3,4],
                   'asn':[3320,47886,47601,33438,56],
                   'orgs':[{'Deutsche Telekom AG': 2288},
                           {'Joyent': 16, 'Equinix (Netherlands) B.V.': 7},
                           {'fusion services': 1024, 'GCE Global Maritime':16859},
                           {'Highwinds Network Group': 893},{}]})
df
Out[34]:
     asn  id                                               orgs
0   3320   0                      {'Deutsche Telekom AG': 2288}
1  47886   1    {'Equinix (Netherlands) B.V.': 7, 'Joyent': 16}
2  47601   2  {'GCE Global Maritime': 16859, 'fusion service...
3  33438   3                   {'Highwinds Network Group': 893}
4     56   4                                                 {}
In [36]:

import numpy as np

def func(x):
    # only look for a maximum when the dict is non-empty
    if len(x) > 0:
        k = max(x, key=x.get)
        return pd.Series([k, x[k]])
    # empty dict: fill both columns with NaN
    return pd.Series([np.nan, np.nan])

df[['orgs', 'value']] = df['orgs'].apply(func)
df

Out[36]:
     asn  id                     orgs  value
0   3320   0      Deutsche Telekom AG   2288
1  47886   1                   Joyent     16
2  47601   2      GCE Global Maritime  16859
3  33438   3  Highwinds Network Group    893
4     56   4                      NaN    NaN
EdChum
This should work:

In [1]: import pandas as pd  
In [2]: import operator
In [3]: df = pd.DataFrame({ 'id' : [0,1,2,3],
   ...:                      'asn' : [3320, 47886, 47601, 33438],
   ...:                      'orgs' : [{'Deutsche Telekom AG': 2288}, {'Joyent': 16, 'Equinix (Netherlands) B.V.': 7}, {'fusion services': 1024, 'GCE Global Maritime':16859}, {'Highwinds Network Group': 893}]
   ...:                    })

In [4]: df.orgs, df['value'] = zip(*df.orgs.apply(lambda x : sorted(x.items(),key=operator.itemgetter(1),reverse=True)[0]))

In [5]: df
Out[5]:
     asn  id                     orgs  value
0   3320   0      Deutsche Telekom AG   2288
1  47886   1                   Joyent     16
2  47601   2      GCE Global Maritime  16859
3  33438   3  Highwinds Network Group    893

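I took the first element of each row's sorted dict items, which is the (key, value) pair with the highest value, then used zip(*...) to split those pairs into a sequence of keys and a sequence of values, and assigned them to df.orgs and df['value'].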
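To make the zip(*...) step concrete, here is a small plain-Python sketch of how unpacking a list of (key, value) pairs into zip transposes it into two tuples. The variable names are only illustrative.

```python
# one (key, value) pair per row, as produced by sorted(...)[0]
pairs = [('Deutsche Telekom AG', 2288),
         ('Joyent', 16),
         ('GCE Global Maritime', 16859)]

# zip(*pairs) is zip(pairs[0], pairs[1], pairs[2]): it groups all first
# elements together and all second elements together
names, values = zip(*pairs)
# names  -> ('Deutsche Telekom AG', 'Joyent', 'GCE Global Maritime')
# values -> (2288, 16, 16859)
```

Each resulting tuple has one entry per row, which is why it can be assigned directly to a DataFrame column.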

For empty dictionaries:

In [3]: df = pd.DataFrame({ 'id' : [0,1,2,3],
   ...:                      'asn' : [3320, 47886, 47601, 33438],
   ...:                      'orgs' : [{'Deutsche Telekom AG': 2288}, {'Joyent': 16, 'Equinix (Netherlands) B.V.': 7}, {'fusion services': 1024, 'GCE Global Maritime':16859}, {}]
   ...:                    })
In [4]: df.orgs.apply(lambda x : sorted(x.items(),key=operator.itemgetter(1),reverse=True)[0] if len(x) else ('',''))
Out[4]:
0     (Deutsche Telekom AG, 2288)
1                    (Joyent, 16)
2    (GCE Global Maritime, 16859)
3                            (, )
Name: orgs, dtype: object

In [5]: df.orgs, df['value'] = zip(*df.orgs.apply(lambda x : sorted(x.items(),key=operator.itemgetter(1),reverse=True)[0] if len(x) else ('','')))

In [6]: df
Out[6]:
     asn  id                 orgs  value
0   3320   0  Deutsche Telekom AG   2288
1  47886   1               Joyent     16
2  47601   2  GCE Global Maritime  16859
3  33438   3
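If missing values are preferable to empty strings for empty dicts, one possible sketch uses the default argument of the built-in max (available since Python 3.4). The helper name top_item_or_none and the column name 'org' are assumptions for illustration.

```python
import operator
import pandas as pd

orgs = pd.Series([{'Joyent': 16, 'Equinix (Netherlands) B.V.': 7}, {}])

def top_item_or_none(d):
    # hypothetical helper: highest-valued (key, value) pair,
    # or (None, None) when the dict is empty
    return max(d.items(), key=operator.itemgetter(1), default=(None, None))

# expand the per-row tuples into two columns; empty dicts become missing values
top = pd.DataFrame(orgs.map(top_item_or_none).tolist(),
                   index=orgs.index, columns=['org', 'value'])
```

Rows that came from an empty dict can then be found with top['value'].isna().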
dting