2

I have a file with data as follows:

 NAME,    FOOD
"John", {"Apple":10, "Banana":10,  "Mango":5}
"Jack", {"Apple":5, "Mango":10}
"Jill", {"Apple:7, "Banana":15}

I want to find the highest value of each key contained in the column named food. For example, I want to print (or store it as a variable) something like follows: John eats most of the Apples, which is 10. Jill eats most of the Bananas, which is 15. Jack eats most of the Mango, which is 10.

I am aware of iterrows() or itertuples(), but I could not proceed after iterating over the rows. Is there any good method to follow when dealing with such data of >10000 rows?

kingmakerking
  • 2,017
  • 2
  • 28
  • 44

2 Answers2

1

I think here is main problem read csv to DataFrame, for parsing I use yaml:

import pandas as pd
from pandas.compat import StringIO
import yaml, csv

temp=u""""NAME,    FOOD
John", {"Apple":10, "Banana":10,  "Mango":5}
"Jack", {"Apple":5, "Mango":10}
"Jill", {"Apple":7, "Banana":15}"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), 
                 header=None, 
                 sep='{', 
                 quoting =csv.QUOTE_NONE, 
                 skiprows=1, 
                 names=['a','b'])

df['b'] = ('{' + df['b']).apply(yaml.load)
df['a'] = df['a'].str.strip(', "')
print (df)    
      a                                        b
0  John  {'Apple': 10, 'Mango': 5, 'Banana': 10}
1  Jack                {'Apple': 5, 'Mango': 10}
2  Jill               {'Apple': 7, 'Banana': 15}

print (type(df.loc[0,'b']))
<class 'dict'>

Then create new DataFrame by constructor and get idxmax and max values per rows:

df1 = pd.DataFrame(df['b'].values.tolist())
df['b'] = df1.idxmax(1)
df['c'] = df1.max(1).astype(int)
print (df)
      a       b   c
0  John   Apple  10
1  Jack   Mango  10
2  Jill  Banana  15

EDIT:

Maybe also works json parser:

import json, csv

temp=u""""NAME,    FOOD
John", {"Apple":10, "Banana":10,  "Mango":5}
"Jack", {"Apple":5, "Mango":10}
"Jill", {"Apple":7, "Banana":15}"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), 
                 header=None, 
                 sep='{', 
                 quoting =csv.QUOTE_NONE, 
                 skiprows=1, 
                 names=['a','b'])

df['b'] = ('{' + df['b']).apply(json.loads)
df['a'] = df['a'].str.strip(', "')
print (df) 
      a                                        b
0  John  {'Mango': 5, 'Apple': 10, 'Banana': 10}
1  Jack                {'Mango': 10, 'Apple': 5}
2  Jill               {'Apple': 7, 'Banana': 15}

print (type(df.loc[0,'b']))
<class 'dict'>
Community
  • 1
  • 1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

I didn't get the chance to run this on your actual file, so you'll have to deal with reading dictionaries that are stored as string in your file into actual dictionaries (perhaps use something like eval), but this ought to work:

data = {}

with open('file.csv', 'r') as f:
    for idx, line in enumerate(f):
        if idx > 0:
            name, line_data = line.split(',')
            data[name] = eval(line_data)

for k, v in data.items():
    top = max(v.items(), key=lambda x: x[1])
    print('%s eats the most of %s (n = %d)' % (k, top[0], int(top[1]))
blacksite
  • 12,086
  • 10
  • 64
  • 109