0

I have two columns of data restaurant name and reviewer grades:

   name                            grades
0  Honey'S Thai Pavilion           [{u'date': 2014-08-12 00:00:00, u'grade'..  
1  Siam Sqaure Thai Cuisine        [{u'date': 2014-11-06 00:00:00, u'grade'...

The problem is one column is a list of multiple 'date,' 'grade' and 'score' pairings in JSON (well technically BSON since this is the sample data set from the MongoDB tutorial). I need to break out the grades column so I have a resulting data frame like below:

name                       Date                   Grade         Score
Honey'S Thai Pavilion      2014-08-12 00:00:00    A             6
Honey'S Thai Pavilion      2015-03-14 00:00:00    B             5
Honey'S Thai Pavilion      2013-07-15 00:00:00    C             6
Siam Sqaure Thai Cuisine   2014-11-06 00:00:00    A             3
Siam Sqaure Thai Cuisine   2015-06-06 00:00:00    B             2

So I need to split out one column but retain the restaurant name. The code below achieves getting the grades column into a nice looking data frame, but I can't figure out how to keep the restaurant name.

    from pymongo import MongoClient
    import pymongo
    import pandas as pd

    client = MongoClient()

    db = client.test

)
    cursor2 = db.restaurants.find().sort([
        ("borough", pymongo.ASCENDING),
        ("cuisine", pymongo.DESCENDING)
    ])

    #cursor.sort("cuisine",pymongo.ASCENDING)
    data = pd.DataFrame(list(cursor2))[['name', 'grades']]

    data_list= []
    for i in range(0, len(data.grades)):
        g_data = pd.DataFrame(data.grades[i])
        data_list.append(g_data)

    result = pd.concat(data_list)
    print result.head(100)
user2907249
  • 839
  • 7
  • 14
  • 32
  • I don't have mongo installed. Is the 1st box you show the first rows of `data` in your code? If it is, I think it would be more or less easy to solve with `apply`, there's many questions around but probably [this one](http://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns) will help you. If you find issues please post them. Hope it helps. – lrnzcig Sep 24 '15 at 14:40

1 Answers1

1

Don't know pandas much but you can flatten your results from the mongo cursor using a generator expression and then feed the generator to pandas data frame like this:

flattened_data = (
    {
        'name': record['name'],
        'date': grade['date'],
        'grade': grade['grade'],
        'score': grade.get('score')
    }
    for record in cursor2
    for grade in record['grades']
)
result = pd.DataFrame(flattened_data)[['name', 'date', 'grade', 'score']]
print result.head(100)

This way, you dont need to build the data_list list over that for loop.

dopstar
  • 1,478
  • 10
  • 20
  • That works but why did you use "grade.get('score')" rather than just grade['score'] like the others? I tried it both ways and it seems to work. – user2907249 Sep 24 '15 at 22:19
  • Because I was not exactly sure of how you data actually looked like. You did not show score in your data. So grade.get('score') would not fail your script if you did not have the score key but grade['score'] would raise a KeyError. – dopstar Sep 25 '15 at 12:21