37

I am new to pandas (well, to all things "programming"...), but have been encouraged to give it a try. I have a MongoDB database - "test" - with a collection called "tweets". I access the database in IPython:

import sys
import pymongo
from pymongo import Connection
connection = Connection()
db = connection.test
tweets = db.tweets
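
(Connection comes from the PyMongo 2.x API; it was removed in PyMongo 3, where the roughly equivalent setup would be:)

import pymongo

# MongoClient replaces the old Connection class in PyMongo 3+
connection = pymongo.MongoClient()  # connects to localhost:27017 by default
db = connection.test
tweets = db.tweets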

The structure of the documents in tweets is as follows:

{u'entities': {u'hashtags': [],
  u'symbols': [],
  u'urls': [],
  u'user_mentions': []},
 u'favorite_count': 0,
 u'favorited': False,
 u'filter_level': u'medium',
 u'geo': {u'coordinates': [placeholder coordinate, -placeholder coordinate], u'type': u'Point'},
 u'id': 349223842700472320L,
 u'id_str': u'349223842700472320',
 u'in_reply_to_screen_name': None,
 u'in_reply_to_status_id': None,
 u'in_reply_to_status_id_str': None,
 u'in_reply_to_user_id': None,
 u'in_reply_to_user_id_str': None,
 u'lang': u'en',
 u'place': {u'attributes': {},
  u'bounding_box': {u'coordinates': [[[placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate],
     [-placeholder coordinate, placeholder coordinate]]],
   u'type': u'Polygon'},
  u'country': u'placeholder country',
  u'country_code': u'example',
  u'full_name': u'name, xx',
  u'id': u'user id',
  u'name': u'name',
  u'place_type': u'city',
  u'url': u'http://api.twitter.com/1/geo/id/1820d77fb3f65055.json'},
 u'retweet_count': 0,
 u'retweeted': False,
 u'source': u'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 u'text': u'example text',
 u'truncated': False,
 u'user': {u'contributors_enabled': False,
  u'created_at': u'Sat Jan 22 13:42:59 +0000 2011',
  u'default_profile': False,
  u'default_profile_image': False,
  u'description': u'example description',
  u'favourites_count': 100,
  u'follow_request_sent': None,
  u'followers_count': 100,
  u'following': None,
  u'friends_count': 100,
  u'geo_enabled': True,
  u'id': placeholder_id,
  u'id_str': u'placeholder_id',
  u'is_translator': False,
  u'lang': u'en',
  u'listed_count': 0,
  u'location': u'example place',
  u'name': u'example name',
  u'notifications': None,
  u'profile_background_color': u'000000',
  u'profile_background_image_url': u'http://a0.twimg.com/images/themes/theme19/bg.gif',
  u'profile_background_image_url_https': u'https://si0.twimg.com/images/themes/theme19/bg.gif',
  u'profile_background_tile': False,
  u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/241527685/1363314054',
  u'profile_image_url': u'http://a0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
  u'profile_image_url_https': u'https://si0.twimg.com/profile_images/378800000038841219/8a71d0776da0c48dcc4ef6fee9f78880_normal.jpeg',
  u'profile_link_color': u'000000',
  u'profile_sidebar_border_color': u'FFFFFF',
  u'profile_sidebar_fill_color': u'000000',
  u'profile_text_color': u'000000',
  u'profile_use_background_image': False,
  u'protected': False,
  u'screen_name': u'placeholder screen_name',
  u'statuses_count': xxxx,
  u'time_zone': u'placeholder time_zone',
  u'url': None,
  u'utc_offset': -21600,
  u'verified': False}}

Now, as far as I understand, pandas' main data structure - a spreadsheet-like table - is called DataFrame. How can I load the data from my "tweets" collection into pandas' DataFrame? And how can I query for a subdocument within the database?

user2161725
  • There ought to be a way to do this using read_json, which would be more efficient (especially for large datasets). – Andy Hayden Jul 23 '13 at 10:22

5 Answers

57

Convert the cursor you get back from MongoDB into a list before passing it to DataFrame:

import pandas as pd
df = pd.DataFrame(list(tweets.find()))
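
If you only need a subdocument such as the hashtags under entities, you can pass a projection to find() so that only those fields end up in the DataFrame. A minimal sketch, using the field names from the document structure in the question:

import pandas as pd

# project only the fields of interest; exclude _id explicitly
cursor = tweets.find({}, {'entities.hashtags': 1, 'user.screen_name': 1, '_id': 0})
df = pd.DataFrame(list(cursor))

The projected fields still come back as nested dicts (in 'entities' and 'user' columns); the json_normalize approach in the next answer can flatten them into separate columns.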
waitingkuo
  • Great, by passing "df" the documents of the collection are brought up in a data column. However, I need to query for a subdocument - "hashtags.text" - in one of the documents, "entities". Any idea how I can do this from within pandas? – user2161725 Jul 23 '13 at 09:03
  • Can you show some examples of your documents so that I can help? – waitingkuo Jul 23 '13 at 09:09
  • What do you need? The hashtags field? – waitingkuo Jul 23 '13 at 09:23
  • Yes, I am interested in the hashtags field. – user2161725 Jul 23 '13 at 09:33
  • What's your expected output? What are your columns and rows? – waitingkuo Jul 23 '13 at 09:36
  • I'd like pandas to give me an overview of who tweets with hashtags and how many times they have done it. One way to do this would be to have a dataframe structure with the names of users ("screen_name") in the rows, and the frequency/count of hashtagged tweets in the columns. – user2161725 Jul 23 '13 at 09:42
  • Perhaps you should post a new problem about what you've just asked. Provide some sample examples and the expected output format. – waitingkuo Jul 23 '13 at 11:00
  • Is there a way to avoid the list() constructor and use the generator directly? – Frederic Bazin Apr 22 '15 at 17:35
  • I have a collection with 283000 rows, each with 10 columns (5 doubles, 2 Longs, 2 Strings and 1 ISODate). It takes 3-5 seconds to give me the DataFrame. I expected this to take around zero seconds. I see that `list()` takes most of the time. Is that expected or do I have some bad configuration somewhere? (FYI I am reading the whole collection, i.e., using `find()`) – Mahdi May 23 '16 at 11:52
  • @Mahdi I'm having the same problem – Jérémy Talbot-Pâquet Nov 27 '19 at 22:41
35

If you have data in MongoDB like this:

[
    {
        "name": "Adam", 
        "age": 27, 
        "address":{
            "number": 4, 
            "street": "Main Road", 
            "city": "Oxford"
        }
     },
     {
        "name": "Steve", 
        "age": 32, 
        "address":{
            "number": 78, 
            "street": "High Street", 
            "city": "Cambridge"
        }
     }
]

You can put the data straight into a dataframe like this:

from pandas import DataFrame

df = DataFrame(list(db.collection_name.find({})))

And you will get this output:

df.head()

|    | name    | age  | address                                                       |
|----|---------|------|---------------------------------------------------------------|
| 0  | "Adam"  | 27   | {"number": 4, "street": "Main Road", "city": "Oxford"}       |
| 1  | "Steve" | 32   | {"number": 78, "street": "High Street", "city": "Cambridge"} |

However, the subdocuments will just appear as nested JSON inside a single cell. If you want to flatten them so that the subdocument properties become individual columns, you can use json_normalize without any parameters.

from pandas.io.json import json_normalize

datapoints = list(db.collection_name.find({}))

df = json_normalize(datapoints)

df.head()

This will give the dataframe in this format:

|    | name   | age  | address.number | address.street | address.city |
|----|--------|------|----------------|----------------|--------------|
| 0  | Adam   | 27   |     4          | "Main Road"    | "Oxford"     |
| 1  | Steve  | 32   |     78         | "High Street"  | "Cambridge"  |
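
In pandas 1.0 and later, json_normalize is also exposed directly as pd.json_normalize, and the import from pandas.io.json is deprecated. If you want to flatten a nested list such as the hashtags from the question, record_path plus meta is one way to do it. A sketch, assuming the tweets collection from the question and the usual Twitter payload where each hashtag is a dict with a text field:

import pandas as pd

datapoints = list(db.tweets.find({}))

# one row per hashtag, keeping the author's screen_name alongside it;
# documents with an empty hashtags list contribute no rows
hashtag_df = pd.json_normalize(
    datapoints,
    record_path=['entities', 'hashtags'],
    meta=[['user', 'screen_name']],
)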
Mark Unsworth
  • If we don't want to mention the collection name, how do we fetch data for all collections? – TB.M Jan 03 '18 at 07:20
  • Will this work for like several GB of data from MongoDB ? or does the Pandas Dataframe suffer and we need to try another approach ? Like I have a tweets JSON data of almost 15 GB imported into MongoDB and I am trying to get it into a CSV. – Rahul Saini Jun 28 '19 at 09:30
  • Traceback File "C:\DEV\Python\lib\site-packages\pymongo\network.py", line 235, in _receive_data_on_socket buf = bytearray(length) MemoryError – Rahul Saini Jun 28 '19 at 09:56
  • result_df = pd.json_normalize(data=pymongo_collection.find()) works just fine without converting/reading the pymongo cursor to a list or tuple. – Alex Glukhovtsev Apr 01 '21 at 13:22
  • Loading a large collection leads to memory issue. Is there a way I can pull data in small batches? – Ravi Jain Jan 12 '23 at 09:31
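
On the memory questions in these comments: one way to avoid materializing the whole collection at once is to read the cursor in fixed-size batches and either concatenate the pieces or append them to a file as you go. A rough sketch (the iter_dataframes helper and the 10,000-document chunk size are only illustrative):

import pandas as pd

def iter_dataframes(collection, query=None, chunk_size=10000):
    """Yield DataFrames built from fixed-size batches of documents."""
    batch = []
    for doc in collection.find(query or {}):
        batch.append(doc)
        if len(batch) == chunk_size:
            yield pd.DataFrame(batch)
            batch = []
    if batch:
        yield pd.DataFrame(batch)

# Example: stream a large collection to CSV without holding it all in memory.
# Note: this assumes all documents share the same fields; otherwise the CSV
# columns can drift between chunks.
first = True
for chunk in iter_dataframes(db.tweets, chunk_size=10000):
    chunk.to_csv('tweets.csv', mode='w' if first else 'a', header=first, index=False)
    first = False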
5

You can load your MongoDB data into a pandas DataFrame using this code. It works for me.

import pymongo
import pandas as pd
from pymongo import Connection
connection = Connection()
db = connection.database_name
input_data = db.collection_name
data = pd.DataFrame(list(input_data.find()))
saimadhu.polamuri
  • Here we mention the collection name. If we don't want to mention the collection name, how can we achieve that? – TB.M Jan 03 '18 at 07:05
1

Use:

df = pd.DataFrame.from_dict(collection)
  • There is a lack of context. For example, if I try it as you say without any context, I get the following error: "TypeError: 'Collection' object is not iterable". – Sorul Nov 23 '21 at 08:57
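
For context, from_dict expects an iterable of records rather than the pymongo Collection object itself, which is what triggers the TypeError in the comment above. A working variant, assuming a db handle as in the other answers:

import pandas as pd

# materialize the cursor first; the Collection object itself is not iterable
df = pd.DataFrame.from_dict(list(db.collection_name.find()))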
1

This is the simplest technique to achieve your aim.

import pymongo
import pandas as pd
from pymongo import Connection
conn = Connection()
db = conn.your_database_name
input_data = db.your_collection_name
pandas_data_frame = pd.DataFrame(list(input_data.find()))
print(pandas_data_frame)