126

I have a large amount of data in a MongoDB collection that I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are floats.

Sample Data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
shx2
Nithin
  • Using [a custom field type](https://gist.github.com/jdthorpe/93145e8093258a3b73b2bd458533176d) with MongoEngine can make storing and retrieving Pandas DataFrames as simple as `mongo_doc.data_frame = my_pandas_df` – Jthorpe Jul 15 '17 at 16:04

15 Answers

173

pymongo might give you a hand; the following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
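
For example (the database, collection, and query below are placeholders for illustration):

df = read_mongo('mydb', 'mycollection', query={'reportCount': {'$gt': 6}})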
starball
waitingkuo
  • Thanks, this is the method I ended up using. I also had an array of embedded documents in each row, so I had to iterate over that as well within each row. Is there a better way to do this? – Nithin Apr 29 '13 at 06:42
  • Is it possible to provide some samples of your mongodb's structure? – waitingkuo Apr 29 '13 at 07:01
  • Look at the edit for a sample data row. An array of the embedded document "Reading" is stored inside Readings. Right now I query for multiple records and then iterate through each reading in the Readings array for every record. Is there an easier way to import the data in my case? – Nithin Apr 29 '13 at 10:43
  • 3
    Note the `list()` inside `df = pd.DataFrame(list(cursor))` evaluates as a list or generator, to keep the CPU cool. If you have a zillionty-one data items, and the next few lines would have reasonably partitioned, level-of-detailed, and clipped them, the whole shmegegge is still safe to drop in. Nice. – Phlip Sep 22 '15 at 13:33
  • 3
    It's very slow @ `df = pd.DataFrame(list(cursor))`. Pure db querying is much faster. Could we change the `list` casting to something else? – Peter.k Jan 19 '19 at 19:53
  • 2
    @Peter that line also caught my eyes. Casting a database cursor, which is designed to be iterable and potentially wraps large amounts of data, into an in-memory list does not seem clever to me. – Rafa Viotti Apr 26 '19 at 22:26
  • I have a cursor with ~150k rows and the list(cursor) alone takes 20s while the query takes 0.01s. Is there a faster way to load mongodb data into a pandas dataframe? – Jérémy Talbot-Pâquet Nov 27 '19 at 22:47
  • @Peter.k @Rafa Use iter: `pd.DataFrame(iter(cursor))`. Unless you want to use something like dask to never have *all* the data loaded into memory (for the most part). – Noah May Feb 07 '21 at 06:19
52

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully it will for you too.

import pandas as pd
from pymongo import MongoClient

client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
orluke
saimadhu.polamuri
27

As per PEP 20 (the Zen of Python), simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include query conditions just as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc., as shown in the sketch below.

and voila!
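
For instance, a minimal sketch of passing a filter and using find_one() (the database and collection names here are placeholders; the fields come from the question's sample data):

import pandas as pd
from pymongo import MongoClient

db = MongoClient().mydb  # assumes a local MongoDB instance

# Only the documents matching the filter end up in the DataFrame
df = pd.DataFrame.from_records(db.mycollection.find({'reportCount': {'$gt': 6}}))

# find_one() returns a single document as a dict, not a cursor
doc = db.mycollection.find_one({'sensorName': '56847890-0'})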

Cy Bu
  • pd.DataFrame.from_records seems to be as slow as DataFrame(list()), but the results are very inconsistent. %%time showed anything from 800 ms to 1.9 s – Andrei Drynov May 14 '19 at 14:35
  • 2
    This isn't good for huge record sets, as it doesn't raise a memory error but instead hangs the system when the data is too big, while pd.DataFrame(list(cursor)) raises a memory error. – Amulya Acharya Jun 25 '19 at 03:45
25

Monary does exactly that, and it's super fast.

See this cool post which includes a quick tutorial and some timings.

shx2
  • Does Monary support string data type ? – Snehal Parmar Jan 01 '15 at 07:44
  • I tried Monary, but it is taking a lot of time. Am I missing some optimization? I tried `client = Monary(host, 27017, database="db_tmp") columns = ["col1", "col2"] data_type = ["int64", "int64"] arrays = client.query("db_tmp", "coll", {}, columns, data_type)` For `50000` records it takes around `200s`. – nishant Nov 27 '17 at 11:09
  • That sounds extremely slow... Frankly, I don't know what the status of this project is, now, 4 years later... – shx2 Nov 27 '17 at 13:54
17

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

(or json_normalize(list(cursor)), depending on your python/pandas versions).

This way you get the unfolding of nested mongodb documents for free.
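
For the sample document in the question, a sketch of flattening the embedded Readings array while carrying the parent fields along via meta (my_collection is a placeholder name):

from pandas.io.json import json_normalize  # pd.json_normalize in newer pandas versions

docs = list(my_collection.find())
df = json_normalize(docs, record_path='Readings', meta=['sensorName', 'reportCount'])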

Skippy le Grand Gourou
Ikar Pohorský
  • 2
    I got an error with this method `TypeError: data argument can't be an iterator` – Gabriel Fair Apr 02 '18 at 11:50
  • 2
    Strange, this works on my python `3.6.7` using pandas `0.24.2`. Maybe you can try `df = json_normalize(list(cursor))` instead? – Ikar Pohorský Jun 24 '19 at 08:20
  • +1. Per the docs, the max_level argument defines the max level of dict depth. I just made a test and it's not true, so some columns would need to be split with .str accessors. Still, a very nice feature for working with MongoDB. – Mauricio Maroto Jun 11 '20 at 22:53
15
import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)
Baum mit Augen
fengwt
10

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.

A few useful articles to start off:

And an article which shows what amazing things are possible with Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 Gb of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.

Dennis Golomazov
  • 1
    I've also written a [post](https://github.com/goldan/data-science/blob/master/notebooks/dask-performance.ipynb) using Jupyter Notebook with an example how Dask helps to speedup execution even on a data fitting into memory by using multiple cores on a single machine. – Dennis Golomazov Sep 27 '16 at 23:53
  • Looks like blaze is deprecated. – wordsforthewise Jan 20 '23 at 20:03
6

Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end:

import pandas as pd


def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency.
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        # Flush a chunk every chunk_size records
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
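
Hypothetical usage with a pymongo cursor (database and collection names are made up):

from pymongo import MongoClient

collection = MongoClient().mydb.mycollection
df = iterator2dataframes(collection.find(), chunk_size=10000)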
Deo Leung
5

You can also use pymongoarrow -- it's an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.
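
A minimal sketch, assuming pymongoarrow's patch_all()/find_pandas_all() API (database and collection names are placeholders):

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all() and friends to pymongo Collection objects

collection = MongoClient().mydb.mycollection
df = collection.find_pandas_all({})  # returns a pandas DataFrame directly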

  • this library is almost useless due to very limited type support, it does not even support str. – Wang Jul 11 '21 at 13:48
3

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
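
A hedged sketch of that route (the database, collection, field, and file names below are illustrative):

# Shell: export selected fields to CSV first, e.g.
#   mongoexport --db mydb --collection mycollection --type=csv \
#       --fields sensorName,reportCount,latestReportTime --out sensors.csv
import pandas as pd

df = pd.read_csv('sensors.csv')

# Or export to JSON (mongoexport writes one document per line by default)
# and load the records as the answer suggests:
#   mongoexport --db mydb --collection mycollection --out sensors.json
# import json
# with open('sensors.json') as f:
#     df = pd.DataFrame.from_records(json.loads(line) for line in f)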

jtlz2
Jeff
2

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

If your data is very large, you can run an aggregate query first, filtering out the data you do not want and mapping it to your desired columns.

Here is an example of mapping Readings.a to column a and filtering by the reportCount column:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.

pakallis
1

Following the great answer by waitingkuo, I would like to add the possibility of doing that with a chunksize, in line with .read_sql() and .read_csv(). I extend the answer from Deo Leung by avoiding going one by one over each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.

import pandas as pd
from pymongo import MongoClient


def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and store into a DataFrame in chunks """

    # Connect to MongoDB
    client = MongoClient(host=host, port=port)
    # Make a query to the specific DB and Collection
    db_aux = client[db]

    # Some variables to create the chunks
    n_docs = db_aux[collection].count_documents(query)
    skips_variable = range(0, n_docs, int(chunksize))
    if len(skips_variable) <= 1:
        skips_variable = [0, n_docs]

    # Iteration to create the dataframe in chunks.
    for i in range(1, len(skips_variable)):

        # Expand the cursor and construct the DataFrame
        df_aux = pd.DataFrame(list(db_aux[collection].find(query)[skips_variable[i-1]:skips_variable[i]]))

        if no_id and '_id' in df_aux.columns:
            del df_aux['_id']

        # Concatenate the chunks into a unique df
        if 'df' not in locals():
            df = df_aux
        else:
            df = pd.concat([df, df_aux], ignore_index=True)

    return df
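
Hypothetical usage (database, collection, and query are made up):

df = read_mongo('mydb', 'mycollection', query={'reportCount': {'$gt': 6}}, chunksize=1000)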
Rafael Valero
1

A similar approach to Rafael Valero, waitingkuo and Deo Leung, using pagination:

import pandas as pd


def read_mongo(
        db, collection, query=None,
        host='localhost', port=27017,
        username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB (reuses _connect_mongo from waitingkuo's answer)
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Default the query here to avoid the mutable-default-argument pitfall
    # (the article below, in Spanish, covers this common mistake)
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the page is empty)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df
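
Hypothetical usage, pulling the collection page by page and stitching the pages together (names are placeholders):

pages = []
page_num = 1
while True:
    page = read_mongo('mydb', 'mycollection', chunksize=1000, page_num=page_num)
    if page.empty:
        break
    pages.append(page)
    page_num += 1

df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()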
Jordy Cuan
1
  1. Start the MongoDB shell with: mongosh

  2. Scroll up in the shell until you see what mongo is connected to. It should look something like this: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4

  3. Copy and paste that connection string into MongoClient

  4. Here is the code:

from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')

mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df
user3062459
0

You can use the "pandas.json_normalize" method:

import pandas as pd

display(pd.json_normalize(x))
display(pd.json_normalize(x, record_path="Readings"))

It should display two tables, where x is your cursor or, for example, the sample document from the question:

from bson import ObjectId
def ISODate(st):
    return st

x = {
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
Ahmad Zuhair