126

I have a large amount of data in a MongoDB collection that I need to analyze. How do I import that data into pandas?

I am new to pandas and numpy.

EDIT: The MongoDB collection contains sensor values tagged with date and time. The sensor values are floats.

Sample Data:

{
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
shx2
Nithin
  • Using [a custom field type](https://gist.github.com/jdthorpe/93145e8093258a3b73b2bd458533176d) with MongoEngine can make storing and retrieving Pandas DataFrames as simple as `mongo_doc.data_frame = my_pandas_df` – Jthorpe Jul 15 '17 at 16:04

15 Answers

173

pymongo might give you a hand; the following is some code I'm using:

import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """

    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)


    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df =  pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
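
For example (the database, collection, and query below are placeholders for illustration):

df = read_mongo('mydb', 'mycollection', query={'reportCount': {'$gt': 6}})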
starball
waitingkuo
  • Thanks, this is the method I ended up using. I also had an array of embedded documents in each row, so I had to iterate over that as well within each row. Is there a better way to do this? – Nithin Apr 29 '13 at 06:42
  • Is it possible to provide some samples of your mongodb's structure? – waitingkuo Apr 29 '13 at 07:01
  • Look at the edit for a sample data row. An array of the embedded document "Reading" is stored inside Readings. Right now I query for multiple records and then iterate through each reading in the Readings array for every record. Is there an easier way to import the data in my case? – Nithin Apr 29 '13 at 10:43
  • 3
    Note the `list()` inside `df = pd.DataFrame(list(cursor))` evaluates as a list or generator, to keep the CPU cool. If you have a zillionty-one data items, and the next few lines would have reasonably partitioned, level-of-detailed, and clipped them, the whole shmegegge is still safe to drop in. Nice. – Phlip Sep 22 '15 at 13:33
  • 3
    It's very slow @ `df = pd.DataFrame(list(cursor))`. Pure db querying is much faster. Could we change the `list` casting to something else? – Peter.k Jan 19 '19 at 19:53
  • 2
    @Peter that line also caught my eyes. Casting a database cursor, which is designed to be iterable and potentially wraps large amounts of data, into an in-memory list does not seem clever to me. – Rafa Viotti Apr 26 '19 at 22:26
  • I have a cursor with ~150k rows and the list(cursor) alone takes 20s while the query takes 0.01s. Is there a faster way to load mongodb data into a pandas dataframe? – Jérémy Talbot-Pâquet Nov 27 '19 at 22:47
  • @Peter.k @Rafa Use iter: `pd.DataFrame(iter(cursor))`. Unless you want to use something like dask to never have *all* the data loaded into memory (for the most part). – Noah May Feb 07 '21 at 06:19
52

You can load your MongoDB data into a pandas DataFrame using this code. It works for me; hopefully it will for you too.

import pandas as pd
from pymongo import MongoClient

client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
orluke
saimadhu.polamuri
27

As per PEP 20 (the Zen of Python), simple is better than complex:

import pandas as pd
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())

You can include query conditions just as you would when working with a regular MongoDB database, or even use find_one() to get only one element from the database, etc., as shown in the sketch below.

and voila!
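
For instance, a minimal sketch of passing a filter and using find_one() (the database and collection names here are placeholders; the fields come from the question's sample data):

import pandas as pd
from pymongo import MongoClient

db = MongoClient().mydb  # assumes a local MongoDB instance

# Only the documents matching the filter end up in the DataFrame
df = pd.DataFrame.from_records(db.mycollection.find({'reportCount': {'$gt': 6}}))

# find_one() returns a single document as a dict, not a cursor
doc = db.mycollection.find_one({'sensorName': '56847890-0'})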

Cy Bu
  • pd.DataFrame.from_records seems to be as slow as DataFrame(list()), but the results are very inconsistent. %%time showed anything from 800 ms to 1.9 s – Andrei Drynov May 14 '19 at 14:35
  • 2
    This isn't good for huge record sets, as it doesn't raise a memory error but instead hangs the system when the data is too big, while pd.DataFrame(list(cursor)) raises a memory error. – Amulya Acharya Jun 25 '19 at 03:45
25

Monary does exactly that, and it's super fast.

See this cool post which includes a quick tutorial and some timings.

shx2
  • Does Monary support string data type ? – Snehal Parmar Jan 01 '15 at 07:44
  • I tried Monary, but it is taking a lot of time. Am I missing some optimization? I tried `client = Monary(host, 27017, database="db_tmp") columns = ["col1", "col2"] data_type = ["int64", "int64"] arrays = client.query("db_tmp", "coll", {}, columns, data_type)` For `50000` records it takes around `200s`. – nishant Nov 27 '17 at 11:09
  • That sounds extremely slow... Frankly, I don't know what the status of this project is, now, 4 years later... – shx2 Nov 27 '17 at 13:54
17

Another option I found very useful is:

from pandas.io.json import json_normalize

cursor = my_collection.find()
df = json_normalize(cursor)

(or json_normalize(list(cursor)), depending on your python/pandas versions).

This way you get the unfolding of nested mongodb documents for free.
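
For the sample document in the question, a sketch of flattening the embedded Readings array while carrying the parent fields along via meta (my_collection is a placeholder name):

from pandas.io.json import json_normalize  # pd.json_normalize in newer pandas versions

docs = list(my_collection.find())
df = json_normalize(docs, record_path='Readings', meta=['sensorName', 'reportCount'])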

Skippy le Grand Gourou
Ikar Pohorský
  • 2
    I got an error with this method `TypeError: data argument can't be an iterator` – Gabriel Fair Apr 02 '18 at 11:50
  • 2
    Strange, this works on my python `3.6.7` using pandas `0.24.2`. Maybe you can try `df = json_normalize(list(cursor))` instead? – Ikar Pohorský Jun 24 '19 at 08:20
  • +1. Per the docs, the max_level argument defines the max level of dict depth. I just made a test and it's not true, so some columns would need to be split with .str accessors. Still, a very nice feature for working with MongoDB. – Mauricio Maroto Jun 11 '20 at 22:53
15
import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)
Baum mit Augen
fengwt
10

For dealing with out-of-core (not fitting into RAM) data efficiently (i.e. with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.

Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.

A few useful articles to start off:

And an article which shows what amazing things are possible with Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 Gb of Reddit comments in seconds).

P.S. I'm not affiliated with any of these technologies.

Dennis Golomazov
  • 1
    I've also written a [post](https://github.com/goldan/data-science/blob/master/notebooks/dask-performance.ipynb) using Jupyter Notebook with an example how Dask helps to speedup execution even on a data fitting into memory by using multiple cores on a single machine. – Dennis Golomazov Sep 27 '16 at 23:53
  • Looks like blaze is deprecated. – wordsforthewise Jan 20 '23 at 20:03
6

Using

pandas.DataFrame(list(...))

will consume a lot of memory if the iterator/generator result is large.

It is better to generate small chunks and concatenate them at the end:

import pandas as pd


def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency.
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        # Flush a chunk every chunk_size records
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
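
Hypothetical usage with a pymongo cursor (database and collection names are made up):

from pymongo import MongoClient

collection = MongoClient().mydb.mycollection
df = iterator2dataframes(collection.find(), chunk_size=10000)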
Deo Leung
5

You can also use pymongoarrow -- it's an official library offered by MongoDB for exporting MongoDB data to pandas, NumPy, Parquet files, etc.
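
A minimal sketch, assuming pymongoarrow's patch_all()/find_pandas_all() API (database and collection names are placeholders):

from pymongo import MongoClient
from pymongoarrow.monkey import patch_all

patch_all()  # adds find_pandas_all() and friends to pymongo Collection objects

collection = MongoClient().mydb.mycollection
df = collection.find_pandas_all({})  # returns a pandas DataFrame directly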

  • this library is almost useless due to very limited type support, it does not even support str. – Wang Jul 11 '21 at 13:48
3

http://docs.mongodb.org/manual/reference/mongoexport

Export to CSV and use read_csv, or export to JSON and use DataFrame.from_records().
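
A hedged sketch of that route (the database, collection, field, and file names below are illustrative):

# Shell: export selected fields to CSV first, e.g.
#   mongoexport --db mydb --collection mycollection --type=csv \
#       --fields sensorName,reportCount,latestReportTime --out sensors.csv
import pandas as pd

df = pd.read_csv('sensors.csv')

# Or export to JSON (mongoexport writes one document per line by default)
# and load the records as the answer suggests:
#   mongoexport --db mydb --collection mycollection --out sensors.json
# import json
# with open('sensors.json') as f:
#     df = pd.DataFrame.from_records(json.loads(line) for line in f)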

jtlz2
Jeff
2

You can achieve what you want with pdmongo in three lines:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [], "mongodb://localhost:27017/mydb")

If your data is very large, you can run an aggregate query first, filtering out the data you do not want and mapping it to your desired columns.

Here is an example of mapping Readings.a to column a and filtering by the reportCount column:

import pdmongo as pdm
import pandas as pd
df = pdm.read_mongo("MyCollection", [{'$match': {'reportCount': {'$gt': 6}}}, {'$unwind': '$Readings'}, {'$project': {'a': '$Readings.a'}}], "mongodb://localhost:27017/mydb")

read_mongo accepts the same arguments as pymongo's aggregate.

pakallis
1

Following the great answer by waitingkuo, I would like to add the possibility of doing that with a chunksize, in line with .read_sql() and .read_csv(). I extend the answer from Deo Leung by avoiding going one by one over each 'record' of the 'iterator' / 'cursor'. I will borrow the previous read_mongo function.

import pandas as pd
from pymongo import MongoClient


def read_mongo(db,
               collection, query={},
               host='localhost', port=27017,
               username=None, password=None,
               chunksize=100, no_id=True):
    """ Read from Mongo and store into a DataFrame in chunks """

    # Connect to MongoDB
    client = MongoClient(host=host, port=port)
    # Make a query to the specific DB and Collection
    db_aux = client[db]

    # Some variables to create the chunks
    n_docs = db_aux[collection].count_documents(query)
    skips_variable = range(0, n_docs, int(chunksize))
    if len(skips_variable) <= 1:
        skips_variable = [0, n_docs]

    # Iteration to create the dataframe in chunks.
    for i in range(1, len(skips_variable)):

        # Expand the cursor and construct the DataFrame
        df_aux = pd.DataFrame(list(db_aux[collection].find(query)[skips_variable[i-1]:skips_variable[i]]))

        if no_id and '_id' in df_aux.columns:
            del df_aux['_id']

        # Concatenate the chunks into a unique df
        if 'df' not in locals():
            df = df_aux
        else:
            df = pd.concat([df, df_aux], ignore_index=True)

    return df
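
Hypothetical usage (database, collection, and query are made up):

df = read_mongo('mydb', 'mycollection', query={'reportCount': {'$gt': 6}}, chunksize=1000)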
Rafael Valero
1

A similar approach to Rafael Valero, waitingkuo and Deo Leung, using pagination:

import pandas as pd


def read_mongo(
        db, collection, query=None,
        host='localhost', port=27017,
        username=None, password=None,
        chunksize=100, page_num=1, no_id=True):

    # Connect to MongoDB (reuses _connect_mongo from waitingkuo's answer)
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Calculate number of documents to skip
    skips = chunksize * (page_num - 1)

    # Default the query here to avoid the mutable-default-argument pitfall
    # (the article below, in Spanish, covers this common mistake)
    # https://www.toptal.com/python/c%C3%B3digo-buggy-python-los-10-errores-m%C3%A1s-comunes-que-cometen-los-desarrolladores-python/es
    if not query:
        query = {}

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query).skip(skips).limit(chunksize)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id (the column is absent when the page is empty)
    if no_id and '_id' in df.columns:
        del df['_id']

    return df
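
Hypothetical usage, pulling the collection page by page and stitching the pages together (names are placeholders):

pages = []
page_num = 1
while True:
    page = read_mongo('mydb', 'mycollection', chunksize=1000, page_num=page_num)
    if page.empty:
        break
    pages.append(page)
    page_num += 1

df = pd.concat(pages, ignore_index=True) if pages else pd.DataFrame()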
Jordy Cuan
1
  1. Start the MongoDB shell with: mongosh

  2. Scroll up in the shell until you see what mongo is connected to. It should look something like this: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4

  3. Copy and paste that connection string into MongoClient

  4. Here is the code:

from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000&appName=mongosh+1.5.4')

mydatabase = client.yourdatabasename
mycollection = mydatabase.yourcollectionname
cursor = mycollection.find()
listofDocuments = list(cursor)
df = pd.DataFrame(listofDocuments)
df
user3062459
0

You can use the "pandas.json_normalize" method:

import pandas as pd

display(pd.json_normalize(x))
display(pd.json_normalize(x, record_path="Readings"))

It should display two tables, where x is your cursor or, for example, the sample document from the question:

from bson import ObjectId
def ISODate(st):
    return st

x = {
"_cls" : "SensorReport",
"_id" : ObjectId("515a963b78f6a035d9fa531b"),
"_types" : [
    "SensorReport"
],
"Readings" : [
    {
        "a" : 0.958069536790466,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"),
        "b" : 6.296118156595,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95574014778624,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"),
        "b" : 6.29651468650064,
        "_cls" : "Reading"
    },
    {
        "a" : 0.953648289182713,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"),
        "b" : 7.29679823731148,
        "_cls" : "Reading"
    },
    {
        "a" : 0.955931884300997,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"),
        "b" : 6.29642922525632,
        "_cls" : "Reading"
    },
    {
        "a" : 0.95821381,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"),
        "b" : 7.28956613,
        "_cls" : "Reading"
    },
    {
        "a" : 4.95821335,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"),
        "b" : 6.28956574,
        "_cls" : "Reading"
    },
    {
        "a" : 9.95821341,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"),
        "b" : 0.28956488,
        "_cls" : "Reading"
    },
    {
        "a" : 1.95667927,
        "_types" : [
            "Reading"
        ],
        "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"),
        "b" : 0.29115237,
        "_cls" : "Reading"
    }
],
"latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"),
"sensorName" : "56847890-0",
"reportCount" : 8
}
Ahmad Zuhair