62

What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?

Attempts

db.myCollection.insert(df.to_dict())

gave an error

InvalidDocument: documents must have only string keys, the key was Timestamp('2013-11-23 13:31:00', tz=None)


 db.myCollection.insert(df.to_json())

gave an error

TypeError: 'str' object does not support item assignment


 db.myCollection.insert({id: df.to_json()})

gave an error

InvalidDocument: documents must have only string keys, key was <built-in function id>


df

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount    150  non-null values
price     150  non-null values
tid       150  non-null values
dtypes: float64(2), int64(1)
Nyxynyx
  • what do you want to do afterwards? do you want one doc per record or one doc per dataframe? – alko Nov 23 '13 at 20:22
  • Each mongo record will have the fields `date`, `amount`, `price`, and tid. `tid` should be a unique field – Nyxynyx Nov 23 '13 at 21:01
  • you can convert the dataframe to a list of dicts with `records = json.loads(df.to_json(orient='records'))`; the result will look like `[{'c1': 1, 'c2': 1}, {'c1': 2, 'c2': 2}, {'c1': 3, 'c2': 3}]`, then just use `db.coll.insert_many(records)`. BTW, using `df.to_dict('records')` may run into a `TypeError` – Ferris May 22 '18 at 03:56
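A minimal sketch of what that last comment suggests, assuming `df` is the dataframe shown above and `db` is a PyMongo database handle:

import json

# orient='records' drops the Timestamp index entirely; call df.reset_index() first
# if you want to keep the timestamps as a field in each document
records = json.loads(df.to_json(orient='records'))
db.myCollection.insert_many(records)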

11 Answers

76

Here you have the quickest way, using the `insert_many` method from PyMongo 3 and the `'records'` parameter of the `to_dict` method:

db.collection.insert_many(df.to_dict('records'))
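For the dataframe in the question, whose index is a DatetimeIndex, a hedged variant that keeps the timestamps as a `date` field in each document could look like this (the collection name and the unnamed-index assumption are mine, not part of the answer):

# an unnamed DatetimeIndex becomes a column called 'index' after reset_index();
# rename it to 'date' so each document gets a proper date field
# (PyMongo stores pandas Timestamps as BSON datetimes, millisecond precision)
records = df.reset_index().rename(columns={'index': 'date'}).to_dict('records')
db.collection.insert_many(records)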
dieguico
  • This is the best idea imo, although I don't think the syntax is going to work for the original use case. The basic problem is that mongo needs string keys, whereas your df has a Timestamp index. You need to use the parameters passed to `to_dict()` to make the keys in mongo be something other than dates. A frequent use case that I've had is where you actually want each row in the df to be a record with an additional 'date' field. – Marshall Farrier Feb 16 '16 at 20:43
  • You should correct the code snippet to include the collection. – hui chen Feb 21 '20 at 14:37
  • This doesn't preserve any data types though does it? Example {'numfield': NumberLong("16797951")} – NealWalters Nov 16 '20 at 18:32
48

I doubt there is a method that is both quickest and simple. If you don't worry about data conversion, you can do

>>> import json
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
                           A
1 2013-11-23 21:14:34.118531

>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)

But in case you try to load data back, you'll get:

>>> df = read_mongo(db, 'myCollection')
>>> df
                     A
0  1385241274118531000
>>> df.dtypes
A    int64
dtype: object

so you'll have to convert the 'A' column back to datetimes, as well as any fields in your DataFrame that are not int, float or str. For this example:

>>> df['A'] = pd.to_datetime(df['A'])
>>> df
                           A
0 2013-11-23 21:14:34.118531
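`read_mongo` is not defined in the answer; a minimal sketch of such a helper, assuming it simply loads every document of the collection into a frame, might be:

import pandas as pd

def read_mongo(db, collection_name):
    """Hypothetical helper: load a whole MongoDB collection into a DataFrame."""
    cursor = db[collection_name].find()
    df = pd.DataFrame(list(cursor))
    if '_id' in df.columns:  # drop Mongo's ObjectId column if it is not wanted
        df = df.drop(columns=['_id'])
    return df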
alko
  • `db.myCollection.insert(records)` should be replaced by `db.myCollection.insert_many(records)`, see the warning `DeprecationWarning: insert is deprecated. Use insert_one or insert_many instead.` – Femto Trader Dec 24 '15 at 17:45
9

odo can do it using

odo(df, db.myCollection)
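The answer leaves out the import; a runnable sketch, assuming odo is installed and `db.myCollection` is a PyMongo collection object (odo also accepts a MongoDB URI instead of a collection object):

from odo import odo  # pip install odo

# df is the dataframe from the question, db.myCollection a pymongo collection
odo(df, db.myCollection)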
Femto Trader
  • I really like `odo`, but it fails terribly when the mongo uri has non alpha username, passwd. I wouldn't recommend it for anything but using an unauthenticated mongo. – armundle Aug 25 '16 at 04:17
  • I think odo's development has halted or been delayed recently, as of 2019-ish. – wordsforthewise Jun 03 '19 at 03:30
5

I think there are cool ideas in this question. In my case I have been spending more time handling the movement of large dataframes. In those cases pandas tends to give you the option of a chunksize (for example in pandas.DataFrame.to_sql), so I think I can contribute here by adding the function I use for that purpose.

from pymongo import MongoClient


def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    Take a dataframe and write it to a MongoDB collection in chunks
    (you provide the database name, collection name, server and port).

    Parameters
    ----------
    my_df : the dataframe to send to MongoDB
    database_name : database name
    collection_name : collection name (to create)
    server : server where the MongoDB database is hosted
        Example: server = 'XXX.XXX.XX.XX'
    mongodb_port : the port where the database is operating
        Example: mongodb_port = 27017
    chunk_size : number of records sent to the database at the
        same time. Default is 100.

    Output
    ------
    Prints "Done" when finished.

    FUTURE modifications:
    1. Write to SQL
    2. Write to csv

    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]

    # To write
    collection.delete_many({})  # destroy the collection first
    # my_df = my_df.drop_duplicates(subset=None, keep='last')  # to avoid repetitions
    my_list = my_df.to_dict('records')
    l = len(my_list)
    steps = list(range(chunk_size, l, chunk_size)) + [l]

    # Insert chunks of the dataframe
    i = 0
    for j in steps:
        print(j)
        collection.insert_many(my_list[i:j])  # fill the collection
        i = j

    print('Done')
    return
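A hedged usage example; the toy dataframe and the database/collection names below are placeholders, assuming a MongoDB instance reachable on localhost:27017:

import pandas as pd

df = pd.DataFrame({'amount': [0.5, 1.2], 'price': [815.0, 816.5], 'tid': [1, 2]})

write_df_to_mongoDB(df,
                    database_name='mydatabasename',
                    collection_name='mycollectionname',
                    server='localhost',
                    mongodb_port=27017,
                    chunk_size=100)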
Rafael Valero
4

If your dataframe has missing data (i.e. `None`, `nan`) and you don't want null key values in your documents:

`db.insert_many(df.to_dict("records"))` will insert keys with null values. If you don't want the empty key values in your documents, you can use a modified version of the pandas `.to_dict("records")` code below:

from pandas.core.common import _maybe_box_datetimelike
my_list = [dict((k, _maybe_box_datetimelike(v)) for k, v in zip(df.columns, row) if v != None and v == v) for row in df.values]
db.insert_many(my_list)

The `if v != None and v == v` checks I've added make sure the value is not `None` or `nan` before putting it into the row's dictionary (NaN is the only value that is not equal to itself). Now your `.insert_many` will only include keys with values in the documents (and no null data types).
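`_maybe_box_datetimelike` is a private pandas helper and has been moved/removed in later pandas releases; a sketch that relies only on the public API (assuming `pd.notnull` is enough to filter out both `None` and `nan`) would be:

import pandas as pd

# build one dict per row, keeping only keys whose value is neither None nor NaN
my_list = [{k: v for k, v in record.items() if pd.notnull(v)}
           for record in df.to_dict('records')]
db.insert_many(my_list)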

Radical Edward
  • This is a good way because dealing with null values is indeed necessary when uploading a dataframe to mongodb, and this method is faster than `DataFrame.to_dict()`. BTW, `columns = list(df.columns)`, then `[{k: _maybe_box_datetimelike(v) for k, v in zip(columns, row) if v != None and v == v} for row in df.values]` is even faster. – Woods Chen Dec 31 '18 at 06:11
4

I use the following to insert the dataframe into a collection in the database.

df.reset_index(inplace=True)
data_dict = df.to_dict("records")
myCollection.insert_many(data_dict)
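Since the question also wants `tid` to be a unique field, a hedged addition using PyMongo's standard `create_index` (assuming `myCollection` is the same collection object as above):

# enforce uniqueness of tid; insert_many will then raise BulkWriteError on duplicates
myCollection.create_index('tid', unique=True)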
toshi456
  • Question, if my data_dict has, say 1M rows, is it better to do one single insertMany? or should one do "batches" of insertMany? – lesolorzanov May 23 '23 at 16:47
2

how about this:

db.myCollection.insert({id: df.to_json()})

Here `id` is meant to be a unique string that you generate for that df (see the sketch below).
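A minimal sketch of that idea, assuming you generate the string id yourself (the choice of `uuid` here is an assumption, not part of the answer):

import uuid

doc_id = uuid.uuid4().hex  # any unique string will do
db.myCollection.insert_one({'_id': doc_id, 'df': df.to_json()})

# later, fetch the JSON string back by that id
stored = db.myCollection.find_one({'_id': doc_id})['df']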

PasteBT
  • Thanks, I get the error `InvalidDocument: documents must have only string keys, key was ` – Nyxynyx Nov 23 '13 at 20:21
  • you have to generate that id by youself – PasteBT Nov 23 '13 at 20:22
  • Is this id the same as the usual `_.id` in mongo documents? If so, it looks like a random hash, how do I generate it? – Nyxynyx Nov 23 '13 at 21:04
  • It fails for @Nyxynyx since `id` is a built-in function in Python and overriding it is not recommended. You can generate a simple test id by using `id(df)`, but since object IDs aren't persistent across sessions this might cause you problems depending on how you use it. Works for testing though. – erb Apr 21 '14 at 11:05
  • I got `maximum recursion level reached` error. Fixed it with `sys.setrecursionlimit(1000000)` – Gabriel Fair Apr 24 '18 at 10:37
1

Just make string keys!

import json
import pandas as pd

# here `client` is assumed to be a pymongo collection object
dfData = json.dumps(df.to_dict('records'))
saveData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = client.insert_one(saveData)

##### load dfData
data = client.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
Jon Pan
1

If you want to send several records at once:

db.myCollection.insert_many(df.apply(lambda x: x.to_dict(), axis=1).to_list())
0

If you want to make sure that you're not raising InvalidDocument errors, then something like the following is a good idea. This is because mongo does not recognize types such as np.int64, np.float64, etc.

import numpy as np
from pymongo import MongoClient

client = MongoClient()
db = client.test
col = db.col


def createDocsFromDF(df, collection=None, insertToDB=False):
    docs = []
    for i in range(len(df)):
        # use positional indexing so this also works with a non-integer index
        doc = {c: df[c].iloc[i] for c in df.columns if c != 'index'}
        for key, val in doc.items():
            # we have to do this, because mongo does not recognize these np. types
            if type(val) == np.int64:
                doc[key] = int(val)
            if type(val) == np.float64:
                doc[key] = float(val)
            if type(val) == np.bool_:
                doc[key] = bool(val)
        docs.append(doc)
    if insertToDB and collection is not None:
        collection.insert_many(docs)
    return docs
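A hedged usage example, with a toy dataframe (numpy-typed columns) and the `col` collection created above:

import pandas as pd

df = pd.DataFrame({'amount': [0.5, 1.2], 'tid': [1, 2]})  # toy frame with np.float64 / np.int64 values
docs = createDocsFromDF(df, collection=col, insertToDB=True)
print(len(docs), 'documents prepared and inserted')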
bpbirch
0

For upserts this worked.

for r in df2.to_dict(orient="records"):
    db['utest-pd'].update_one({'a': r['a']}, {'$set': r}, upsert=True)

It does it one record at a time, but there didn't seem to be a way for a single bulk update to use a different filter value for each record; see the `bulk_write` sketch below.
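If you would rather send all the upserts in one round trip, PyMongo's `bulk_write` accepts a list of `UpdateOne` operations, each with its own filter; a minimal sketch:

from pymongo import UpdateOne

# one upsert operation per record, keyed on the 'a' column
ops = [UpdateOne({'a': r['a']}, {'$set': r}, upsert=True)
       for r in df2.to_dict(orient='records')]
db['utest-pd'].bulk_write(ops)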

citynorman