
I have a very large pandas DataFrame that I am attempting to insert into MongoDB. The trouble is memory management. My code is below. I am using `insert_many` to load the entire frame into the DB in a single call, and that process uses a lot of memory. Is there a way to accomplish the same goal with less memory usage?

from time import time

import pymongo

start = time()
client = pymongo.MongoClient()
db = client.test_db
collection = db.collection
# Convert the whole DataFrame to a list of dicts and insert it in one call
collection.insert_many(data.to_dict('records'))
end = time()
print("Time to Populate DB:", end - start)
Jeff Saltfist
  • You can iterate over the DataFrame and only call `.to_dict('records')` on a sub-dataframe; there are many ways to achieve this (a sketch follows these comments). See this question: http://stackoverflow.com/questions/25699439/how-to-iterate-over-consecutive-chunks-of-pandas-dataframe-efficiently – Gustavo Bezerra May 21 '17 at 08:10
  • 1
  • However, if you are having memory issues, you should think about a strategy to create your DataFrame in chunks. You seem to be loading the whole DataFrame into memory before doing the MongoDB insert. For example, `pd.read_csv` has a `chunksize` option (a second sketch below shows this). The `pd.DataFrame.memory_usage` method is also useful. – Gustavo Bezerra May 21 '17 at 08:14
  • @GustavoBezerra - Both of those comments are very helpful. I will test them out. – Jeff Saltfist May 21 '17 at 08:40
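
Following the first comment, here is a minimal sketch of a chunked insert, assuming `data` is the DataFrame already in memory; the batch size of 10000 is an arbitrary value to tune against your memory budget:

import pymongo

client = pymongo.MongoClient()
collection = client.test_db.collection

chunk_size = 10000  # rows per batch; tune to your memory budget
for start_row in range(0, len(data), chunk_size):
    # Convert only this slice of the DataFrame to dicts, so only one
    # batch of records is materialized in memory at a time
    chunk = data.iloc[start_row:start_row + chunk_size]
    collection.insert_many(chunk.to_dict('records'))

This still keeps the full DataFrame in memory, but it avoids building a second full copy of the data as one giant list of dicts, which is likely where the spike in the original code comes from.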
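
And a sketch of the second comment's approach, assuming the data originally comes from a CSV file (the filename `data.csv` is a placeholder); `pd.read_csv` with `chunksize` returns an iterator of DataFrames, so the full frame never has to be built in memory at all:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
collection = client.test_db.collection

# chunksize makes read_csv yield DataFrames of 10000 rows each,
# so only one chunk is held in memory at any point
for chunk in pd.read_csv('data.csv', chunksize=10000):
    collection.insert_many(chunk.to_dict('records'))

The `memory_usage(deep=True)` method mentioned in the comment is useful for checking how much memory each chunk actually occupies.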

0 Answers