
I have a 0.7 GB MongoDB database containing tweets that I'm trying to load into a dataframe. However, I get an error.

MemoryError:    

My code looks like this:

from pandas import DataFrame

cursor = tweets.find()  # where tweets is my collection
tweet_fields = ['id']
result = DataFrame(list(cursor), columns=tweet_fields)

I've tried the methods in the following answers, but at some point they all create a list of every element in the database before loading it into a DataFrame.

However, another answer that talks about list() says it's only good for small data sets, because everything gets loaded into memory.

In my case, I think it's the source of the error. It's too much data to be loaded into memory. What other method can I use?


4 Answers


I've modified my code to the following:

cursor = tweets.find(fields=['id'])
tweet_fields = ['id']
result = DataFrame(list(cursor), columns = tweet_fields)

By adding the fields parameter to the find() call, I restricted the output so that only the selected fields, rather than every field, are loaded into the DataFrame. Everything works fine now.
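For what it's worth, in newer PyMongo versions (3.x and later) the fields keyword was renamed to projection; a minimal sketch of the equivalent call, assuming the same tweets collection:

from pandas import DataFrame

tweet_fields = ['id']
# projection replaces the old `fields` argument in PyMongo 3.x+
cursor = tweets.find({}, projection=tweet_fields)
result = DataFrame(list(cursor), columns=tweet_fields)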


The fastest, and likely most memory-efficient, way to create a DataFrame from a MongoDB query like yours would be to use Monary.

This post has a nice and concise explanation.
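For reference, a rough sketch of what a Monary query might look like; the host, database, collection, field name, and type code are all assumptions for illustration, and the exact signature should be checked against the Monary docs:

from monary import Monary
import pandas as pd

monary = Monary("127.0.0.1")  # connect to a local MongoDB
columns = ["id"]
# query(db, collection, query_spec, fields, types) returns one numpy
# (masked) array per requested column
arrays = monary.query("twitter_db", "tweets", {}, columns, ["int64"])

df = pd.DataFrame({col: arr for col, arr in zip(columns, arrays)})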

  • I've seen this pushed on stackoverflow several times now, but it has almost always been misused. If you check the API, it seems to be pretty clear that the only datatypes allowed are numerical *not text*. In this case, I assume (due to the inclusion of the term "tweet") that this is *not* the proper use case of monary. – chase Aug 28 '19 at 04:31

The from_records classmethod is probably the best way to do it:

import pandas as pd
import pymongo

client = pymongo.MongoClient()
data = client.mydb.mycollection.find()  # or client.mydb.mycollection.aggregate(pipeline)

df = pd.DataFrame.from_records(data)
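Since the original problem was a MemoryError, it may also be worth combining from_records with a projection and batched reads; a minimal sketch, reusing the mydb/mycollection names from the code above (the batch size is arbitrary):

import pandas as pd
import pymongo

client = pymongo.MongoClient()
collection = client.mydb.mycollection

# project only the fields you need so each document stays small
cursor = collection.find({}, {'_id': 0, 'id': 1})

# build the frame in batches instead of one huge list
chunks, batch = [], []
for doc in cursor:
    batch.append(doc)
    if len(batch) == 100000:
        chunks.append(pd.DataFrame.from_records(batch))
        batch = []
if batch:
    chunks.append(pd.DataFrame.from_records(batch))

df = pd.concat(chunks, ignore_index=True)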

An elegant way of doing it would be as follows:

import pandas as pd

def my_transform_logic(x):
    # placeholder for whatever per-value processing you need
    if x:
        result = do_something(x)
        return result

def process(cursor):
    df = pd.DataFrame(list(cursor))
    df['result_col'] = df['col_to_be_processed'].apply(my_transform_logic)

    # write the processed rows back as a list of dictionaries
    db.collection_name.insert_many(df.to_dict('records'))

    # to update existing documents instead, build a filter/update pair per
    # record (e.g. with bulk_write and UpdateOne) rather than update_many


# make a list of cursors; see the parallel_scan API of pymongo
cursors = mongo_collection.parallel_scan(6)
for cursor in cursors:
    process(cursor)

I tried the above process on a MongoDB collection with 2.6 million records, using Joblib to parallelize the code. It didn't throw any memory errors and the processing finished in 2 hours.
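The Joblib call itself isn't shown above; a minimal sketch of how the cursors could be fanned out over the process function (prefer='threads' is my assumption, so the PyMongo cursors don't have to be pickled):

from joblib import Parallel, delayed

# one worker per cursor; threads avoid pickling the cursor objects
cursors = mongo_collection.parallel_scan(6)
Parallel(n_jobs=6, prefer="threads")(
    delayed(process)(cursor) for cursor in cursors
)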
