
I am writing a script to time how long it takes to insert data from a CSV file into MongoDB. A 60 MB file of around 650,000 lines took ~9.5 seconds. I know using threads may decrease the run time, but I am new to threads and would love to get some help.

My code:

import timeit


def timeitImportContent():
    SETUP_CODE = '''
import pymongo
import csv
    '''

    TEST_CODE = '''
print("Attempting to connect to MongoDB.....")
client = pymongo.MongoClient('localhost', 27017)
collection = client['db']['myCollection']
print("Connection established.....")
print("Opening file at " + "path/to/my/file" + ".....")
csvFile = open("path/to/my/file", 'r')
print("Reading file.....")
data = csv.DictReader(csvFile)
print("Reading completed.....")
print("Inserting data into MongoDB")
collection.insert_many(data)
print("Successfully inserted data into MongoDB")
print("Attempting to close connection.....")
client.close()
print("Client disconnected")
print("Attempting to close CSV file.....")
csvFile.close()
print("CSV closed.....")
    '''

    times = timeit.timeit(setup = SETUP_CODE, stmt = TEST_CODE, number = 1)
    print("It took " + str(times) + " seconds to execute")


if __name__ == "__main__":
    timeitImportContent()
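Since my question is about threads: a hedged sketch of what I have in mind (the helper name `insert_chunks_threaded` and the worker count are my own, not from my script) is to split the rows into chunks and submit each chunk to `insert_many` from a thread pool. pymongo's `MongoClient` is thread-safe and keeps its own connection pool, so one client can be shared across workers:

```python
import concurrent.futures


def insert_chunks_threaded(collection, chunks, max_workers=4):
    """Submit each chunk (a list of dicts) to collection.insert_many in parallel.

    `collection` is any object with an insert_many(docs) method, e.g. a
    pymongo Collection. Errors raised by a worker are re-raised here.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(collection.insert_many, chunk) for chunk in chunks]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raise any insert error from the worker
```

Whether this actually helps presumably depends on where the time goes: threads can overlap network round trips to MongoDB, but they will not speed up the Python-side CSV parsing.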

My attempt at processing the file as a stream:

import csv
import pymongo


def getSingleRow(filename):
    with open(filename, 'r') as csv_file:
        data = csv.DictReader(csv_file)
        for row in data:
            yield row
        # the with-statement closes the file; no explicit close() needed


def getData(filename):
    print("Attempting to connect to MongoDB.....")
    client = pymongo.MongoClient('localhost', 27017)
    collection = client['donorschoose']['MapData']
    print("Connection established.....")
    print("Inserting data into MongoDB")
    for row in getSingleRow(filename):
        collection.insert_one(row)
    print("Successfully inserted data into MongoDB")
    print("Attempting to close connection.....")
    client.close()
    print("Connection disconnected")


if __name__ == "__main__":
    getData("path/to/file/name")
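A hedged variant of this stream attempt (the `batched` helper and the `batch_size` value are my own additions, not part of my original script): instead of one `insert_one` round trip per row, group the streamed rows into fixed-size chunks and hand each chunk to `insert_many`:

```python
import csv
import itertools


def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def getDataBatched(filename, batch_size=1000):
    # pymongo imported here so the batching helper is usable on its own
    import pymongo
    client = pymongo.MongoClient('localhost', 27017)
    collection = client['donorschoose']['MapData']
    with open(filename, newline='') as csv_file:
        for chunk in batched(csv.DictReader(csv_file), batch_size):
            collection.insert_many(chunk)
    client.close()
```

This keeps memory bounded (only one chunk of rows is held at a time) while avoiding a per-row network round trip.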
  • We have a tool called mongoimport that can be used to import CSV files; check [this](https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv) or search online for more. How much of the time did the insert take, and how much was spent on DictReader? – Takis Oct 14 '21 at 01:52
  • I used print() to see which step takes the longest. Everything happens almost instantly; it prints up to "Inserting data into MongoDB", then freezes a bit, then reports a successful import. – Kevin Truong Oct 14 '21 at 02:26
  • I need to use Pandas to process the data before importing it to Mongo – Kevin Truong Oct 14 '21 at 02:33
  • It looks like that 60MB is going into RAM.. You may need to process as a stream. I found this, which looks promising : https://stackoverflow.com/a/17444799/10431732 – Matt Oestreich Oct 14 '21 at 02:36
  • Thanks for your recommendation @MattOestreich. I don't have any problems reading the file; it happens almost instantly, but the line ```insert_many(data)``` takes a really long time to execute. Please review my attempt to process it as a stream. It took almost a minute to finish. – Kevin Truong Oct 14 '21 at 03:11
  • As an alternative, you may use the [mongoimport](https://docs.mongodb.com/database-tools/mongoimport/) tool. It should be faster, and you also get some stats at the end. – Wernfried Domscheit Oct 14 '21 at 08:44

0 Answers