
I am writing a script to time how long it takes to insert data from a CSV file into MongoDB. A 60 MB file of around 650,000 lines took ~9.5 seconds. I know using threads may decrease the run time, but I am new to threads and would love to get some help.

My code:

import timeit


def timeitImportContent():
    SETUP_CODE = '''
import pymongo
import csv
    '''

    TEST_CODE = '''
print("Attempting to connect to MongoDB.....")
client = pymongo.MongoClient('localhost', 27017)
collection = client['db']['myCollection']
print("Connection established.....")
print("Opening file at " + "path/to/my/file" + ".....")
csvFile = open("path/to/my/file", 'r')
print("Reading file.....")
data = csv.DictReader(csvFile)
print("Reading completed.....")
print("Inserting data into MongoDB")
collection.insert_many(data)
print("Successfully inserted data into MongoDB")
print("Attempting to close connection.....")
client.close()
print("Client disconnected")
print("Attempting to close CSV file.....")
csvFile.close()
print("CSV closed.....")
    '''

    times = timeit.timeit(setup = SETUP_CODE, stmt = TEST_CODE, number = 1)
    print("It took " + str(times) + " seconds to execute")


if __name__ == "__main__":
    timeitImportContent()
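Since my question is about threads: a hedged sketch of what I have in mind (the helper name `insert_chunks_threaded` and the worker count are my own, not from my script) is to split the rows into chunks and submit each chunk to `insert_many` from a thread pool. pymongo's `MongoClient` is thread-safe and keeps its own connection pool, so one client can be shared across workers:

```python
import concurrent.futures


def insert_chunks_threaded(collection, chunks, max_workers=4):
    """Submit each chunk (a list of dicts) to collection.insert_many in parallel.

    `collection` is any object with an insert_many(docs) method, e.g. a
    pymongo Collection. Errors raised by a worker are re-raised here.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(collection.insert_many, chunk) for chunk in chunks]
        for future in concurrent.futures.as_completed(futures):
            future.result()  # re-raise any insert error from the worker
```

Whether this actually helps presumably depends on where the time goes: threads can overlap network round trips to MongoDB, but they will not speed up the Python-side CSV parsing.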

My attempt at processing the file as a stream:

import csv
import pymongo


def getSingleRow(filename):
    with open(filename, 'r') as csv_file:
        data = csv.DictReader(csv_file)
        for row in data:
            yield row
        # the with-statement closes the file; no explicit close() needed


def getData(filename):
    print("Attempting to connect to MongoDB.....")
    client = pymongo.MongoClient('localhost', 27017)
    collection = client['donorschoose']['MapData']
    print("Connection established.....")
    print("Inserting data into MongoDB")
    for row in getSingleRow(filename):
        collection.insert_one(row)
    print("Successfully inserted data into MongoDB")
    print("Attempting to close connection.....")
    client.close()
    print("Connection disconnected")


if __name__ == "__main__":
    getData("path/to/file/name")
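A hedged variant of this stream attempt (the `batched` helper and the `batch_size` value are my own additions, not part of my original script): instead of one `insert_one` round trip per row, group the streamed rows into fixed-size chunks and hand each chunk to `insert_many`:

```python
import csv
import itertools


def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def getDataBatched(filename, batch_size=1000):
    # pymongo imported here so the batching helper is usable on its own
    import pymongo
    client = pymongo.MongoClient('localhost', 27017)
    collection = client['donorschoose']['MapData']
    with open(filename, newline='') as csv_file:
        for chunk in batched(csv.DictReader(csv_file), batch_size):
            collection.insert_many(chunk)
    client.close()
```

This keeps memory bounded (only one chunk of rows is held at a time) while avoiding a per-row network round trip.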
  • We have a tool called mongoimport that can be used to import CSV files; check [this](https://stackoverflow.com/questions/4686500/how-to-use-mongoimport-to-import-csv) or search online for more. How much of the time did the insert take, and how much was spent on DictReader? – Takis Oct 14 '21 at 01:52
  • I used print() to see which step takes the longest. Everything happens almost instantly; it prints up to "Inserting data into MongoDB", then freezes a bit, then reports a successful import. – Kevin Truong Oct 14 '21 at 02:26
  • I need to use Pandas to process the data before importing it to Mongo – Kevin Truong Oct 14 '21 at 02:33
  • It looks like that 60MB is going into RAM.. You may need to process as a stream. I found this, which looks promising : https://stackoverflow.com/a/17444799/10431732 – Matt Oestreich Oct 14 '21 at 02:36
  • Thanks for your recommendation @MattOestreich. I don't have any problems reading the file; it happens almost instantly, but the line ```insert_many(data)``` takes a really long time to execute. Please review my attempt to process it as a stream. It took almost a minute to finish. – Kevin Truong Oct 14 '21 at 03:11
  • As an alternative, you may use the [mongoimport](https://docs.mongodb.com/database-tools/mongoimport/) tool. It should be faster, and you also get some stats at the end. – Wernfried Domscheit Oct 14 '21 at 08:44

0 Answers