
Given below is the code for importing a pipe-delimited CSV file into MongoDB.

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_many(jsonfile)

Below is the error I get when running the above code.

Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError

If I modify the code by indenting those lines under the for loop, the same data gets imported into MongoDB over and over again without stopping.

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
        jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_many(jsonfile)
CyberNoob
  • Using [mongoimport](https://docs.mongodb.com/database-tools/mongoimport/) might be the better option. Note, `mongoimport` also accepts input from STDIN, you could convert lines in python and then print to STDIN instead of writing a separate file. – Wernfried Domscheit Jan 15 '22 at 10:54
  • I don't know python but perhaps insert like `if (row % 1000 == 0) customer.insert_many(jsonfile)`, i.e. insert documents in batches of 1000 (a sketch of this appears after these comments) – Wernfried Domscheit Jan 15 '22 at 11:00
  • @Wernfried Domscheit this is the result ` Traceback (most recent call last): File "E:\Anaconda Projects\Mongo Projects\SDR Tool\csvtojson.py", line 18, in if row % 1000 == 0: TypeError: unsupported operand type(s) for %: 'dict' and 'int' ` – CyberNoob Jan 15 '22 at 11:18
  • Can you add a small sample of the file format - just 3 or 4 lines – Belly Buster Jan 15 '22 at 11:48
  • ^^ Plus a header row (if you have one) – Belly Buster Jan 15 '22 at 12:03
  • @Belly Buster The problem comes exactly here, because the header is different for each csv file. For example, Registration No|Date of Birth|Profession etc. changes to RegNo|DOB|Job in another csv file. – CyberNoob Jan 15 '22 at 12:48
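
For reference, the batch-insert suggestion from the comments fails as written because `row` is a dict produced by DictReader, not a counter, which is what causes the `TypeError` on `row % 1000`; enumerate() supplies the running count instead. A minimal sketch, assuming the same Names.txt file and the Office.Customer collection from the question:

import csv
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customer = client.Office.Customer

batch = []
with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    # enumerate() supplies the row counter that `row % 1000` was reaching for
    for i, row in enumerate(csv_reader, start=1):
        batch.append(row)
        if i % 1000 == 0:  # insert in batches of 1000, then start a new batch
            customer.insert_many(batch)
            batch = []
    if batch:  # flush the final partial batch
        customer.insert_many(batch)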

2 Answers


I would recommend using pandas; it provides a "chunked" mode by setting a chunksize parameter, which you can tune to your memory limitations. insert_many() is also more efficient than inserting documents one at a time.

Plus the code becomes much simpler:

import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.Office
filename = "Names.txt"

# read the file in chunks of 1000 rows so it is never held in memory all at once
with pd.read_csv(filename, chunksize=1000, delimiter='|') as reader:
    for chunk in reader:
        db.mycollection.insert_many(chunk.to_dict('records'))

If you post a file sample I can update to match.

Belly Buster
  • File "C:\Users\Predator\anaconda3\lib\site-packages\pandas\io\parsers\python_parser.py", line 722, in _alert_malformed raise ParserError(msg) pandas.errors.ParserError: Expected 48 fields in line 117143, saw 49 – CyberNoob Jan 15 '22 at 12:42
  • Given Below is the code I used. `with pd.read_csv(csv_file, chunksize=1000, delimiter='|', engine='python', encoding='latin-1', quoting=csv.QUOTE_NONE) as reader: for chunk in reader:` – CyberNoob Jan 15 '22 at 12:50
  • I can't really help more without seeing a file sample. – Belly Buster Jan 15 '22 at 13:12
  • Also that error is pretty straightforward - you have 48 fields in your headers and a line with 49 entries (one way to skip such lines is sketched after these comments). – Belly Buster Jan 15 '22 at 13:15
  • This is because some text might have a comma inserted in between. – CyberNoob Jan 15 '22 at 13:20
  • Is it possible to use chunksize=1000 with this code: `csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE) as reader:`? – CyberNoob Jan 15 '22 at 13:28
  • If you set the delimiter to `|` a comma won't be an issue. You might have an extra `|` though. – Belly Buster Jan 15 '22 at 14:31
  • I tried using the sed command and replaced all | with comma and the file was loaded without any errors. – CyberNoob Jan 15 '22 at 14:42
  • This was my issue https://stackoverflow.com/questions/70708872/how-to-convert-pipe-delimited-to-csv-or-json/70710266?noredirect=1#comment125004652_70710266 – CyberNoob Jan 15 '22 at 14:45
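
For the ParserError mentioned in the comments above (48 header fields, a line with 49), pandas can skip or report malformed rows via the on_bad_lines parameter (pandas 1.3+; older versions use error_bad_lines=False). A minimal sketch, assuming the same Names.txt, the options used in the comments, and that dropping the offending rows is acceptable:

import csv
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.Office

# on_bad_lines='skip' drops rows whose field count does not match the header
# instead of raising ParserError; use 'warn' to see which lines get dropped
with pd.read_csv("Names.txt", chunksize=1000, delimiter='|',
                 quoting=csv.QUOTE_NONE, encoding='latin-1',
                 engine='python', on_bad_lines='skip') as reader:
    for chunk in reader:
        db.Customer.insert_many(chunk.to_dict('records'))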

The memory issue can be solved by inserting one record at a time.

import csv
import json

from pymongo import MongoClient

url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer
jsonArray = []
file_txt = "Text.txt"
rowcount = 0
with open(file_txt, "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    # read every row into memory first
    for row in csv_reader:
        rowcount += 1
        jsonArray.append(row)
    # then serialize and insert the rows one document at a time,
    # which avoids building a single huge JSON string for the whole file
    for i in range(rowcount):
        jsonString = json.dumps(jsonArray[i], indent=1, separators=(",", ":"))
        jsonfile = json.loads(jsonString)
        customer.insert_one(jsonfile)
print("Finished")

Thank you all for your ideas.

CyberNoob