
I have a ~4GB text file which is pipe-delimited. I am trying to import this text into MongoDB, but as you know MongoDB supports only JSON and CSV files for import. Below is the code so far.

import pandas as pd
import csv
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
# Creating Database Office
db = client.Office
# Creating Collection Customers
customers = db.Customers

filename = "Names.txt"
data_df = pd.read_fwf(filename, sep="|", engine="python", encoding="latin-1")
fileout = "Names.csv"
output = data_df.to_csv(fileout, sep=",")
print("Finished")
fin = open("Names.csv", "r")
file_data = fin.read()
file_csv = csv.reader(file_data)
customers.insert_many(file_csv)

The input file "Names.txt" looks like this

Reg|Name|DOB|Friend|Nationality|Profession^M
1122|Sam|01/01/2001|John|USA|Lawyer^M
2456|George|05/10/1999|Pit|Canada|Engineer^M
5645|Brad|02/06/2000|Adam|UK|Doctor^M

If the provided text file is CSV, then simply import it to MongoDB; if the txt file is pipe-delimited (or uses any other delimiter), import it to MongoDB only after processing the text file into a CSV file. When the CSV file that I get in fileout is imported manually into MongoDB, the result looks like this.

col1          col2
id    Reg|Name|DOB|Friend|Nationality|Profession
1     1122|Sam|01/01/2001|John|USA|Lawyer
2     2456|George|05/10/1999|Pit|Canada|Engineer
3     5645|Brad|02/06/2000|Adam|UK|Doctor

What I want to achieve is shown below. This was done with the sed command. First I replaced any "," in the txt file with "-" using the command

sed -i 's/,/-/g' Names.txt

then I replaced the pipe delimiter with ",":

sed -i 's/|/,/g' Names.txt

Importing the resulting CSV manually then gives:
col1 col2  col3   col4       col5    col6        col7
id   Reg   Name   DOB        Friend  Nationality Profession
1    1122  Sam    01/01/2001 John    USA         Lawyer
2    2456  George 05/10/1999 Pit     Canada      Engineer
3    5645  Brad   02/06/2000 Adam    UK          Doctor
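The same preprocessing the two `sed` commands perform can also be done in Python, streaming line by line so the ~4GB file never has to be loaded whole. A sketch (the sample rows are written out first only to make it self-contained):

```python
# Sample pipe-delimited input like Names.txt (written here so the sketch is self-contained)
sample = (
    "Reg|Name|DOB|Friend|Nationality|Profession\n"
    "1122|Sam|01/01/2001|John|USA|Lawyer\n"
)
with open("Names.txt", "w", encoding="latin-1") as f:
    f.write(sample)

# Mirror the two sed commands: first replace any "," with "-",
# then turn the pipe delimiters into commas.
# Streaming line by line keeps memory usage flat even for a ~4GB file.
with open("Names.txt", "r", encoding="latin-1") as src, \
     open("Names.csv", "w", encoding="latin-1") as dst:
    for line in src:
        dst.write(line.replace(",", "-").replace("|", ","))
```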

I know that the code is not doing anything, but I can't figure out how to make it work.

I am new to programming in general, and I have searched through various answers to this question and other related questions on the site, but none fits my needs.

UPDATE

import csv
import json
from pymongo import MongoClient

url = "mongodb://localhost:27017"
client = MongoClient(url)
db = client.Office
customer = db.Customer
jsonArray = []

with open("Names.txt", "r") as csv_file:
    csv_reader = csv.DictReader(csv_file, dialect='excel', delimiter='|', quoting=csv.QUOTE_NONE)
    for row in csv_reader:
        jsonArray.append(row)
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
    jsonfile = json.loads(jsonString)
    customer.insert_many(jsonfile)

This is the new code I came up with after getting some ideas from the comments. But now the only problem is that I get this error.

Traceback (most recent call last):
  File "E:\Anaconda Projects\Mongo Projects\Office Tool\csvtojson.py", line 16, in <module>
    jsonString = json.dumps(jsonArray, indent=1, separators=(",", ":"))
  File "C:\Users\Predator\anaconda3\lib\json\__init__.py", line 234, in dumps
    return cls(
  File "C:\Users\Predator\anaconda3\lib\json\encoder.py", line 201, in encode
    chunks = list(chunks)
MemoryError
CyberNoob
  • As an aside, repeatedly overwriting the same file with `sed -i` is inefficient, inelegant, and error-prone; it is easy to combine the two `sed` scripts into one. See e.g. https://stackoverflow.com/questions/7657647/combining-two-sed-commands – tripleee Jan 14 '22 at 10:06
  • Doesn't `read_csv(filename, sep='|')` trivially do what you ask? – tripleee Jan 14 '22 at 10:07
  • Please [don’t post images of code, error messages, or other textual data.](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors) – tripleee Jan 14 '22 at 10:11
  • @tripleee I have already tried the read_csv() but still gets the same result. – CyberNoob Jan 14 '22 at 10:16
  • **Please** don't add back the pleasantries to your question. We prefer posts to remain strictly focused on the technical content. You can express gratitude by upvoting useful contributions once you start receiving them, but the chance is lower if your question is full of junk. – tripleee Jan 14 '22 at 10:17
  • Please [edit] to provide a (small!) sample of the input file, as actual text we can copy/paste. If it is really in a fixed-width format, that detail needs to be explained with more details about the input format. – tripleee Jan 14 '22 at 10:19
  • @tripleee I will keep that in mind. Should I edit the post? – CyberNoob Jan 14 '22 at 10:19
  • I already rolled back the latest edit, but still please update the technical details as requested above; provide a sample of the input, and ideally replace the images with similarly formatted text instead. – tripleee Jan 14 '22 at 10:20

2 Answers


Pandas read_fwf() is for data files where the data sits in fixed-width columns. Sometimes such files have a separator as well (usually a pipe character, to make the data table easier to read).

You can read a pipe-separated file with read_csv(). Just use sep='|':

df = pd.read_csv(filename, sep='|')

Now you can insert the data into the mongo collection by converting the dataframe to a list of dicts this way:

customers.insert_many( df.to_dict(orient='records') )
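If the file is too large to read in one go, `read_csv()` also accepts a `chunksize` parameter, which yields the DataFrame in pieces so each batch can be inserted separately. A sketch (the sample file and chunk size are illustrative; the `insert_many` call is left as a comment since it needs a live connection):

```python
import pandas as pd

# A small pipe-delimited sample so the sketch is self-contained
with open("Names.txt", "w") as f:
    f.write("Reg|Name|DOB\n1122|Sam|01/01/2001\n2456|George|05/10/1999\n5645|Brad|02/06/2000\n")

# chunksize makes read_csv yield DataFrames of at most that many rows,
# so a ~4GB file never has to fit in memory at once.
batches = []
for chunk in pd.read_csv("Names.txt", sep="|", chunksize=2):
    records = chunk.to_dict(orient="records")
    batches.append(records)
    # With a live connection this would be: customers.insert_many(records)
```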
Iñigo González
  • This is the error when trying the above method: `Traceback (most recent call last): File "E:\Anaconda Projects\Mongo Projects\SDR Tool\pymongogridfs.py", line 52, in <module> data_df = pd.read_csv(filename, sep="|", engine="python", encoding="latin-1", quoting=csv.QUOTE_NONE) File "C:\Users\Predator\anaconda3\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper return func(*args, **kwargs) […] pandas.errors.ParserError: Expected 48 fields in line 117481, saw 49` – CyberNoob Jan 14 '22 at 12:22
  • But if I write the above pipe-delimited file to a new csv file using csv.writer(), then no error pops up. – CyberNoob Jan 14 '22 at 12:27
  • Your pipe-separated file has a pipe inside a field. Look at line 117481 and find out if it has some kind of escape character or quotation on it. You can try loading it with `pandas.read_csv()` using the `on_bad_lines='skip'` or `on_bad_lines='warn'` option. But *do take a look at the data* anyway. – Iñigo González Jan 14 '22 at 13:06
  • I updated the code but it shows a memory error. I think it is possible to bypass this error with chunks, but I don't know how to implement it. – CyberNoob Jan 15 '22 at 06:10

Finally found the solution.

I tested it on a 5GB file; although slow, it still works. It imports all the data from a pipe-delimited txt file into MongoDB.

import csv

from pymongo import MongoClient

url_mongo = "mongodb://localhost:27017"
client = MongoClient(url_mongo)
db = client.Office
customer = db.Customer
file_txt = "Text.txt"
with open(file_txt, "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    # Each row from DictReader is already a dict, so it can be inserted
    # directly; no json.dumps/json.loads round trip is needed, and
    # inserting while reading keeps memory usage flat.
    for row in csv_reader:
        customer.insert_one(row)
print("Finished")
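Since per-row `insert_one()` calls are what makes this slow, the same loop can batch rows and use `insert_many()` instead. A sketch (the batch size of 2 and the sample file are illustrative only; against a real server the commented `insert_many` line would replace the list bookkeeping):

```python
import csv
from itertools import islice

# A small pipe-delimited sample so the sketch runs standalone
with open("Text.txt", "w") as f:
    f.write("Reg|Name\n1122|Sam\n2456|George\n5645|Brad\n")

def batched(reader, size):
    # Yield lists of up to `size` rows from the reader
    while True:
        batch = list(islice(reader, size))
        if not batch:
            return
        yield batch

inserted = []
with open("Text.txt", "r") as txt_file:
    csv_reader = csv.DictReader(txt_file, dialect="excel", delimiter="|", quoting=csv.QUOTE_NONE)
    for batch in batched(csv_reader, 2):
        # With a live connection this would be: customer.insert_many(batch)
        inserted.extend(batch)
```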

Thank you all for your ideas.

CyberNoob