
I have a 102 GB CSV file exported from MongoDB that I'm trying to load into Postgres. The file contains ~55 million rows, and I'm using `\copy` for the import. However, I get a carriage return error on line 47,867,184:

ERROR: unquoted carriage return found in data
HINT: Use quoted CSV field to represent carriage return.
CONTEXT: COPY reso_facts, line 47867184 

To my knowledge, Postgres doesn't allow skipping bad rows on import, so it seems I need to fix the file itself. Is there a way to delete a CSV row in place using Python? I'd strongly prefer not to write a full copy of the file to an external hard drive.

I found this elegant solution for txt files:

import os
from mmap import mmap
def removeLine(filename, lineno):
    f=os.open(filename, os.O_RDWR)
    m=mmap(f,0)
    p=0
    for i in range(lineno-1):
        p=m.find('\n',p)+1
    q=m.find('\n',p)
    m[p:q] = ' '*(q-p)
    os.close(f)

But it errors out when fed a CSV: `TypeError: a bytes-like object is required, not 'str'`. Is there a way to modify the above for CSVs, or is there an alternative method entirely?
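
Edit: as the comments below point out, the snippet above is Python 2 code, and the TypeError has nothing to do with CSV vs. txt: in Python 3, mmap operates on bytes, so the string literals need b'' prefixes. For anyone who finds this later, here is a minimal Python 3 sketch of the same approach (assuming the only changes needed are the b'' prefixes plus an explicit flush/close; untested on a file this large). It keeps the original removeLine name and simply blanks the target line with spaces, so the file length never changes:

import os
from mmap import mmap

def removeLine(filename, lineno):
    # Overwrite line `lineno` (1-based) with spaces, in place.
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for _ in range(lineno - 1):   # skip past the first lineno-1 newlines
        p = m.find(b'\n', p) + 1
    q = m.find(b'\n', p)          # end of the target line
    m[p:q] = b' ' * (q - p)      # blank the line; file length is unchanged
    m.flush()
    m.close()
    os.close(f)

One caveat: this leaves a whitespace-only line behind, which `\copy` may itself reject (e.g. as missing data for a multi-column table), so the bad row may need to be overwritten with a well-formed dummy row rather than plain spaces.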

mmz
  • I'm assuming the answer is for Python 2. You can try adding the `b'...'` prefix to the strings: `b'\n'` and `b' ' * (q - p)` – Andrej Kesely Jan 31 '23 at 21:24
  • Thanks @AndrejKesely, that was exactly the issue. I've never worked with Python 2, so I missed that. Closing the question – mmz Jan 31 '23 at 21:37

0 Answers