0

I'm trying to import 5'000 .txt files into a postgresql database. My script is running fine as long as it doesn't reach a line which doesn't fit the format. For example every file has a new line at the end which also causes the script to crash.

I've tried to handle exceptions but to no success...

My script:

import csv
import os
import sys

import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="demo",
    user="demo",
    password="123",
    port="5432"
)

cur = conn.cursor()

maxInt = sys.maxsize

while True:
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = int(maxInt / 10)


def searchFiles(directory='', extension=''):
    print('SEARCHING IN: ', directory)
    filelist = []
    extension = extension.lower()
    for dirpath, dirnames, files in os.walk(directory):
        for name in files:
            if extension and name.lower().endswith(extension):
                filelist.append(os.path.join(dirpath, name))
            elif not extension:
                print('FAILED TO READ: ', (os.path.join(dirpath, name)))
    print('FINISHED FILE SEARCH AND FOUND ', str(len(filelist)), ' FILES')
    return filelist


def importData(fileToImport):
    with open(fileToImport, 'r') as f:
        reader = csv.reader(f, delimiter=':')

        for line in reader:
            try:
                cur.execute("""INSERT INTO demo VALUES (%s, %s)""", (line[0], line[1]))
                conn.commit()
            except:
                pass
                print('FAILED AT LINE: ', line)


print(conn.get_dsn_parameters())
cur.execute("SELECT version();")
record = cur.fetchone()
print("You are connected to - ", record)

fileList = searchFiles('output', '.txt')

counter = 0
length = len(fileList)
for file in fileList:
    # if counter % 10 == 0:
    print('Processing File: ', str(file), ', COMPLETED: ', str(counter), '/', str(length))
    importData(str(file))
    counter += 1
print('FINISHED IMPORT OF ', str(length), ' FILES')

A few lines of the data I'm trying to import:

example1@example.com:123456
example2@example.com:password!1

The error I'm getting:

File "import.py", line 66, in <module>
    importData(str(file))
File "import.py", line 45, in importData
    for line in reader:
_csv.Error: line contains NULL byte

How should I handle lines which can not get imported?

Thanks for any help

Baum mit Augen
  • 49,044
  • 25
  • 144
  • 182
In0cenT
  • 481
  • 2
  • 11
  • 25
  • Can you elaborate on how your code "doesn't work"? What were you expecting, and what actually happened? If you got an exception/error, post the line it occurred on and the exception/error details as per the How to Create a [mcve] page. Please [edit] your question to add these details into it or we may not be able to help. – Patrick Artner Apr 25 '19 at 14:56

1 Answers1

0

Your traceback shows the source of the exception in for line in reader:

File "import.py", line 45, in importData
    for line in reader:
_csv.Error: line contains NULL byte

and you do not handle exceptions at that point. As the exception suggests, it is raised by your csv reader instance. While you certainly could wrap your for loop in a try-except block, your loop will still end once the exception raises.

This exception may be caused by the file having a different encoding than your locale's, which is assumed by open() if no encoding is explicitly provided:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

The accepted answer in this Q&A outlines a solution to deal with that, provided that you can identify the correct encoding to open the file with. The Q&A also shows some approaches on how to get rid of NULL bytes in the file, prior to handing it over to a reader.

You might also want to simply skip empty lines instead of firing them to your DB and handle the exception, e.g.

for line in reader:
    if not line:
        continue
    try:
        [...]
shmee
  • 4,721
  • 2
  • 18
  • 27