
I have the following to extract data from a CSV into an SQLite db:

    elif file.endswith('.csv'):
        read_path = os.path.join(dir_read, file)
        with open(read_path, 'r', encoding='utf-8') as fin:
            dr = csv.DictReader(fin)
            to_db = [(i['InvoiceNumber'], i['InvoiceType'], i['ChargeType'], i['SupplierID'], i['Net_Amount']) for i in dr]

        c.executemany("INSERT INTO mf (InvoiceNumber, InvoiceType, ChargeType, SupplierID, Net_Amount) VALUES (?, ?, ?, ?, ?);", to_db)
        con.commit()

This works for most of the 200 files I am importing, but a couple of them contain a NUL byte in one of the fields (not one of the fields used in `to_db`). In Notepad++ I can see that one row has a NUL where all the other rows have an empty string. The error I get is:

_csv.Error: line contains NULL byte

I have tried different encodings (utf-16, utf-16-le, utf-16-be), but then all the other CSV files fail.

Is there a way to remove these NUL bytes and replace them with an empty string when importing?
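One common workaround (a sketch only: the column names come from the question, but the `Notes` column, the sample values, and the `nul_free` helper are invented for illustration) is to strip NUL bytes from each line before the `csv` module parses it:

```python
import csv
import io

# Sample data mimicking the problem; the final field of the data row
# contains a NUL byte instead of an empty string.
raw = (
    "InvoiceNumber,InvoiceType,ChargeType,SupplierID,Net_Amount,Notes\n"
    "INV-001,Standard,Freight,S01,100.00,\x00\n"
)

def nul_free(lines):
    """Yield each line with any NUL bytes replaced by an empty string."""
    for line in lines:
        yield line.replace('\0', '')

# Wrap the file object (here a StringIO stand-in) before csv sees it,
# so DictReader never encounters a NUL byte.
dr = csv.DictReader(nul_free(io.StringIO(raw)))
rows = list(dr)
```

In the question's loop this would mean passing `(line.replace('\0', '') for line in fin)` to `csv.DictReader` instead of `fin` directly; `csv.DictReader` accepts any iterable of strings, not just a file object.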

dhoggard (edited by Brian Tompsett - 汤莱恩)
  • This has nothing to do with encodings. Whoever created that file decided to emit a NUL for missing values instead of an empty string or the `NULL` string. You'll have to handle this explicitly – Panagiotis Kanavos Dec 10 '20 at 11:59
  • BTW you aren't extracting data from a CSV into SQLite, you're *importing* it. Something that can be done from the command line or the `.import` SQLite command. You can even handle that file as a [CSV virtual table](https://sqlite.org/csv.html). – Panagiotis Kanavos Dec 10 '20 at 12:01
  • BTW NUL is valid in UTF8 – Panagiotis Kanavos Dec 10 '20 at 12:14

1 Answer


You can use the `codecs` library to handle the NULL-byte error.

import csv
import codecs
csv_reader = csv.reader(codecs.open(read_path, 'r', 'utf-16'))
Caner Burc BASKAYA (edited by Dharman)
  • How is this going to fix the problem? `NUL` is valid in UTF8 and the OP already tried different encodings, including `utf-16`. The file is definitely *not* UTF16, otherwise all English characters, including commas, would have a leading `0x00` – Panagiotis Kanavos Dec 10 '20 at 12:15
  • Hi Panagiotis, thanks for your feedback. Actually, the thing that is supposed to work is the `codecs.open()` function. And you may be right about the encoding, because he says none of them worked; we are not 100% sure about the encoding part since we do not have the file itself to try. But the `codecs.open` method may sometimes help with problematic issues such as the null-byte error. You may have a look here if you are interested: https://docs.python.org/3/library/codecs.html – Caner Burc BASKAYA Dec 10 '20 at 12:46
  • The null byte isn't an error in UTF8 or any other encoding. `open` already handles encodings and the page you linked to says `While the builtin open() and the associated io module are the recommended approach for working with encoded text files, this module provides additional utility functions and classes that allow the use of a wider range of codecs when working with binary files`. That's not the case with `codecs.open`, in fact there are plans to deprecate it – Panagiotis Kanavos Dec 10 '20 at 13:43
  • Check [Difference between open and codecs.open in Python](https://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python). Deprecation hasn't happened yet for compatibility reasons, not because there are missing features – Panagiotis Kanavos Dec 10 '20 at 13:44