
I'm getting `CSV column #10: CSV conversion error to string: invalid UTF8 data` while converting a large CSV to Parquet. From the error, it looks like the data in that column can't be converted to the String type because it contains invalid UTF-8 characters.

Maybe this could be fixed by specifying a proper encoding in pyarrow's ReadOptions, but I'm curious to know which line is causing the error.

Since it's a large file with millions of lines, I can't identify the offending line myself.

Is there any option in pyarrow's read_csv function to report bad lines? Alternatively, it would be fine to replace the offending cell with NaN or null.
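Roughly, the conversion looks like this (a minimal sketch; file names are placeholders):

import pyarrow.csv as pv
import pyarrow.parquet as pq

# read_csv raises pyarrow.lib.ArrowInvalid when a string column contains
# bytes that are not valid UTF-8
table = pv.read_csv("large_file.csv")
pq.write_table(table, "large_file.parquet")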

Avinash Raj
  • Do you have a reproducible example? You could add this option `convert_options=csv.ConvertOptions(column_types={"#10": pa.binary()})`. It will treat column "#10" as a binary string. Then once the data is loaded you can try to find out which row isn't UTF-8 compatible. Unfortunately, arrow doesn't have a `compute` function to check if a string is valid UTF-8 data (as far as I can tell), so you'll have to run the check manually. – 0x26res Sep 13 '21 at 13:49

2 Answers


There is no option today to report the line number or the failing line. There is some ongoing work to improve error handling, but even that work does not yet reveal the line number of decode errors. I'd recommend creating a JIRA issue.

As @0x26res correctly stated, you can specify the column as binary and then inspect it manually in memory. You can use the cast compute function to cast from binary to string, and that will perform UTF-8 validation, but unfortunately it does not report the failing index today either.
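A sketch of that approach, assuming the offending column is named "col10" (use whatever name your header gives column #10):

import pyarrow as pa
import pyarrow.csv as pv

# Load the problematic column as raw bytes so read_csv does not try to
# validate it as UTF-8.
convert_options = pv.ConvertOptions(column_types={"col10": pa.binary()})
table = pv.read_csv("large_file.csv", convert_options=convert_options)

# Decode each value in Python to find the rows that are not valid UTF-8.
for i, value in enumerate(table["col10"]):
    raw = value.as_py()
    if raw is None:
        continue
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        print(f"row {i}: {raw!r} ({exc})")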

As a workaround, you can use the pandas CSV parser, which should give you the byte offset of the failure:

>>> import pandas
>>> pandas.read_csv("/tmp/blah.csv")
Traceback (most recent call last):
  ... # Omitted for brevity
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 29: invalid start byte
Pace

I ran into the exact same problem.

If this is an option for you, you can find the offending lines on the command line with `grep -axv '.*' file.csv` (from this answer).

But I ended up not fixing the problem in the file; instead, I just scrub the input as I read it in, using the following code (slightly modified from what I'm actually using, so no guarantees it will work perfectly as-is):

import io
import pathlib

import pyarrow.csv as pv

class UnicodeErrorIgnorerIO(io.IOBase):
    """Simple wrapper for a BytesIO that removes non-UTF8 input.

    If a file contains non-UTF8 input, it causes problems in pyarrow and other libraries
    that try to decode the input to unicode strings. This just removes the offending bytes.

    >>> buffer = io.BytesIO(b"INT\xbfL LICENSING INDUSTRY MERCH ASSOC")
    >>> buffer = UnicodeErrorIgnorerIO(buffer)
    >>> buffer.read()
    b'INTL LICENSING INDUSTRY MERCH ASSOC'
    """

    def __init__(self, file: io.BytesIO) -> None:
        self.file = file

    def read(self, n=-1):
        return self.file.read(n).decode("utf-8", "ignore").encode("utf-8")

    def readline(self, n=-1):
        return self.file.readline(n).decode("utf-8", "ignore").encode("utf-8")

    def readable(self):
        return True


def read_csv(path: pathlib.Path):
    with open(path, "rb") as f:
        f = UnicodeErrorIgnorerIO(f)
        return pv.read_csv(f)
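For completeness, the returned table can then be written out as Parquet in the usual way (file names are placeholders):

import pathlib
import pyarrow.parquet as pq

table = read_csv(pathlib.Path("large_file.csv"))
pq.write_table(table, "large_file.parquet")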
Nick Crews