I'm getting the error "CSV column #10: CSV conversion error to string: invalid UTF8 data"
while converting a large CSV file to Parquet. From the error, it looks like the data in column 10 can't be converted to the string
type because it contains invalid UTF-8 bytes.
Maybe this could be fixed by setting the proper encoding in pyarrow.csv.ReadOptions, but I'd like to know which line is causing the error.
Since the file is large (millions of lines), I can't find the offending line by hand.
Is there any option in pyarrow's read_csv function to report bad lines? Or, even better, is there a way to replace that particular cell with NaN
or NULL?
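
For context, the only fallback I can think of is scanning the raw bytes myself, outside of pyarrow. Below is a minimal sketch of that idea, assuming the file is supposed to be UTF-8 and that no quoted field contains an embedded newline (otherwise physical line numbers won't match CSV rows). The function name find_invalid_utf8_lines and its limit parameter are my own hypothetical names, not anything from pyarrow.

```python
def find_invalid_utf8_lines(path, limit=20):
    """Return a list of (line_number, error_message) for physical lines
    whose bytes are not valid UTF-8. Stops after `limit` hits."""
    bad = []
    # Open in binary mode so Python does not try to decode the data itself.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc reports the offending byte and its offset within the line.
                bad.append((lineno, str(exc)))
                if len(bad) >= limit:
                    break
    return bad


# Hypothetical usage (the path is a placeholder):
# for lineno, message in find_invalid_utf8_lines("large_file.csv"):
#     print(lineno, message)
```

This only tells me which lines are bad. It doesn't replace the cell with NULL, and it means reading the whole file a second time, which is why I'm hoping pyarrow has something built in.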