Using pandas.read_csv, how can one process all errors and still receive all non-error data?

Question

Data which, for me, generates an exception instead of invoking the 'on_bad_lines' handler is at:

https://opencalaccess.org/misc/NAMES_CD.TSV

I have this:

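import os
import traceback

import pandas as pd

bad_lines = list()

def bad_line_finder(x):
    # With engine='python', x is the bad line already split into a list of fields.
    bad_lines.append(str(x))
    return None  # returning None tells pandas to skip the line


# dir is the directory that holds the data files (set elsewhere in the script)
for file in os.listdir(dir):
    bad_lines = list()

    try:
        for df in pd.read_csv(f"{dir}/{file}",
                              sep='\t',
                              on_bad_lines=bad_line_finder,
                              engine='python',
                              chunksize=1000):

            print(f"\n{file}")
            df.info()

            print(f"Bad Lines: {bad_lines}")
            bad_lines = list()

    except Exception:
        print("EXCEPTION:")
        traceback.print_exc()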

This works well: there are errors in the files, and the handler records them so that I can keep track of them. But why do I still see this?

EXCEPTION:
Traceback (most recent call last):
  File "/home/ray/Projects/opencalaccess-data/import.py", line 41, in <module>
    for df in pd.read_csv(f"{dir}/{file}",
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1698, in __next__
    return self.get_chunk()
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1810, in get_chunk
    return self.read(nrows=size)
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1778, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 250, in read
    content = self._get_lines(rows)
  File "/home/ray/Projects/opencalaccess-data/.venv/lib/python3.10/site-packages/pandas/io/parsers/python_parser.py", line 1114, in _get_lines
    new_rows.append(next(self.data))
_csv.Error: '   ' expected after '"'

What is the "on_bad_lines" option doing if it does not handle all of the bad lines? Which of them will it handle and which will it not?
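For context, the pandas documentation describes a "bad line" as a line with too many fields, while the traceback above ends inside Python's own `csv` reader (`next(self.data)`), before any field counting can happen. The message is one the `csv` module produces in strict mode when a closing quote is followed by something other than the delimiter. A minimal sketch that reproduces this class of error with the standard library alone, using a made-up line rather than a line from NAMES_CD.TSV:

```python
import csv

# Hypothetical tab-separated line: a quoted field followed by stray text.
line = '2\t"bob"x\tSacramento\n'

def tokenize(line, **kwargs):
    """Tokenize a single line with the csv module and return its fields."""
    return next(csv.reader([line], delimiter="\t", **kwargs))

# Strict quoting rules: the character after the closing quote must be the
# delimiter, so this line is rejected with csv.Error.
try:
    tokenize(line, strict=True)
    strict_error = None
except csv.Error as exc:
    strict_error = str(exc)

# With quoting disabled, the double quotes are kept as ordinary characters.
literal_quotes = tokenize(line, quoting=csv.QUOTE_NONE)

print(strict_error)
print(literal_quotes)
```

If the quotes in the data are meant literally, passing `quoting=csv.QUOTE_NONE` to `pd.read_csv` may avoid the exception; whether that suits this data set is an assumption that would need checking against the file.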

This is a government data source. There are format errors in the data that the agency cannot correct, because they constitute the official record. So I must fix them myself. But which of them raise exceptions and which do not?
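One way to find out which lines fall into which group is to pre-scan the file with the `csv` module, tokenizing each line on its own so that a single malformed line cannot abort the whole run. The following is only a sketch: `split_rows` and the sample data are hypothetical, it assumes no field spans more than one physical line, and it takes the expected field count from the first line that parses. Unlike pandas, it also flags lines with too few fields.

```python
import csv

def split_rows(lines, sep="\t"):
    """Return (good_rows, bad_rows); each line is tokenized independently."""
    good, bad = [], []
    expected = None  # field count, taken from the first line that parses
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            # strict=True makes the csv module raise on malformed quoting
            row = next(csv.reader([line], delimiter=sep, strict=True))
        except csv.Error as exc:
            bad.append((line_no, f"csv error: {exc}", line))
            continue
        if expected is None:
            expected = len(row)
        if len(row) != expected:
            bad.append((line_no, f"{len(row)} fields, expected {expected}", line))
        else:
            good.append(row)
    return good, bad

# Made-up sample: line 3 has a stray character after a closing quote,
# line 4 has one field too many.
sample = [
    "id\tname\n",
    "1\talice\n",
    '2\t"bob"x\n',
    "3\tcarol\textra\n",
    "4\tdave\n",
]
good, bad = split_rows(sample)
for line_no, reason, _ in bad:
    print(line_no, reason)
```

For a real file, an open file object (`open(path, newline="")`) can be passed in place of `sample`, and the surviving rows can be loaded with `pd.DataFrame(good[1:], columns=good[0])`.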

Tested in `pandas 2.0.0`: `df = pd.read_csv('d:/data/NAMES_CD.TSV', sep='\t', on_bad_lines='warn')` skipped lines 275386 and 383815. `warn` tells you which lines. — Trenton McKinney, Apr 05 '23 at 04:33
`with open('bad_lines.csv', 'a') as fp: df = pd.read_csv('NAMES_CD.TSV', sep='\t', on_bad_lines=partial(write_bad_line, sep=';', fp=fp), engine='python')` — Trenton McKinney, Apr 05 '23 at 04:46
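The second comment does not show `write_bad_line` itself. A guess at what such a helper might look like is sketched below; the body is an assumption, not the commenter's code. With `engine='python'`, pandas passes the rejected line to the callable as a list of field strings, and a return value of `None` drops the line.

```python
import io
from functools import partial

def write_bad_line(bad_line, fp, sep=";"):
    """Hypothetical helper: append one rejected line to an open file."""
    # bad_line arrives as a list of field strings
    fp.write(sep.join(bad_line) + "\n")
    return None  # None means the line is left out of the DataFrame

# Stand-in for an open file so the sketch runs without touching disk.
fp = io.StringIO()
handler = partial(write_bad_line, sep=";", fp=fp)

# Simulate pandas handing over one line with too many fields.
handler(["3", "carol", "extra"])
logged = fp.getvalue()
print(logged)
```

The same `handler` could be passed as `on_bad_lines=handler`. It would only receive lines that pandas routes to the callback, so a line that makes the `csv` reader raise, as in the traceback in the question, would presumably never reach it.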
