
I have a CSV file that I am parsing with Python. I found that some rows in the file have a different number of columns.

001;Snow,Jon;19801201
002;Crom,Jake;19920103
003; ;Wise,Frank;19880303   <-- Invalid row
004;Wiseau,Tommy;4324;1323;2323  <-- Invalid row

I would like to write these invalid rows into a separate text file.

I used this line of code to read from the file.

df = pd.read_csv('names.csv', header=None, sep=';')

One solution I found here was to skip the problematic rows using the following code:

data = pd.read_csv('file1.csv', on_bad_lines='skip')

I could change 'skip' to 'warn', which reports the row number of each problematic row and then skips it. But that only produces warning messages, not the row itself.
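For context, 'skip' silently drops every row whose field count differs from the first row's. A minimal sketch of that behavior on the sample data above (using an in-memory `StringIO` stand-in for the file, which is my own choice for the sketch):

```python
import io

import pandas as pd

csv_text = """001;Snow,Jon;19801201
002;Crom,Jake;19920103
003; ;Wise,Frank;19880303
004;Wiseau,Tommy;4324;1323;2323
"""

# the first row fixes the expected field count at 3;
# row 003 (4 fields) and row 004 (5 fields) are dropped silently
df = pd.read_csv(io.StringIO(csv_text), header=None, sep=';', on_bad_lines='skip')
print(df)  # only rows 001 and 002 survive
```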

Trenton McKinney

2 Answers


Since pandas 1.4.0, the on_bad_lines parameter accepts a callable, which allows you to apply more sophisticated handling of bad lines.

New in version 1.4.0:

callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python".


So you can pass a custom function which will write each encountered bad line to a specific file and return None (to skip that line during dataframe generation).

import pandas as pd
from functools import partial

def write_bad_line(line, fp, sep=','):
    fp.write(sep.join(line) + '\n')
    return None  # return None to skip the line while processing

bad_lines_fp = open('bad_lines.csv', 'a')
df = pd.read_csv('test.csv', header=None, sep=';', engine='python',
                 on_bad_lines=partial(write_bad_line, sep=';', fp=bad_lines_fp))
bad_lines_fp.close()
print(df)

The output of the dataframe:

   0          1         2
0  1   Snow,Jon  19801201
1  2  Crom,Jake  19920103

The contents of bad_lines.csv (via cat command):

$ cat bad_lines.csv
003; ;Wise,Frank;19880303
004;Wiseau,Tommy;4324;1323;2323
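A variant of the same idea, if you would rather inspect the bad rows in memory than write them to a file: the callable can append each offending row to a list and return None. (The lambda and the `bad_rows` name are illustrative, not from the answer above; the sample data is read from a `StringIO` to keep the sketch self-contained.)

```python
import io

import pandas as pd

csv_text = """001;Snow,Jon;19801201
002;Crom,Jake;19920103
003; ;Wise,Frank;19880303
004;Wiseau,Tommy;4324;1323;2323
"""

bad_rows = []  # each entry is a bad line, already split on the sep

df = pd.read_csv(io.StringIO(csv_text), header=None, sep=';', engine='python',
                 on_bad_lines=lambda fields: bad_rows.append(fields))
# list.append() returns None, so every bad line is skipped after being recorded
```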
RomanPerekhrest
  • `with open('bad_lines.csv', 'a') as fp: df = pd.read_csv('test.csv', header=None, sep=';', engine='python', on_bad_lines=partial(write_bad_line, sep=';', fp=fp))` to avoid the need for `bad_lines_fp.close()` – Trenton McKinney Apr 05 '23 at 04:50

You could split the CSV file with a script that you run before loading it in pandas, such as:

with open('names.csv') as src, open('good.csv', 'w') as good, open('bad.csv', 'w') as bad:
    for line in src:
        if line.count(';') == 2: # or any other appropriate criterion
            good.write(line)
        else:
            bad.write(line)
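Put end to end, the pre-split approach looks roughly like this (writing the sample input under a temporary directory is my own choice for the sketch; the split logic is the loop above):

```python
import os
import tempfile

import pandas as pd

tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, 'names.csv')
good_path = os.path.join(tmpdir, 'good.csv')
bad_path = os.path.join(tmpdir, 'bad.csv')

with open(src_path, 'w') as f:
    f.write('001;Snow,Jon;19801201\n'
            '002;Crom,Jake;19920103\n'
            '003; ;Wise,Frank;19880303\n'
            '004;Wiseau,Tommy;4324;1323;2323\n')

# split: exactly 2 separators means 3 fields, i.e. a well-formed row
with open(src_path) as src, open(good_path, 'w') as good, open(bad_path, 'w') as bad:
    for line in src:
        (good if line.count(';') == 2 else bad).write(line)

df = pd.read_csv(good_path, header=None, sep=';')  # loads only the clean rows
```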
bn_ln