
I have a number of supposed CSVs, but in fact some of their rows have different numbers of fields. I would like to find out which rows these are and look at them. If the CSVs weren't broken I would just use pandas and do:

df = pd.read_csv("file.csv")

But this isn't suitable for the data cleaning and preprocessing I need to do.

How can I find the number of fields in each row of a "csv" file? Is it, for example, possible to read one row at a time, without remembering the number of fields from previous rows?
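Yes — `csv.reader` yields one parsed row at a time, so each row's width can be checked as it is read. A minimal sketch (the file contents here are made up for illustration, using the first row's width as the reference):

```python
import csv
import io

# Hypothetical sample standing in for "file.csv"; the third row has an extra field.
data = "a,b,c\n1,2,3\n4,5,6,7\n"

reader = csv.reader(io.StringIO(data))
expected = len(next(reader))  # take the header's width as the reference
bad = []
for lineno, row in enumerate(reader, start=2):
    if len(row) != expected:
        bad.append((lineno, row))
print(bad)  # [(3, ['4', '5', '6', '7'])]
```

For a real file, replace the `io.StringIO(data)` with `open("file.csv", newline="")`.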

Simd
  • Why the downvote? – Simd Jun 08 '18 at 20:54
  • You can visually get a list of all "bad" lines by calling `pd.read_csv('file.csv',error_bad_lines=False)`. I am not sure you can store it in a variable for further processing. – DYZ Jun 08 '18 at 21:07
  • [Possible duplicate](https://stackoverflow.com/questions/32334966/pandas-bad-lines-warning-capture). – DYZ Jun 08 '18 at 21:09
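Following up on the comments: in newer pandas versions (1.3+), `error_bad_lines` was replaced by `on_bad_lines`, which with `engine="python"` accepts a callable — so the bad rows can in fact be stored in a variable. A sketch with made-up data:

```python
import io
import pandas as pd

# Hypothetical sample; the third line has one field too many.
data = "a,b,c\n1,2,3\n4,5,6,7\n"

bad_rows = []
def capture(row):
    bad_rows.append(row)  # the bad line arrives as a list of strings
    return None           # returning None drops the row from the DataFrame

df = pd.read_csv(io.StringIO(data), on_bad_lines=capture, engine="python")
print(bad_rows)  # [['4', '5', '6', '7']]
```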

2 Answers


CSV is not a fully defined standard, so staying close to RFC 4180 you can do something like this:

import re

with open('file.csv', 'r') as f:
    # Mask commas inside double-quoted fields, then count the remaining
    # (field-separating) commas on each line.
    print([re.sub(r'("[^"]*),([^"]*")', r'\1<comma>\2', line).count(',')
           for line in f])

which counts the field-separating commas on each line after masking the ones enclosed in double quotes (the number of fields is that count plus one).
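To make the masking step concrete, here is what the substitution does to one hypothetical line (note that a single pass masks only one comma per quoted field; a field with several embedded commas would need the substitution applied repeatedly):

```python
import re

line = 'x,"hello, world",y'
masked = re.sub(r'("[^"]*),([^"]*")', r'\1<comma>\2', line)
print(masked)             # x,"hello<comma> world",y
print(masked.count(','))  # 2 separating commas, i.e. 3 fields
```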

Diego Torres Milano

It seems the following works.

import csv

def f(s):
    # Returns the number of fields in each line of the CSV text s.
    return map(len, csv.reader(s.split("\n")))
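As a quick sanity check, a self-contained version run on made-up sample text:

```python
import csv

def f(s):
    # Field count per line; csv.reader honors commas inside quoted fields.
    return map(len, csv.reader(s.split("\n")))

text = 'a,b,"c,d"\n1,2,3,4'
print(list(f(text)))  # [3, 4]
```

The quoted `"c,d"` is parsed as a single field, which is where this beats naive comma counting.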
Simd