
I have the following dilemma. I am parsing huge CSV files, which can theoretically contain invalid records, with Python. To be able to fix an issue quickly, I would like to see the line number in the error message. However, as I am parsing many files and errors are very rare, I do not want my error handling to add overhead to the main pipeline. That is why I would not like to use enumerate or a similar approach.

In a nutshell, I am looking for a get_line_number function to work like this:

with open('file.csv', 'r') as f:
    for line in f:
        try:
            process(line)
        except:
            line_no = get_line_number(f)
            raise RuntimeError('Error while processing line ' + str(line_no))

However, this seems to be complicated, as f.tell() will not work in this loop.
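For illustration, this is what happens when I try (CPython 3; iterating over a text file uses internal buffering, so tell() is disabled during the loop, and the exact message may differ between versions):

with open('file.csv', 'r') as f:
    for line in f:
        # On CPython 3 this raises:
        #   OSError: telling position disabled by next() call
        print(f.tell())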

EDIT:

It seems the overhead is quite significant. In my real-world case (which is painful, as the files are lists of pretty short records: single floats, int-float pairs or string-int pairs; file.csv is about 800 MB and has around 80M lines), enumerate adds about 2.5 seconds per file read. For some reason, fileinput is extremely slow.

import timeit
s = """
with open('file.csv', 'r') as f:
    for line in f:
        pass
"""
print(timeit.repeat(s, number = 10, repeat = 3))
s = """
with open('file.csv', 'r') as f:
    for idx, line in enumerate(f):
        pass
"""
print(timeit.repeat(s, number = 10, repeat = 3))
s = """
count = 0
with open('file.csv', 'r') as f:
    for line in f:
        count += 1
"""
print(timeit.repeat(s, number = 10, repeat = 3))
setup = """
import fileinput
"""
s = """
for line in fileinput.input('file.csv'):
    pass
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))

outputs

[45.790788270998746, 44.88589363079518, 44.93949336092919]
[70.25306860171258, 70.28569177398458, 70.2074502906762]
[75.43606997421011, 74.39759518811479, 75.02027251804247]
[325.1898657102138, 321.0400970801711, 326.23809849238023]

EDIT 2:

Getting closer to the real-world scenario. The try/except block is placed outside the loop to reduce the overhead.

import timeit
setup = """
def process(line):
    # return True for an outlier; the count is kept by the caller
    return float(line) < 0.5
"""
s = """
outliers = 0
with open('file.csv', 'r') as f:
    for line in f:
        outliers += process(line)
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))
s = """
outliers = 0
with open('file.csv', 'r') as f:
    try:
        for idx, line in enumerate(f):
            outliers += process(line)
    except ValueError:
        raise RuntimeError('Invalid value in line ' + str(idx + 1)) from None
"""
print(timeit.repeat(s, setup = setup, number = 10, repeat = 3))

outputs

[244.9097429071553, 242.84596176538616, 242.74369075801224]
[293.32093235617504, 274.17732743313536, 274.00854821596295]

So, in my case, the overhead from enumerate is around 10%.

Roman
  • So, I have to ask, is the problem that your example that will work runs too slowly or that you *think* it *might* run too slowly? How much impact does it actually have on perf? Have you measured the difference on a file you know has no errors? – Jared Smith Feb 15 '17 at 13:33
  • Wow, would not have expected a 2x slow down. How much impact does wrapping your `process(line)` call in `try/catch` have? – Jared Smith Feb 15 '17 at 14:34
  • Neither would I, but substituting `pass` for whatever one would really do with the data is not the fairest of comparisons. Also just ten bytes per line of a csv file is pretty unusual. – nigel222 Feb 15 '17 at 15:00
  • @nigel222: what is your platform? – Roman Feb 15 '17 at 15:23
  • You could run the whole loop inside of `try/except`, reducing the overhead. – VPfB Feb 15 '17 at 15:31
  • I think if you want to reduce overhead, your best bet is eliminating the function call to `process` and putting its code inline inside the `try` block. I just re-ran the timing test with `foo(line)` in place of `pass` with `def foo(x): pass`, and the overhead is several times greater than that of enumerate. – nigel222 Feb 15 '17 at 15:48
  • @nigel222: The question is not about boosting python code in general, but specifically about overheads caused by error handling. – Roman Feb 15 '17 at 15:55
  • Well, I'd call 10% acceptable. You can also eliminate the `try` block if you inline the `process` code. Instead, before the loop, initialize `error_list=[]`; during the loop, `error_list.append(line_no); continue` whenever an error is detected; and after the loop, print out the list of erroneous lines and abort (if the list is non-empty). – nigel222 Feb 15 '17 at 16:06
  • Sorry, maybe an incorrect assumption in the previous comment that you are parsing the CSV file and raising ValueError yourself. If you merely do something like `float(line)`, then `try ... except ValueError` will be best. – nigel222 Feb 15 '17 at 16:16
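
A minimal sketch of the error-list approach from the last two comments (assuming, as in the edits above, that each record is a single float; the 0.5 threshold and the file name are just the earlier placeholders):

outliers = 0
error_lines = []
with open('file.csv', 'r') as f:
    for line_no, line in enumerate(f, 1):
        try:
            value = float(line)
        except ValueError:
            # remember the bad line and keep going
            error_lines.append(line_no)
            continue
        if value < 0.5:
            outliers += 1

if error_lines:
    raise RuntimeError('Invalid values in lines: {}'.format(error_lines))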

3 Answers


Do use enumerate

for line_ref, line in enumerate(f):
    line_no = line_ref + 1  # enumerate starts at zero

It's not adding any significant overhead. The work involved in getting records out of a file vastly exceeds the work involved in keeping a counter, and the tuple assignment in the for statement is just a name binding, not an extra copy of the data referred to by `line`.
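
Combined with the question's second edit (the try/except outside the loop) and enumerate's optional start argument, the pattern might look like this; process() and file.csv are the question's placeholders:

with open('file.csv', 'r') as f:
    try:
        for line_no, line in enumerate(f, 1):   # count from 1
            process(line)
    except ValueError:
        raise RuntimeError('Invalid value in line {}'.format(line_no)) from None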

Update:

I made a mistake when generating my test file. I have now pretty much confirmed the first timing test added to the question.

Personally I'd regard a 10% overhead on a worst(ish)-case file with 10-byte records as completely acceptable, given that the alternative is not knowing which of 80 million records were in error.

nigel222
  • OP said he does not want to use `enumerate`. If you're going to post an answer using something he said he didn't want to use, you could at least explain *why* he'd want to use it. – Jared Smith Feb 15 '17 at 13:45
  • Use `enumerate(f, 1)` to start counting from 1. – VPfB Feb 15 '17 at 14:30

If you are sure that adding debugging info is too much overhead (I do not want to argue about that), you could implement two versions of the function: a high-performance one and one with thorough checking and verbose debugging. The basic idea is:

try:
    func_quick(args)      # fast version: no per-line bookkeeping
except Exception:
    func_verbose(args)    # slow version: thorough checks, verbose diagnostics

The drawback is that processing starts over when an error occurs. But if you have to correct the error manually, the penalty of a few wasted seconds in such a case should not hurt. Also, func_verbose() doesn't have to stop at the first error; it can check the whole file and list all errors.
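
A minimal sketch of that idea applied to this case; the function names, the re-read with enumerate and the ValueError assumption are mine, not part of the original answer, and process() is the question's placeholder:

def parse_quick(path):
    # fast path: no line counting at all
    with open(path, 'r') as f:
        for line in f:
            process(line)

def parse_verbose(path):
    # slow path: re-read the file, keep line numbers, report every error
    errors = []
    with open(path, 'r') as f:
        for line_no, line in enumerate(f, 1):
            try:
                process(line)
            except ValueError:
                errors.append(line_no)
    if errors:
        raise RuntimeError('Invalid values in lines: {}'.format(errors))

try:
    parse_quick('file.csv')
except ValueError:
    parse_verbose('file.csv')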

VPfB

The standard-library fileinput module processes large files memory-efficiently and provides a built-in line number counter. It also automatically picks up multiple filenames to read from the command-line arguments. However, there doesn't seem to be a (simple?) way to use it with context managers.
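
(That said, the fileinput documentation suggests the object returned by fileinput.input() can itself be used as a context manager since Python 3.2, so this may be less of an issue on newer versions:)

import fileinput

with fileinput.input('file.csv') as f:
    for line in f:
        process(line)   # process() is the question's placeholder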

As for performance, you'd need to test it in comparison with other approaches.

import fileinput

for line in fileinput.input():
    try:
        process(line)
    except:
        line_no = fileinput.filelineno()
        raise RuntimeError('Error while processing line ' + str(line_no))

BTW, I'd recommend catching only the relevant exception, probably a custom one; otherwise you'll mask unanticipated exceptions.
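
For instance, a sketch with a made-up CsvRecordError exception (the float() parsing is borrowed from the question's edits):

import fileinput

class CsvRecordError(ValueError):
    """Hypothetical exception raised by process() for unparsable records."""

def process(line):
    try:
        return float(line)
    except ValueError as exc:
        raise CsvRecordError(line) from exc

for line in fileinput.input('file.csv'):
    try:
        process(line)
    except CsvRecordError:
        raise RuntimeError('Error while processing line {}'
                           .format(fileinput.filelineno()))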

Simon Hibbs