I have the following dilemma. I am parsing huge CSV files, which theoretically can contain invalid records, with Python. To be able to fix an issue quickly I would like to see the line number in the error message. However, as I am parsing many files and errors are very rare, I do not want my error handling to add overhead to the main pipeline. That is why I would not like to use enumerate or a similar approach.
In a nutshell, I am looking for a get_line_number function that works like this:
with open('file.csv', 'r') as f:
    for line in f:
        try:
            process(line)
        except Exception:
            line_no = get_line_number(f)
            raise RuntimeError('Error while processing line {}'.format(line_no))
However, this seems to be complicated, as f.tell() will not work inside this loop.
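For reference, here is a minimal demonstration of the complication (on CPython 3, the text-file iterator uses an internal read-ahead buffer, so tell() is simply disabled during iteration):

with open('file.csv', 'r') as f:
    for line in f:
        # Raises OSError: telling position disabled by next() call
        f.tell()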
EDIT:
It seems like the overhead is quite significant. In my real-world case (which is painful, as the files are lists of pretty short records: single floats, int-float pairs or string-int pairs; file.csv is about 800 MB large and has around 80M lines), enumerate costs about 2.5 seconds per file read: the timings below are for 10 reads each, so roughly (70.25 - 45.79) / 10 ≈ 2.45 s extra per read. For some reason, fileinput is extremely slow.
import timeit

s = """
with open('file.csv', 'r') as f:
    for line in f:
        pass
"""
print(timeit.repeat(s, number=10, repeat=3))

s = """
with open('file.csv', 'r') as f:
    for idx, line in enumerate(f):
        pass
"""
print(timeit.repeat(s, number=10, repeat=3))

s = """
count = 0
with open('file.csv', 'r') as f:
    for line in f:
        count += 1
"""
print(timeit.repeat(s, number=10, repeat=3))

setup = """
import fileinput
"""
s = """
for line in fileinput.input('file.csv'):
    pass
"""
print(timeit.repeat(s, setup=setup, number=10, repeat=3))
outputs
[45.790788270998746, 44.88589363079518, 44.93949336092919]
[70.25306860171258, 70.28569177398458, 70.2074502906762]
[75.43606997421011, 74.39759518811479, 75.02027251804247]
[325.1898657102138, 321.0400970801711, 326.23809849238023]
EDIT 2:
Getting closer to the real-world scenario. The try-except block is outside the loop to reduce the overhead.
import timeit

setup = """
def process(line):
    # Count a record as an outlier when its value is below 0.5.
    return float(line) < 0.5
"""
s = """
outliers = 0
with open('file.csv', 'r') as f:
    for line in f:
        outliers += process(line)
"""
print(timeit.repeat(s, setup=setup, number=10, repeat=3))

s = """
outliers = 0
with open('file.csv', 'r') as f:
    try:
        for idx, line in enumerate(f):
            outliers += process(line)
    except ValueError:
        raise RuntimeError('Invalid value in line {}'.format(idx + 1)) from None
"""
print(timeit.repeat(s, setup=setup, number=10, repeat=3))
outputs
[244.9097429071553, 242.84596176538616, 242.74369075801224]
[293.32093235617504, 274.17732743313536, 274.00854821596295]
So, in my case, the overhead from enumerate is around 10%.
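Since any per-line bookkeeping seems to cost time, one idea I am considering (just a sketch; find_failing_line is a hypothetical helper, not part of my current code) is to pay for line counting only after an error, by re-scanning the file:

def find_failing_line(path, process):
    # Hypothetical helper: runs only after an error has occurred, so the
    # happy path pays no counting cost. Re-scans the file and returns the
    # 1-based number of the first line on which process() raises. Assumes
    # the file is unchanged and that process() is deterministic.
    with open(path, 'r') as f:
        for n, line in enumerate(f, start=1):
            try:
                process(line)
            except Exception:
                return n
    return -1

This keeps the main loop free of counting; the O(n) re-scan only happens in the rare error case.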