With the following Python code

import csv
log_file = open('190415190514.txt', 'r')
all_data = csv.reader(log_file, delimiter=' ')
data = []
for row in all_data:
    data.append(row)

to read a big file containing

2019-04-15 00:00:46 192.168.168.29 GET / - 443 - 192.168.168.80 Mozilla/5.0+(compatible;+PRTG+Network+Monitor+(www.paessler.com);+Windows) - 200 0 0 0

I get this error

 File "main.py", line 5, in <module>
   for row in datareader:
 File "/usr/lib/python3.6/codecs.py", line 321, in decode
   (result, consumed) = self._buffer_decode(data, self.errors, final)
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 1284: invalid start byte

I think there is no problem with the data file since it is an IIS log file. If there is an encoding issue, how can I locate the offending line? I am also not sure whether my problem is the same as this one.

mahmood

1 Answer

Since you opened the file in text mode ('r') rather than binary mode ('rb'), Python tries to decode it as UTF-8, which is the default encoding in your environment. The file apparently contains bytes that are not valid UTF-8, so you get an error. You can find the line number of each offending line like this:

with open('190415190514.txt', 'rb') as f:
    for i, line in enumerate(f):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError as e:
            print(f'{e} at line {i+1}')

You should probably pass an encoding or errors argument to open. See: https://docs.python.org/3/library/functions.html#open
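As a minimal sketch of what passing those arguments could look like: the helper name read_log_rows is hypothetical, and the choice of errors='replace' (or of 'cp1252' as an alternative encoding) is an assumption, since the real encoding of the log file is not known from the question.

```python
import csv


def read_log_rows(path, encoding="utf-8", errors="replace"):
    """Read a space-delimited log file into a list of rows.

    With errors="replace", bytes that cannot be decoded become U+FFFD
    instead of raising UnicodeDecodeError. The defaults here are an
    assumption; pick the encoding that matches your actual file.
    """
    # newline="" is what the csv module documentation recommends for files
    # handed to csv.reader.
    with open(path, "r", encoding=encoding, errors=errors, newline="") as f:
        return list(csv.reader(f, delimiter=" "))
```

Calling read_log_rows('190415190514.txt') would then read the whole file without raising, at the cost of replacing undecodable bytes. If the file turns out to be in a single-byte encoding such as Windows-1252, read_log_rows('190415190514.txt', encoding='cp1252', errors='strict') might decode it losslessly instead.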

Lawrence D'Anna