The HTTP log files I'm trying to analyze with pandas sometimes contain unexpected lines. Here's how I load my data:
df = pd.read_csv('mylog.log',
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'],
                 converters={'status': int, 'size': int, 'req_time': int})
It works fine for most of the logs I have (which come from the same server). However, upon loading some logs, an exception is raised: either
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
or
ValueError: invalid literal for int() with base 10: '"GET /agent/10577/bdl HTTP/1.1"'
For the sake of the example, here's the line that triggers the second exception:
22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] "GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" apibackend.site.fr 429282
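For what it's worth, feeding that line through the same separator regex shows why the converter chokes: the doubled client IP at the front (possibly an X-Forwarded-For artifact, that's a guess) adds an extra field, so every usecols position shifts by one and the quoted request string lands where 'status' is expected. A quick check with re.split:

```python
import re

# Same separator regex as in the read_csv call above
SEP = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

line = ('22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] '
        '"GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" '
        'apibackend.site.fr 429282')

fields = re.split(SEP, line)
print(len(fields))   # 12 fields instead of the 11 a well-formed line yields
print(fields[5])     # the request string, which usecols maps to 'status'
```

Since usecols asks for column 5 as 'status', int() is handed '"GET /agent/10577/bdl HTTP/1.1"', which is exactly the ValueError above.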
To find the number of the offending line, I used the following (terribly slow) function:
def search_error_dichotomy(path):
    borne_inf = 0
    with open(path) as log:
        borne_sup = len(log.readlines())
    while borne_sup - borne_inf > 1:
        exceeded = False
        search_index = (borne_inf + borne_sup) // 2
        try:
            pd.read_csv(path, ..., ..., nrows=search_index)
        except Exception:
            exceeded = True
        if exceeded:
            borne_sup = search_index
        else:
            borne_inf = search_index
    return search_index
What I'd like to have is something like this:

try:
    pd.read_csv(..........................)
except MyError as e:
    print(e.row_number)
where e.row_number is the number of the messy line.
Thank you in advance.
SOLUTION All credits to devssh, whose suggestion not only makes the process quicker, but also lets me get all the unexpected lines at once. Here's what I did with it:
Load the dataframe without converters.
df = pd.read_csv(path,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'])
Add an 'index' column using .reset_index().
df = df.reset_index()
Write a custom function (to be used with apply) that converts to int when possible, and otherwise saves the entry and its 'index' in a dictionary wrong_lines.
import numpy as np  # pd.np was removed in recent pandas; use numpy directly

wrong_lines = {}

def convert_int_feedback_index(row, col):
    try:
        ans = int(row[col])
    except (TypeError, ValueError):
        wrong_lines[row['index']] = row[col]
        ans = np.nan
    return ans
Use apply on the columns I want to convert (e.g. col = 'status', 'size', or 'req_time'):
df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)
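As a side note, the same bad-row report can be had without a Python-level apply: pd.to_numeric with errors='coerce' converts a whole column at once and leaves NaN wherever conversion failed, so comparing before and after pinpoints the messy rows. A sketch on a toy column (the DataFrame here is made up for illustration):

```python
import pandas as pd

# Toy stand-in for the parsed log: row 1 carries a shifted field
df = pd.DataFrame({'status': ['204', '"GET /agent/10577/bdl HTTP/1.1"', '404']})

converted = pd.to_numeric(df['status'], errors='coerce')
# Rows that were non-empty yet failed to convert are the messy lines
wrong = df['status'][converted.isna() & df['status'].notna()]
print(wrong.to_dict())   # {1: '"GET /agent/10577/bdl HTTP/1.1"'}
```

On a big log this avoids one Python function call per row; df['status'] = converted then replaces the column in one go.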