I have to read a Tomcat access log that has lines like:
[06/Nov/2020:00:43:04 +0200] /wsi/services/ServicesReadRequest 2265 10.101.101.101 200 21
I am trying to read the file as CSV, setting all columns to string type:
import pandas as pd
headers = ['Timestamp', 'Command', 'IPAddr', 'Blank01', 'Blank02',
           'Bytes', 'HTTPResult', 'ElapsedTime']
dtypes = {'Timestamp': 'str', 'Command': 'str', 'IPAddr': 'str', 'Blank01': 'str',
          'Blank02': 'str', 'Bytes': 'str', 'HTTPResult': 'str', 'ElapsedTime': 'str'}
df = pd.read_csv(fpath, delimiter=' ', header=None, names=headers,
                 dtype=dtypes, warn_bad_lines=True, error_bad_lines=False)
What happens is that the square brackets around the timestamp seem to be handled specially by pandas:
df['Timestamp'].head()
shows:
[06/Nov/2020:00:43:04 +0200] /wsi/services/ServicesReadRequest
If I try to slice the string, it looks like the part with the square brackets is ignored:
df["Timestamp"].apply(lambda x: x[1:6]).head()
results in:
[06/Nov/2020:00:43:04 +0200] /wsi/s
If I remove the square brackets manually, it works as expected (although the time zone gets separated from the timestamp, because there is a space between them). How can I parse the file without any pre-processing? Is there an alternative to read_csv that does not have such side effects?
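One possible alternative to read_csv, given only as a minimal sketch: match each line against a regular expression and build the records by hand. It assumes every line follows the layout of the sample line above (a bracketed timestamp followed by five whitespace-separated fields). The name parse_access_log is hypothetical, and the mapping of the two unlabeled numbers (2265 and 21) to Bytes and ElapsedTime is a guess, not something the log format confirms.

```python
import re
from datetime import datetime

# Assumed layout, taken from the single sample line in the question:
# [timestamp] command number ip status number
# Mapping 2265 -> Bytes and 21 -> ElapsedTime is a guess.
LINE_RE = re.compile(
    r'\[(?P<Timestamp>[^\]]+)\]\s+'
    r'(?P<Command>\S+)\s+'
    r'(?P<Bytes>\S+)\s+'
    r'(?P<IPAddr>\S+)\s+'
    r'(?P<HTTPResult>\S+)\s+'
    r'(?P<ElapsedTime>\S+)'
)

# Month abbreviations (%b) are locale-dependent; this assumes English names.
TIMESTAMP_FORMAT = '%d/%b/%Y:%H:%M:%S %z'


def parse_access_log(lines):
    """Return one dict per line that matches the assumed layout."""
    records = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        row = match.groupdict()  # all values are strings at this point
        # Keep the time zone attached to the timestamp instead of splitting on the space.
        row['Timestamp'] = datetime.strptime(row['Timestamp'], TIMESTAMP_FORMAT)
        records.append(row)
    return records


sample = [
    '[06/Nov/2020:00:43:04 +0200] /wsi/services/ServicesReadRequest 2265 10.101.101.101 200 21',
]
records = parse_access_log(sample)
```

The resulting list of dicts can be passed to pd.DataFrame(records) to get a frame with one column per named group. For a real file, an open file object can be passed in place of the sample list, since the function only iterates over lines.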