0

I have to read a tomcat access-log that has lines like:

    [06/Nov/2020:00:43:04 +0200] /wsi/services/ServicesReadRequest  2265 10.101.101.101 200 21

trying to read the file as csv, setting all columns as string type

    import pandas as pd 

    headers = ['Timestamp', 'Command', 'IPAddr', 'Blank01', 'Blank02', 
       'Bytes', 'HTTPResult', 'ElapsedTime']
    dtypes = {'Timestamp': 'str',  'Command': 'str', 'IPAddr': 'str', 'Blank01' : 'str', 
       'Blank02' : 'str', 'Bytes': 'str', 'HTTPResult': 'str', 'ElapsedTime': 'str'} 

    df = pd.read_csv(fpath, delimiter=' ', header=None, names=headers, 
       dtype=dtypes, warn_bad_lines=True, error_bad_lines=False)

What happens is the square brackets around the timestamp are handled specially by pandas

    df['Timestamp'].head()

shows:

    [06/Nov/2020:00:43:04 +0200] /wsi/services/ServicesReadRequest

if I try to cut the string, it looks like the part with the squared bracket is ignored

    df["Timestamp"].apply(lambda x: x[1:6]).head()

results:

    [06/Nov/2020:00:43:04 +0200] /wsi/s

if I remove the square brackets manually, then it works as expected (although the time zone gets separated from the timestamp, but that is because it has a space between). Now the question is how to parse the file without any pre-processing? Is there an alternative to read_csv, that does not include such side-effects?

  • Sounds like your `Timestamp` column has inadvertently become your DataFrame's index. Try adding the argument `index_col=False` to your `pd.read_csv` call. – jfaccioni Mar 01 '21 at 18:58
  • I am getting an unknown string type with the format using pd.to_datetime. have you tried parsing using strptime – Golden Lion Mar 02 '21 at 15:19
  • Thanks! the index_col=False solved the issue. I had to add more columns to get it working, but it working now as expected . – Tamas Deak Mar 03 '21 at 07:47

2 Answers2

0

I parsed the string in the datetime components using a dictionary and regular expression substitution. see(can you write a str.replace() using dictionary values in Python?) then converted the string into a datetime then added the timedelta

 data=["06/Nov/2020:00:43:04 +0200"]
 df=pd.DataFrame(data,columns=['date'])
 def MonthToNum(val):
      dictMonth={'Jan':'1', 'Feb':'2','Mar':'3','Apr':'4','May':'5','Jun':'6','Jul':'7','Aug':'8','Sep':'9','Oct':'10','Nov':'11','Dec':'12'}
      pattern = '|'.join(sorted(re.escape(k) for k in dictMonth))
      retval=re.sub(pattern, lambda m: dictMonth.get(m.group(0)), val, flags=re.IGNORECASE)
     return retval

 df['date']=df['date'].apply(lambda x: pd.to_datetime(MonthToNum(x[0:11])+" "+x[12:20]) + timedelta(int(x[20:24])))
 print(df)
 print(type(df['date']))

output:

  date
0 2020-06-13 00:43:04
Golden Lion
  • 3,840
  • 2
  • 26
  • 35
0

The solution was to add index_col=False to the read_csv command and add some more colums, the timestamp can be converted to datetime

    headers = ['Timestamp', 'Timezone', 'Command', 'Blank01', 'IPAddr', 
       'Blank02', 'Blank03', 'Bytes', 'HTTPResult', 'ElapsedTime']
    dtypes = {'Timestamp': 'str', 'Timezone' : 'str', 'Command': 'str', 
       'Blank01' : 'str', 'IPAddr': 'str', 'Blank02' : 'str', 'Blank03' : 'str', 
       'Bytes': 'str', 'HTTPResult': 'str', 'ElapsedTime': 'str'}

    df = pd.read_csv(fpath, delimiter=' ', index_col=False, header=None, 
       names=headers, dtype=dtypes, warn_bad_lines=True, error_bad_lines=False)
       
    df['Timestamp'] =  pd.to_datetime(df['Timestamp'], format='[%d/%b/%Y:%H:%M:%S', 
       errors='coerce')
    idx4 = df['Timestamp'].isna();