I have these log lines:

5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834
11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391
124.137.187.175 - - [21/Jun/2019:15:46:22 -0700] "DELETE /expedite/exploit/cultivate/web-enabled HTTP/1.0" 403 2606
203.36.55.39 - collins6322 [21/Jun/2019:15:46:23 -0700] "PATCH /efficient/productize/disintermediate HTTP/1.1" 504 13377
175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660
232.220.131.214 - - [21/Jun/2019:15:46:25 -0700] "GET /wireless/matrix/synergistic/expedite HTTP/1.1" 205 15081
87.234.209.125 - labadie6990 [21/Jun/2019:15:46:26 -0700] "GET /unleash/aggregate HTTP/2

and I need to parse each line into a dictionary like this one, collecting all of them in a list:

example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

This is what I have done:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata
    
partes = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # indent %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

pattern = re.compile(r'\s+'.join(partes)+r'\s*\Z')

log_data = []

for line in logs():
  log_data.append(pattern.match(line).groupdict())
    
print (log_data)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-029948b6e367> in <module>
     23 # Get components from each line of the log file into a structured dict
     24 for line in logs():
---> 25   log_data.append(pattern.match(line).groupdict())
     26 
     27 

AttributeError: 'NoneType' object has no attribute 'groupdict'

This error is obviously because the regex is wrong, but I am not sure why. The code is taken from here:

https://gist.github.com/sumeetpareek/9644255

Update:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata

regex="^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$"

log_data = []

for line in logs():
    m = pattern.match(line)
    log_data.append(re.findall(regex, line).groupdict())
    
print (log_data)

But I get this error: unexpected character after line continuation character

Update 2:

When adding the items to a dictionary, they must come out in this format:

assert len(logs()) == 979

one_item={'host': '146.204.224.152',
  'user_name': 'feest6811',
  'time': '21/Jun/2019:15:45:24 -0700',
  'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"
Luis Valencia

1 Answer

Since there are a lot of issues with the solution you have, please consider revamping it completely.
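To make a few of those issues concrete: `logs()` returns the whole file as one string, so `for line in logs()` walks over single characters rather than lines, and `pattern.match` on a single character returns `None`, which is where the `AttributeError` comes from. The pattern taken from the gist also requires quoted referrer and user-agent fields, which these lines do not have. In the Update, the double quotes inside the double-quoted string literal end the string early, which is what produces the line continuation error, and `re.findall` returns a list, which has no `groupdict`. Below is a minimal sketch of the first point. The `logdata` string is a stand-in for the file contents and `host_only` is a deliberately tiny illustrative pattern, not the full one.

```python
import re

# Stand-in for the text returned by file.read(); two sample lines from the question.
logdata = (
    '5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834\n'
    '175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660\n'
)

# Iterating over a str yields single characters, not lines.
first_items = [item for item in logdata][:3]

# splitlines() yields one entry per log line instead.
lines = logdata.splitlines()

# re.match returns None when nothing matches, so guard before calling groupdict().
host_only = re.compile(r'(?P<host>\S+) ')  # illustrative pattern, host field only
matches = [m.groupdict() for m in map(host_only.match, lines) if m]

print(first_items)
print(matches)
```

Iterating over the open file object, as in the code further down, avoids the problem in the same way as `splitlines()`.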

The regex that should work for you is

^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<method>\S+) +(?P<path>\S+) +(?P<protocol>\S+)" +(?P<status>\d{3}|-) +(?P<size>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$

See the regex demo. Note that the trailing (?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)? part matches two optional quoted fields, and the second one can only match if the first one did. The three groups inside the quotes capture the request method, path and protocol separately; they are followed by the status code and the response size.

The code you can leverage may look like

import re

pattern = re.compile(r'''^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})] +"(?P<method>\S+) +(?P<path>\S+) +(?P<protocol>\S+)" +(?P<status>\d{3}|-) +(?P<size>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$''')

log_data = []

with open("assets/logdata.txt", "r") as file:
  for line in file:
    m = pattern.search(line.strip())
    if m:
      log_data.append(m.groupdict())

print(log_data)
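The optional referrer and user-agent part at the end of the pattern can be checked in isolation. The sketch below keeps only the size field and the optional tail. The name `tail` and the sample referrer and agent strings are made up for illustration, since the lines in the question contain neither field.

```python
import re

# Size field followed by the nested optional groups from the end of the pattern.
tail = re.compile(
    r'^(?P<size>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$'
)

# Hypothetical inputs: no extra fields, referrer only, referrer plus agent.
no_extras = tail.search('15834').groupdict()
with_referrer = tail.search('15834 "http://example.com/"').groupdict()
with_both = tail.search('15834 "http://example.com/" "Mozilla/5.0"').groupdict()

print(no_extras)
print(with_referrer)
print(with_both)
```

Groups that did not take part in the match come back as `None` in `groupdict()`, so lines without those fields still parse.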

See the Python demo
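If the result has to match the format in Update 2, with the whole request kept as one value and the user stored under `user_name`, a shorter pattern may be enough. The following is only a sketch under that assumption: the names `LOG_PATTERN` and `parse_lines` are mine, the pattern ignores everything after the quoted request, and I have not run it against the full `assets/logdata.txt`, so the 979 count is not verified here.

```python
import re

# Only the four fields named in Update 2; status and size are left unparsed.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) +\S+ +(?P<user_name>\S+) +'
    r'\[(?P<time>[^\]]+)\] +'
    r'"(?P<request>[^"]*)"'
)

def parse_lines(lines):
    """Return one dict per line that matches LOG_PATTERN; skip the rest."""
    entries = []
    for line in lines:
        m = LOG_PATTERN.search(line.strip())
        if m:
            entries.append(m.groupdict())
    return entries

def logs(path="assets/logdata.txt"):
    # Iterating over the file object yields lines, not characters.
    with open(path, "r") as file:
        return parse_lines(file)

# Two complete lines from the question, used instead of the real file.
sample = [
    '5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834',
    '11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391',
]
parsed = parse_lines(sample)
print(parsed)
```

A missing user name stays as the literal `'-'` from the log line. Whether the exercise expects that value is something to check against its own tests.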

Wiktor Stribiżew
  • Thank you, but this only returns an array of empty objects: [{}, {}, {}, etc... – Luis Valencia Jan 25 '21 at 10:06
  • @LuisValencia Put back the named groups, and it will. See https://ideone.com/rQVSfs, I am just not sure what the group names must be. – Wiktor Stribiżew Jan 25 '21 at 10:09
  • There is one caveat that I still don't know how to fix: the regex works perfectly fine, but it splits the request into several groups, whereas for the exercise the request should be a single group (see Update 2). – Luis Valencia Jan 25 '21 at 10:24
  • @LuisValencia Then why complicate the pattern in the first place? Use `r'''^(?P\S+) +\S+ +(?P\S+) +\[(?P – Wiktor Stribiżew Jan 25 '21 at 10:39