I have written a function that uses a regex to extract specific fields from a text file. The code works, but it only prints each match. I would like to get the results back as dictionaries instead, and the output should have length 979:

import re

def logs():
    with open("C:/Users/ASUS/Desktop/logdata.txt", "r") as file:
        logdata = file.read()

    pattern = r'''
    (?P<host>\d{1,}\.\d{1,}\.\d{1,}\.\d{1,})    # host name
    \s+\S+\s+
    (?P<user_name>(?<=-\s)(\w+|-)(?=\s))\s+\[   # user_name
    (?P<time>([^[]+))\]\s+"                     # time
    (?P<request>[^"]+)"                         # request
    '''

    for item in re.finditer(pattern, logdata, re.VERBOSE):
        print(item.groupdict())

This function is supposed to turn a line like this:

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622

into this, capturing host, user_name, and so on:

{"host":"146.204.224.152", 
 "user_name":"feest6811", 
 "time":"21/Jun/2019:15:45:24 -0700",
 "request":"POST /incentivize HTTP/1.1"}

How can I do this?

Anoushiravan R

2 Answers

Just use groupdict() directly:

import re 

def rtr_dict(txt):
  pattern = r'''
  (?P<host>\d{1,}\.\d{1,}\.\d{1,}\.\d{1,})   # host name
  \s+\S+\s+
  (?P<user_name>(?<=-\s)(\w+|-)(?=\s))\s+\[   # user_name
  (?P<time>([^[]+))\]\s+"   # time
  (?P<request>[^"]+)"   # request
  '''
  
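  # The := operator needs Python 3.8+. re.match returns None when txt does
  # not fit the pattern, in which case this function returns None as well.
  if m := re.match(pattern, txt, flags=re.VERBOSE):
    return m.groupdict()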

tgt='146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'


>>> rtr_dict(tgt)
{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}
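
One thing to keep in mind: re.match returns None when a line does not fit the pattern, so rtr_dict returns None for such input. Below is a minimal sketch of a caller that guards against this. The names LOG_PATTERN and parse_line and the non-matching sample string are made up for illustration, and the time group is written as [^\]]+ (stop at the closing bracket), a small variation on the pattern above:

```python
import re

# Same fields as the pattern above, compiled once so it can be reused.
# Assumption: the time group stops at the closing bracket ([^\]]+).
LOG_PATTERN = re.compile(r'''
    (?P<host>\d{1,}\.\d{1,}\.\d{1,}\.\d{1,})     # host name
    \s+\S+\s+
    (?P<user_name>(?<=-\s)(\w+|-)(?=\s))\s+\[    # user_name
    (?P<time>[^\]]+)\]\s+"                       # time
    (?P<request>[^"]+)"                          # request
    ''', re.VERBOSE)


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None


good = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
bad = 'this is not a log line'  # hypothetical non-matching input

for line in (good, bad):
    result = parse_line(line)
    if result is None:
        print('no match:', line)
    else:
        print(result['host'], result['user_name'])
```

Checking for None explicitly keeps a malformed line from raising an AttributeError further down.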

Follow-up from the comments: how can this be done for more than one line, the way the for loop in the question does it?

Given:

tgt='''146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
146.204.224.153 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4623
146.204.224.154 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4624'''

If you have more than one match, you can return a list of dicts:

def rtr_dict(txt):
  pattern = r'''
  (?P<host>\d{1,}\.\d{1,}\.\d{1,}\.\d{1,})   # host name
  \s+\S+\s+
  (?P<user_name>(?<=-\s)(\w+|-)(?=\s))\s+\[   # user_name
  (?P<time>([^[]+))\]\s+"   # time
  (?P<request>[^"]+)"   # request
  '''
  
  return [m.groupdict() for m in re.finditer(pattern, txt, flags=re.VERBOSE)]

>>> rtr_dict(tgt)
[{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}, {'host': '146.204.224.153', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}, {'host': '146.204.224.154', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}]
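
For readers new to list comprehensions: the return line is shorthand for building the list with an explicit loop and append. Here is a small sketch of the equivalence. To keep it short it uses a simplified two-field pattern and two sample lines; SIMPLE_PATTERN, with_loop, and with_comprehension are illustrative names, not part of the answer above:

```python
import re

# Simplified pattern for illustration only: just host and user_name.
SIMPLE_PATTERN = re.compile(
    r'(?P<host>\d+\.\d+\.\d+\.\d+)\s+\S+\s+(?P<user_name>\S+)'
)

sample = '''146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
146.204.224.153 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4623'''


def with_loop(txt):
    # Explicit version: start with an empty list and append one dict per match.
    results = []
    for m in SIMPLE_PATTERN.finditer(txt):
        results.append(m.groupdict())
    return results


def with_comprehension(txt):
    # Same result in one expression: "give me m.groupdict() for every match m".
    return [m.groupdict() for m in SIMPLE_PATTERN.finditer(txt)]


print(with_loop(sample) == with_comprehension(sample))
print(len(with_comprehension(sample)))
```

Both functions return one dict per match, so len() of the result is the number of parsed entries.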

Or use a generator:

def rtr_dict(txt):
  pattern = r'''
  (?P<host>\d{1,}\.\d{1,}\.\d{1,}\.\d{1,})   # host name
  \s+\S+\s+
  (?P<user_name>(?<=-\s)(\w+|-)(?=\s))\s+\[   # user_name
  (?P<time>([^[]+))\]\s+"   # time
  (?P<request>[^"]+)"   # request
  '''
  
  for m in re.finditer(pattern, txt, flags=re.VERBOSE):
    yield m.groupdict()

>>> list(rtr_dict(tgt))
# same list of dicts...
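
One reason to prefer the generator is that matches are produced one at a time, so a caller can stop early or count entries without holding every dict in memory. Here is a sketch under assumptions: iter_logs and PATTERN are hypothetical names, and the pattern is shortened to two fields for brevity:

```python
import re
from itertools import islice

# Shortened two-field pattern, for illustration only.
PATTERN = re.compile(
    r'(?P<host>\d+\.\d+\.\d+\.\d+)\s+\S+\s+(?P<user_name>\S+)'
)


def iter_logs(txt):
    """Yield one dict per match; nothing is parsed until the caller iterates."""
    for m in PATTERN.finditer(txt):
        yield m.groupdict()


tgt = '''146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
146.204.224.153 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4623
146.204.224.154 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4624'''

# Take only the first two entries; the third line is never parsed here.
first_two = list(islice(iter_logs(tgt), 2))

# Count all entries without building a list of dicts.
total = sum(1 for _ in iter_logs(tgt))

print(first_two)
print(total)
```

A generator can only be consumed once, which is why iter_logs(tgt) is called a second time for the count.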
dawg
  • Thank you very much. The output is exactly what I was looking for. Could you please tell me how to make it work for more than one line, the way I used a for loop? – Anoushiravan R Jan 09 '22 at 23:59
  • Updated: Use a list comprehension or generator... – dawg Jan 10 '22 at 00:14
  • This is perfect. Thank you very much indeed. I am very new to these concepts. I would highly appreciate it if you could add some notes on the list comprehension part. I will learn about them in time, however. Thank you again :) – Anoushiravan R Jan 10 '22 at 00:19
It's late, but this verbose regex will also work (it returns a list of dictionaries):

import re
def logs():
    with open("C:/Users/ASUS/Desktop/logdata.txt", "r") as file:
        logdata = file.read()
    
    pattern = """
    (?P<host>[\d\.]*)       #IP host
    (\ -\ )                 #followed by 
    (?P<user_name>[\w-]*)   #user name
    (\ *\[)                 #followed by 
    (?P<time>[^\]]*)        #time
    (\]\ *")                #followed by 
    (?P<request>[^\"]*)     #request"""

    return [item.groupdict() for item in re.finditer(pattern, logdata, re.VERBOSE)]
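
To check the result against an expected count (979 in the question), the function can take the path as a parameter and the caller can compare len() of the returned list. The sketch below rests on several assumptions: the path parameter, the temporary file, and the three sample lines are stand-ins for the real logdata.txt, and the quantifiers are tightened from * to + so that empty fields do not match:

```python
import os
import re
import tempfile

# Assumption: + instead of * on each field, so empty fields are rejected.
PATTERN = re.compile(r"""
    (?P<host>[\d.]+)        # IP host
    \ -\                    # separator between host and user name
    (?P<user_name>[\w-]+)   # user name
    \ *\[                   # opening bracket of the timestamp
    (?P<time>[^\]]+)        # time
    \]\ *"                  # closing bracket, then the opening quote
    (?P<request>[^"]+)      # request
    """, re.VERBOSE)


def logs(path):
    """Read the whole file and return one dict per matching log entry."""
    with open(path, "r") as file:
        logdata = file.read()
    return [m.groupdict() for m in PATTERN.finditer(logdata)]


# Stand-in data: three lines written to a temporary file.
sample = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '146.204.224.153 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4623\n'
    '146.204.224.154 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4624\n'
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

try:
    entries = logs(tmp_path)
finally:
    os.remove(tmp_path)

expected = 3  # this would be 979 for the file described in the question
print(len(entries) == expected)
```

If the count comes out lower than expected, the missing lines are the ones the pattern failed to match, which is usually the first thing to inspect.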
AnilGoyal