Make dictionary from txt file using re

Question

Consider the standard web log file in assets/logdata.txt. This file records the access a user makes when visiting a web page (like this one!). Each line of the log has the following items:

a host (e.g., '146.204.224.152')
a user_name (e.g., 'feest6811' note: sometimes the user name is missing! In this case, use '-' as the value for the username.)
the time a request was made (e.g., '21/Jun/2019:15:45:24 -0700')
the post request type (e.g., 'POST /incentivize HTTP/1.1' note: not everything is a POST!)

Your task is to convert this into a list of dictionaries, where each dictionary looks like the following:

example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

This is a sample of the text data file:

[image: sample of the text file]

I wrote this code:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        #print(logdata)
        pattern="""
        (?P<host>.*)        
        (-\s)   
        (?P<user_name>\w*)  
        (\s) 
        ([POST]*)
        (?P<time>\w*)               
                 """
        for item in re.finditer(pattern,logdata,re.VERBOSE):
            print(item.groupdict())
        return(item)
logs()

It helped me extract "host" and "user_name", but I can't work out how to capture the rest of the required fields. Can anyone help, please? This is what I have done so far.

score 5 · Answer 1 · answered Sep 29 '20 at 06:09

Try this, my friend:

import re


def logs():
    logs = []
    w = r'(?P<host>(?:\d+\.){3}\d+)\s+(?:\S+)\s+(?P<user_name>\S+)\s+\[(?P<time>[-+\w\s:/]+)\]\s+"(?P<request>.+?.+?)"'
    with open("assets/logdata.txt", "r") as f:
        logdata = f.read()
    for m in re.finditer(w, logdata):
        logs.append(m.groupdict())
    return logs

Abd-elrhman Mohey


what do '.+?.+?' mean at the end please? – Bluetail Jan 16 '21 at 23:58
here you go my friend https://stackoverflow.com/questions/32792851/what-does-mean-in-regex – Abd-elrhman Mohey Feb 22 '21 at 17:49
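
For readers with the same question: `.+?` is a lazy (non-greedy) quantifier, so it matches as few characters as it can, while a plain `.+` matches as many as it can. Writing it twice, as in `.+?.+?`, only requires at least two characters and otherwise behaves like a single `.+?`. A small sketch of the difference, using a made-up sample string that is not taken from the log file:

```python
import re

# Hypothetical sample text, made up for illustration only.
text = '"GET /a HTTP/1.1" 200 "Mozilla"'

# Greedy: .+ takes as much as possible, so it runs to the last quote.
greedy = re.search(r'"(.+)"', text).group(1)

# Lazy: .+? takes as little as possible, so it stops at the first closing quote.
lazy = re.search(r'"(.+?)"', text).group(1)

# Doubling the lazy quantifier gives the same result here.
doubled = re.search(r'"(.+?.+?)"', text).group(1)

print(greedy)   # GET /a HTTP/1.1" 200 "Mozilla
print(lazy)     # GET /a HTTP/1.1
print(doubled)  # GET /a HTTP/1.1
```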

n1colas.m · Answer 2 · 2020-09-29T01:45:18.440

You're using \w to capture the user name, but \w doesn't match -, which can appear in that field of the log (see Common Log Format (CLF)). As an alternative, you could use \S+ (one or more non-whitespace characters). For the time, you can create a capturing group that allows only the characters expected in that field (\w and \s, - and + for the timezone, / for the date and : for the time), surrounded by literal square brackets. A similar capturing group, surrounded by literal double quotes, works for the request.

import re

regex = re.compile(
    r'(?P<host>(?:\d+\.){3}\d+)\s+'
    r'(?:\S+)\s+'
    r'(?P<user_name>\S+)\s+'
    r'\[(?P<time>[-+\w\s:/]+)\]\s+'
    r'"(?P<request>POST.+?)"'
)

def logs():
    data = []
    with open("sample.txt", "r") as f:
        logdata = f.read()
    for m in regex.finditer(logdata):
        data.append(m.groupdict())
    return data

print(logs())

(For testing, the second line is a copy of the first with the user_name replaced by "-".)

[
   {
      "host":"146.204.224.152",
      "user_name":"feest6811",
      "time":"21/Jun/2019:15:45:24 -0700",
      "request":"POST /incentivize HTTP/l.l"
   },
   {
      "host":"146.204.224.152",
      "user_name":"-",
      "time":"21/Jun/2019:15:45:24 -0700",
      "request":"POST /incentivize HTTP/l.l"
   },
   {
      "host":"144.23.247.108",
      "user_name":"auer7552",
      "time":"21/Jun/2019:15:45:35 -0700",
      "request":"POST /extensible/infrastructures/one-to-one/enterprise HTTP/l.l"
   },
    ...

That's great it works, but if i want to specify request to print only request:"POST", what should i do? — Ahmed Sharshar, Sep 26 '20 at 17:07
@ahmed-sharshar You have to add `POST` to the last capturing group. I edited the answer. — n1colas.m, Sep 26 '20 at 17:31
It seems right, however, some mistake happened using this assert: ``` assert len(logs()) == 979 one_item={'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'} assert one_item in logs(), "Sorry, this item should be in the log results, check your formating" ``` — Ahmed Sharshar, Sep 26 '20 at 23:28
@ahmed-sharshar I have edited the question following [Common Log Format (CLF)](https://httpd.apache.org/docs/1.3/logs.html) description, see if works now. — n1colas.m, Sep 29 '20 at 01:55
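
The assert in the comments expects 979 entries, which suggests that every request method has to match, not only POST. Below is a minimal sketch of one way to check a pattern before running it on the real file. It assumes each log line ends with a status code and a size after the quoted request. The hosts, user names, times and requests are copied from this thread, but the trailing numbers are invented, and the second line's user name is replaced with "-" to mimic a missing user.

```python
import re

# Illustrative lines only: the trailing status/size numbers are assumptions,
# and the second line's user name was replaced with '-' on purpose.
sample = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622\n'
    '197.109.77.178 - - [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554\n'
    '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645\n'
)

pattern = re.compile(
    r'(?P<host>(?:\d+\.){3}\d+)\s+'   # dotted IPv4 address
    r'\S+\s+'                         # identity field, skipped
    r'(?P<user_name>\S+)\s+'          # user name, or '-' when missing
    r'\[(?P<time>[^\]]+)\]\s+'        # everything between [ and ]
    r'"(?P<request>[^"]+)"'           # everything between the quotes, any method
)

entries = [m.groupdict() for m in pattern.finditer(sample)]
print(len(entries))  # 3
print(entries[1]["user_name"])  # -
```

Running the same kind of check against the first few lines of the real file can show quickly whether a pattern is dropping non-POST requests.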

score 1 · Answer 3 · answered Sep 29 '20 at 12:30

Please see the code below:

import re

regex = re.compile(
    r'(?P<host>(?:\d+\.){1,3}\d+)\s+-\s+'
    r'(?P<user_name>[\w+\-]+)?\s+'
    r'\[(?P<time>[-\w\s:/]+)\]\s+'
    r'"(?P<request>\w+.+?)"'
)

def logs():
    data = []
    with open("assets/logdata.txt", "r") as f:
        logdata = f.read()
        for item in regex.finditer(logdata):
            x = item.groupdict()
            if x["user_name"] is None:
                x["user_name"] = "-"
            data.append(x)
    return data

logs()

Part of the output is shown below:

[{'host': '146.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}, {'host': '197.109.77.178', 'user_name': 'kertzmann3129', 'time': '21/Jun/2019:15:45:25 -0700', 'request': 'DELETE /virtual/solutions/target/web+services HTTP/2.0'}, {'host': '156.127.178.177', 'user_name': 'okuneva5222', 'time': '21/Jun/2019:15:45:27 -0700', 'request': 'DELETE /interactive/transparent/niches/revolutionize HTTP/1.1'}, {'host': '100.32.205.59', 'user_name': 'ortiz8891', 'time': '21/Jun/2019:15:45:28 -0700', 'request': 'PATCH /architectures HTTP/1.0'}, {'host': '168.95.156.240', 'user_name': 'stark2413', 'time': '21/Jun/2019:15:45:31 -0700', 'request': 'GET /engage HTTP/2.0'}, ...]

The full list contains 979 dictionaries, one for each line of the text file.

Thank you

score 1 · Answer 4 · answered Jan 15 '21 at 16:31

import re

def logs():
    mydata = []
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # Raw string; with re.VERBOSE the line breaks inside the pattern are ignored.
    pattern = r"""
    (?P<host>.*)
    (\s+)
    (?:\S+)
    (\s+)
    (?P<user_name>\S+)
    (\s+)
    \[(?P<time>.*)\]
    (\s)
    "(?P<request>.*)"   # quotes kept outside the group so they are not captured
    """
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        mydata.append(item.groupdict())
    return mydata

Niloy Chatterjee


Welcome to Stack Overflow. Code dumps without any explanation are rarely helpful. Stack Overflow is about learning, not providing snippets to blindly copy and paste. Please [edit] your question and explain how it works better than what the OP provided. Also, indentation is important in Python and this code will generate an `IndentationError`. See [answer]. – ChrisGPT was on strike Jan 16 '21 at 01:14

score 0 · Answer 5 · answered Jan 03 '22 at 04:23

import re

def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()

    result = []
    # '.' does not match newlines, so each match stays within one log line.
    pattern = re.compile(
        r'(?P<host>.*)\s'
        r'(?:-)\s'
        r'(?P<user_name>.*)\s'
        r'\[(?P<time>.*)\]\s'
        r'"(?P<request>.*)"')
    for m in pattern.finditer(logdata):
        result.append(m.groupdict())
    return result

Phaneesha Chilaveni

Could you please add a bit of explanation along with your code to describe why/how it works? – Anurag A S Jan 03 '22 at 04:28
Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 03 '22 at 04:28

score 0 · Answer 6 · answered Apr 06 '23 at 05:23

import re

def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    lst = []
    for i in re.finditer(r'(?P<host>\S+) - (?P<user_name>\S+) \[(?P<time>.+)\] "(?P<request>.+)"', logdata):
        lst.append(i.groupdict())
    return lst

Cristhian Ivan Cifuentes Guzma

Please add some explanation for your code rather than posting code only. Additional explanation will be more helpful. – user67275 Apr 06 '23 at 10:46

Vijayalakshmi Ramesh · Answer 7 · 2021-06-16T10:20:42.710

import re

def logs():
    entries = []  # renamed from 'dict' to avoid shadowing the built-in
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # Raw string so the backslashes reach the regex engine unchanged.
    # With re.VERBOSE, plain whitespace is ignored; '\ ' is a literal space.
    pattern = r"""(?P<host>[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)
                (\ -\ )
                (?P<user_name>(\w*)(\S))
                (\ \S)
                (?P<time>\d+\S\w*\S\d+\S\d+\S\d+\S\d+\s\S\d+)
                (\S\s\S)
                (?P<request>\w*\s\S*\s\w*\S\d.\d*)
             """
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        entries.append(item.groupdict())
    return entries

#1 import re module to use regex

#2 define function

#3 open the required file

#4 read the file

#5 write down the pattern of your required string. For more detailed information regarding regex, read the documentation. https://docs.python.org/3/library/re.html#module-re
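
A note on step #5: with `re.VERBOSE`, unescaped whitespace inside the pattern is ignored, which is why the pattern above writes `\ ` or `\s` wherever the log line contains a space. The sketch below illustrates this with a shortened, hypothetical line holding only the host and user name. The names `broken` and `fixed` are made up for this example.

```python
import re

line = "146.204.224.152 - feest6811"  # shortened sample, for illustration

# Under re.VERBOSE the plain spaces around '-' are ignored, so this
# pattern cannot match the spaces that are actually in the line.
broken = re.compile(r"""
    (?P<host>\S+) - (?P<user_name>\S+)
""", re.VERBOSE)

# Escaping the spaces (or using \s) makes them significant again.
fixed = re.compile(r"""
    (?P<host>\S+)       # client address
    \ -\                # literal ' - ' separator
    (?P<user_name>\S+)  # user name
""", re.VERBOSE)

print(broken.match(line))             # None
print(fixed.match(line).groupdict())  # {'host': '146.204.224.152', 'user_name': 'feest6811'}
```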

Welcome to Stack Overflow. Code dumps without any explanation are rarely helpful. Stack Overflow is about learning, not providing snippets to blindly copy and paste. Please edit your question and explain how it works better than what the OP provided. See [How to Answer](https://stackoverflow.com/help/how-to-answer). — jrook, Jun 15 '21 at 21:02
