I'm taking this course on Coursera, and I'm running into some issues with the first assignment. The task is to use regular expressions to extract certain values from the given file. The function should then output a dictionary containing these values:

example_dict = {"host":"146.204.224.152",
                "user_name":"feest6811",
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

This is just a sample of the file. For some reason, the link doesn't work unless it's opened directly from Coursera. I apologize in advance for the bad formatting. One thing I must point out is that in some cases, as you can see in the first example, there's no username; '-' is used instead.

159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149 

This is what I currently have. However, the output is None. I guess there's something wrong with my pattern.

import re
def logs():
    
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # YOUR CODE HERE
        
        pattern = """ 
        (?P<host>\w*)
        (\d+\.\d+.\d+.\d+\ )
        (?P<user_name>\w*)
        (\ -\ [a-z]+[0-9]+\ )
        (?P<time>\w*)
        (\[(.*?)\])
        (?P<request>\w*)
        (".*")
        """
        for item in re.finditer(pattern,logdata,re.VERBOSE):
       
            print(item.groupdict())
Dharman
BryantHsiung

2 Answers

You can use the following expression:

(?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . and 1+ digits
\s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
(?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
(?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
"(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "

See the regex demo. See the Python demo:

import re
logdata = r"""159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149"""
pattern = r'''
(?P<host>\d+(?:\.\d+){3}) # 1+ digits, then 3 occurrences of . and 1+ digits
\s+\S+\s+                 # 1+ whitespaces, 1+ non-whitespaces, 1+ whitespaces
(?P<user_name>\S+)\s+\[   # 1+ non-whitespaces (Group "user_name"), 1+ whitespaces and [
(?P<time>[^\]\[]*)\]\s+   # Group "time": 0+ chars other than [ and ], ], 1+ whitespaces
"(?P<request>[^"]*)"      # ", Group "request": 0+ non-" chars, "
'''
for item in re.finditer(pattern,logdata,re.VERBOSE):
    print(item.groupdict())

Output:

{'host': '159.253.153.40', 'user_name': '-', 'time': '21/Jun/2019:15:46:10 -0700', 'request': 'POST /e-business HTTP/1.0'}
{'host': '136.195.158.6', 'user_name': 'feeney9464', 'time': '21/Jun/2019:15:46:11 -0700', 'request': 'HEAD /open-source/markets HTTP/2.0'}
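
To connect this back to the assignment, here is a minimal sketch of how the same pattern could be placed inside a function that returns the results instead of printing them. It assumes the grader expects a list with one dictionary per log line; the helper name `parse_logs` and the `path` parameter are made up for illustration, and only the file path `assets/logdata.txt` comes from the question.

```python
import re

# Same pattern as above, compiled once in verbose mode.
LOG_PATTERN = re.compile(r'''
(?P<host>\d+(?:\.\d+){3})   # dotted host address
\s+\S+\s+                   # skip the second field
(?P<user_name>\S+)\s+\[     # user name, or "-" when there is none
(?P<time>[^\]\[]*)\]\s+     # timestamp between [ and ]
"(?P<request>[^"]*)"        # request between double quotes
''', re.VERBOSE)


def parse_logs(logdata):
    """Hypothetical helper: return one dict per matching log line."""
    return [match.groupdict() for match in LOG_PATTERN.finditer(logdata)]


def logs(path="assets/logdata.txt"):
    """Read the log file and return the parsed entries (assumed return type: list of dicts)."""
    with open(path, "r") as file:
        return parse_logs(file.read())


# Try the helper on the two sample lines from the question.
sample = '''159.253.153.40 - - [21/Jun/2019:15:46:10 -0700] "POST /e-business HTTP/1.0" 504 19845
136.195.158.6 - feeney9464 [21/Jun/2019:15:46:11 -0700] "HEAD /open-source/markets HTTP/2.0" 204 21149'''
entries = parse_logs(sample)
for entry in entries:
    print(entry)
```

Splitting the parsing from the file reading makes it possible to test the pattern on a short string before running it on the whole file.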
Wiktor Stribiżew
  • Thank you so much!!! It worked!!! However, may I ask a question about your solution? It probably sounds stupid, but don't you need to include everything in the parentheses? For example, ("?P[^"]*"). Or are they the same? Also, could you please explain the meaning of "?:" in your regular expression? – BryantHsiung Oct 19 '20 at 13:20
  • @BryantHsiung You can't use `("?P[^"]*")`, it is an invalid regex construct. See more about [non-capturing groups here](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-in-regular-expressions). – Wiktor Stribiżew Oct 19 '20 at 13:42
  • Just did! Thanks again! – BryantHsiung Oct 20 '20 at 12:42
  • I am working on the same question, but I don't know why my for loop doesn't give me any output! I checked my regex patterns on regex101 and they all seem to be working the way they should. – Anoushiravan R Jan 09 '22 at 20:22
  • @AnoushiravanR Without seeing your code, I can't help. – Wiktor Stribiżew Jan 09 '22 at 20:29
  • Ok let me try a bit and I will fix. I think I also have to account for all the characters I don't want to capture too. I only defined those I want to capture. – Anoushiravan R Jan 09 '22 at 20:31
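
On the `?:` question raised in the comments: the short sketch below compares a plain group with a non-capturing one, using the host part of the pattern from the answer. The variable names are illustrative only.

```python
import re

host = "159.253.153.40"

# Plain parentheses: the repeated group is captured too,
# and it keeps only its last repetition.
capturing = re.match(r"(?P<host>\d+(\.\d+){3})", host)
print(capturing.groups())      # ('159.253.153.40', '.40')

# (?:...) groups the tokens for the {3} quantifier without creating a capture.
non_capturing = re.match(r"(?P<host>\d+(?:\.\d+){3})", host)
print(non_capturing.groups())  # ('159.253.153.40',)

# The named groups are identical either way.
print(capturing.groupdict() == non_capturing.groupdict())  # True
```

Since `groupdict()` only reports named groups, both versions would give the same dictionaries here; the non-capturing form simply avoids creating an extra numbered group that nothing uses.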
import re
def names():
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. Ruth and Peter, their parents, have 3 kids."""

    # YOUR CODE HERE
    p=re.findall('[A-Z][a-z]*',simple_string)
    return p

    #raise NotImplementedError()

Check it using the following code:

assert len(names()) == 4, "There are four names in the simple_string"

For more information on regular expressions, read the following documentation, which is very useful for beginners: https://docs.python.org/3/library/re.html#module-re