Python parse GET|POST path from access logs

Question

Suppose we have some access logs like this

83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"

65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276

151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"

10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"

I need to get the bolded text parts after POST and GET (path to files).
the log format may be vary but the request type and path will always exist.

I tried to with the following but It didn't always work because the log format not the same

parts = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # indent %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

def get_structured_access_logs_list(access_logs):
    pattern = re.compile(r'\s+'.join(parts) + r'\s*\Z')

    # Initialize required variables
    log_data = []

    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        try:
            log_data.append(pattern.match(line).groupdict())
        except:
            pass
    return log_data

def parse_path(request_string) :
    rx = re.compile(r'^(?:GET|POST)\s+([^?\s]+).*$', re.M)
    return rx.findall(request_string)


def get_file_paths(access_logs_list):
    file_path_set = set()
    for dict in access_logs_list:
        if 'request' in dict.keys():
            file_name = parse_path(dict['request'])[0] # passing a single line, the list will contain only 1 element
            if file_name is not None:
                file_path_set.add(full_path)
    return accessed_file_set

UPDATE: after adjusting the code, the function 'get_file_paths' will return a set contains full path to files accessed in access logs

def parse_path(request_string) :
    rx = re.compile(r'"(?:GET|POST)\s+([^\s?]*)', re.M)
    return rx.findall(request_string)


def get_file_paths(access_logs):
    file_set = set()
    for line in access_logs:
            matches = parse_accessed_file_name_list(line) # passing a single line, the list will contain only 1 element
            if matches is None or len(matches) <= 0:
                continue
            full_path = root_path + matches[0]
            if os.path.isfile(full_path):
                file_set.add(full_path)
    return file_set

See the full solution [here](https://stackoverflow.com/a/55491666/3832970). — Wiktor Stribiżew, Apr 03 '19 at 09:34

Wiktor Stribiżew · Answer 1 · 2019-04-03T09:32:21.567

You may use

(?x)^
    (?P<host>\S+)                         \s+         # host %h
    \S+                                   \s+         # indent %l (unused)
    (?P<user>\S+)                         \s+         # user %u
    \[(?P<time>.*?)\]                     \s+         # time %t
    "\S+\s+(?P<request>[^"?\s]*)[^"]*"    \s+         # request "%r"
    (?P<status>[0-9]+)                    \s+         # status %>s
    (?P<size>\S+)                      (?:\s+         # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"         \s+         # referrer "%{Referer}i"
    "(?P<agent>[^"]*)"                 (?:\s+         # user agent "%{User-agent}i"
    "[^"]*"                            )? )?          # unused
$

See the regex demo.

There are a lot of minor improvements I introduced (see [^"]* instead of .*), the major ones are optional non-capturing groups to match referrer and agent fields that may be missing and the request pattern that looks like (?P<request>[^"?\s]*) and only captures 0 or more chars other than whitespace, ? and " char,while the subsequent [^"]*" matches the rest of the field.

Also, it makes sense to compile the pattern once, not as you do it when processing each line.

The (?x) modifier enables the free spacing mode making it possible to format the pattern on multiple lines and add comments.

Python demo:

import re
pattern = re.compile(r"""(?x)^
    (?P<host>\S+)                         \s+         # host %h
    \S+                                   \s+         # indent %l (unused)
    (?P<user>\S+)                         \s+         # user %u
    \[(?P<time>.*?)\]                     \s+         # time %t
    "\S+\s+(?P<request>[^"?\s]*)[^"]*"    \s+         # request "%r"
    (?P<status>[0-9]+)                    \s+         # status %>s
    (?P<size>\S+)                      (?:\s+          # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"         \s+         # referrer "%{Referer}i"
    "(?P<agent>[^"]*)"                 (?:\s+         # user agent "%{User-agent}i"
    "[^"]*"                           )?)?            # optional argument (unused)
$""")

def get_structured_access_logs_list(access_logs):
    # Initialize required variables
    log_data = []
    # Get components from each line of the log file into a structured dict
    for line in access_logs:
        try:
            log_data.append(pattern.match(line).groupdict())
        except:
            pass
    return log_data

lines = ['83.198.250.175 - - [22/Mar/2009:07:40:06 +0100] "GET /images/ht1.gif HTTP/1.1" 200 61 "http://www.facades.fr/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Wanadoo 6.7; Orange 8.0)" "-"',
'65.33.94.190 - - [05/Apr/2003:17:26:27 -0500] "POST /samples/dem/tt.php?x=e2323 HTTP/1.0" 404 276',
'151.227.152.48 - - [02/Jul/2014:14:35:55 +0100] "GET /css/main.css HTTP/1.1" 200 4658 "http://stanmore.menczykowski.co.uk/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"',
'10.143.2.119 64.103.161.112 - [06/Jan/1970:00:48:01 +0000] "GET /right_arrow.jpg HTTP/1.1" 304 0 "http://64.103.161.112/index_eth_diag.html" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36"']
for res in get_structured_access_logs_list(lines):
    print(res)

@Salama Note that your `r'"(?P.*)"'` and `r'"(?P.*)"'` yield wrong output if you have 3 or more `"..."` fields at the end of the line. You cannot use your approach with dynamic pattern building easily as you have optional fields, thus, I think the free-spacing regex with comments is a more flexible approach here that helps keep the same level of readability. — Wiktor Stribiżew, Apr 03 '19 at 09:41

Pushpesh Kumar Rajwanshi · Answer 2 · 2019-04-03T09:16:26.827

1

You can use this regex and get the path from group1,

^.*?"(?:GET|POST) ([^\s?]+)

Demo

edited Apr 03 '19 at 09:16

answered Apr 03 '19 at 09:10

Pushpesh Kumar Rajwanshi

18,127
2
19
36

score 1 · Accepted Answer · answered Apr 03 '19 at 09:11

1

Since your regex is very generic (you use \S and . that are very broad), why don't you use directly:

"(?:GET|POST)\s+([^\s?]*)

[^\s?] matches all the characters that are not spaces nor question marks.

See here a demo.

answered Apr 03 '19 at 09:11

logi-kal

7,107
6
31
43

Python parse GET|POST path from access logs

3 Answers3

Linked