
I'm setting up an ETL process for my company's S3 buckets so we can track our usage, and I've run into some trouble breaking up the columns of the S3 log file because Amazon uses spaces, double quotes, and square brackets to delimit columns.

I found this regex: [^\s"']+|"([^"]*)"|'([^']*)' in this SO post: Regex for splitting a string using space when not surrounded by single or double quotes, and it's gotten me pretty close. I just need help adjusting it to ignore single quotes and also to ignore spaces between a "[" and a "]".

Here's an example line from one of our files:

dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 - 013259AC1A20DF37 REST.GET.OBJECT ourbucket.name.config.txt "GET /ourbucket.name.config.txt HTTP/1.1" 200 - 325 325 16 16 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" -

And here's the format definition: http://s3browser.com/amazon-s3-bucket-logging-server-access-logs.php

Any help would be appreciated!

EDIT: in response to FaileDev, the output should be any string contained between two square brackets, e.g. [foo bar], between two double quotes, e.g. "foo bar", or between spaces, e.g. foo bar (where foo and bar would each match individually). I've broken each match in the example line I provided into its own line in the following block:

dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 
ourbucket.name.config 
[31/Oct/2011:17:00:04 +0000] 
184.191.213.218 
- 
013259AC1A20DF37 
REST.GET.OBJECT 
ourbucket.name.config.txt 
"GET /ourbucket.name.config.txt HTTP/1.1" 
200 
- 
325 
325 
16 
16 
"-" 
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" 
-
Ben M.

7 Answers


Here is a dumb regex I wrote to parse S3 log files in Node:

/^(.*?)\s(.*?)\s(\[.*?\])\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(\".*?\")\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(\".*?\")\s(\".*?\")\s(.*?)$/

As I said, this is "dumb" - it relies heavily on them not changing the log format, and each field not containing any weird characters.
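As a sanity check (my harness, not part of the original answer), the same positional pattern can be run in Python against the sample line from the question; brackets and quotes stay attached to the captured fields:

```python
import re

# The answer's positional pattern, verbatim (18 capture groups).
DUMB = re.compile(
    r'^(.*?)\s(.*?)\s(\[.*?\])\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s'
    r'(\".*?\")\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s'
    r'(\".*?\")\s(\".*?\")\s(.*?)$'
)

# The example line from the question.
line = ('dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 '
        'ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 '
        '- 013259AC1A20DF37 REST.GET.OBJECT ourbucket.name.config.txt '
        '"GET /ourbucket.name.config.txt HTTP/1.1" 200 - 325 325 16 16 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) '
        'Gecko/20070725 Firefox/2.0.0.6" -')

m = DUMB.match(line)
fields = m.groups()  # 18 fields; brackets/quotes still attached
```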


You can't do it using string.Split; you need to iterate through all captures of the 'column' group (if you're using C#).

This matches a non-quoted, non-bracketed field: [^\s\"\[\]]+
This matches a bracketed field: \[[^\]\[]+\] 
This matches a quoted field: \"[^\"]+\"

It's easiest to leave the quotes and brackets on during matching, then strip them off afterwards using Trim('[', ']', '"').

@"^((?<column>[^\s\"\[\]]+|\[[^\]\[]+\]|\"[^\"]+\")\s+)+$"
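If you just want the tokens rather than a single anchored match, the same three alternates can also be combined for a findall-style scan. Here is a quick Python sketch of that idea (the harness is mine, not the answer's): the bracketed and quoted alternates are listed first so they win before the catch-all, and the Trim step becomes strip:

```python
import re

# Bracketed and quoted fields first, plain (unquoted) fields last.
FIELD = re.compile(r'\[[^\]\[]+\]|"[^"]+"|[^\s"\[\]]+')

# The example line from the question.
line = ('dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 '
        'ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 '
        '- 013259AC1A20DF37 REST.GET.OBJECT ourbucket.name.config.txt '
        '"GET /ourbucket.name.config.txt HTTP/1.1" 200 - 325 325 16 16 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) '
        'Gecko/20070725 Firefox/2.0.0.6" -')

# Equivalent of Trim('[', ']', '"'): drop the delimiters after matching.
tokens = [t.strip('[]"') for t in FIELD.findall(line)]
```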
Lilith River
  • Thanks, the ORing patterns worked fine. This string pattern works best for C# : @"([^\s\""\[\]]+)|(\[[^\]\[]+\])|(\""[^\""]+\"")" – Ben M. Nov 01 '11 at 14:20
  • Thanks. It seems stack overflow removed my slashes.... I forgot to embed it in a code block. Updating now. – Lilith River Nov 02 '11 at 07:09

This is a Python solution that may help someone. It also removes the quotes and square brackets for you:

import re
log = '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be mybucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be A1206F460EXAMPLE REST.GET.BUCKETPOLICY - "GET /mybucket?policy HTTP/1.1" 404 NoSuchBucketPolicy 297 - 38 - "-" "S3Console/0.4" -'

regex = r'(?:"([^"]+)")|(?:\[([^\]]+)\])|([^ ]+)'

# Result is a list of triples, with only one having a value
# (due to the three group types: '""' or '[]' or '')
result = re.compile(regex).findall(log)
for a, b, c in result:
    print(a or b or c)

Output:

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be
mybucket
06/Feb/2014:00:00:38 +0000
192.0.2.3
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be
A1206F460EXAMPLE
REST.GET.BUCKETPOLICY
-
GET /mybucket?policy HTTP/1.1
404
NoSuchBucketPolicy
297
-
38
-
-
S3Console/0.4
-
Jon

I agree with @andy! I can't believe more people aren't dealing with S3's access logs, considering how long they have been around.


This is the regexp I used

/(?:([a-z0-9]+)|-) (?:([a-z0-9\.-_]+)|-) (?:\[([^\]]+)\]|-) (?:([0-9\.]+)|-) (?:([a-z0-9]+)|-) (?:([a-z0-9.-_]+)|-) (?:([a-z\.]+)|-) (?:([a-z0-9\.-_\/]+)|-) (?:"-"|"([^"]+)"|-) (?:(\d+)|-) (?:([a-z]+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:"-"|"([^"]+)"|-) (?:"-"|"([^"]+)"|-) (?:([a-z0-9]+)|-)/i

If you are using Node.js you can use my module to make this much easier to deal with, or port it to C#; the basic ideas are all there.

https://github.com/icodeforlove/s3-access-log-parser

Chad Cache

I tried using this in C# but found there were some incorrect characters in the answer above, and you have to put the regex for the non-quoted, non-bracketed field at the end, otherwise it matches everything (tested with http://regexstorm.net/tester).

The full regex puts the bracketed field first, the quoted field second, and the non-quoted, non-bracketed field last.

A simple C# implementation:

    MatchCollection matches = Regex.Matches(contents, @"(\[[^\]\[]+\])|(""[^""]+"")|([^\s""\[\]]+)");
    for (int i = 0; i < matches.Count; i++)
    {
        Console.WriteLine(i + ": " + matches[i].ToString().Trim('[', ']', '"'));
    }
parsley72

Here is the regex I copied from the AWS Knowledge Center and modified a bit to make it work in ASP.NET Core:

new Regex("([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)");

It works fine for us. If anyone wants a C# class to store the access log, below is the code that parses each line of the log file and creates an S3ServerAccessLog object for it.

private List<S3ServerAccessLog> ParseLogs(string accessLogs)
{
    // split log file per new line since each log will be on a single line.
    var splittedLogs = accessLogs.Split("\r\n", StringSplitOptions.RemoveEmptyEntries);
    var parsedLogs = new List<S3ServerAccessLog>();

    foreach (var logLine in splittedLogs)
    {
        var parsedLog = ACCESS_LOG_REGEX.Split(logLine).Where(s => s.Length > 0).ToList();
                
        // construct the log model from the split fields
        var logModel = new S3ServerAccessLog
        {
            BucketOwner = parsedLog[0],
            BucketName = parsedLog[1],
            RequestDateTime = DateTimeOffset.ParseExact(parsedLog[2], "dd/MMM/yyyy:HH:mm:ss K", CultureInfo.InvariantCulture),
            RemoteIP = parsedLog[3],
            Requester = parsedLog[4],
            RequestId = parsedLog[5],
            Operation = parsedLog[6],
            Key = parsedLog[7],
            RequestUri = parsedLog[8].Replace("\"", ""),
            HttpStatus = int.Parse(parsedLog[9]),
            ErrorCode = parsedLog[10],
            BytesSent = parsedLog[11],
            ObjectSize = parsedLog[12],
            TotalTime = parsedLog[13],
            TurnAroundTime = parsedLog[14],
            Referrer = parsedLog[15].Replace("\"", ""),
            UserAgent = parsedLog[16].Replace("\"", ""),
            VersionId = parsedLog[17],
            HostId = parsedLog[18],
            Sigv = parsedLog[19],
            CipherSuite = parsedLog[20],
            AuthType = parsedLog[21],
            EndPoint = parsedLog[22],
            TlsVersion = parsedLog[23]
        };

        parsedLogs.Add(logModel);
    }

    return parsedLogs;
}
Khizar Iqbal

I wasn't able to get any of the posted solutions to parse a log file entry that has a request URI containing double quotes, so this is what I ended up with in Python:

import json
import re
from collections import namedtuple

FILENAME = '/tmp/2022-11/2022-11-01-20-21-34-AB64DC3459FF2F2B'

# define a named tuple to represent each log entry
LogEntry = namedtuple(
    'LogEntry',
    [
        'bucket_owner',
        'bucket',
        'timestamp',
        'remote_ip',
        'requester',
        'request_id',
        'operation',
        's3_key',
        'request_uri',
        'http_version',
        'status_code',
        'error_code',
        'bytes_sent',
        'object_size',
        'total_time',
        'turn_around_time',
        'referrer',
        'user_agent',
        'version_id',
        'host_id',
        'sigv',
        'cipher_suite',
        'auth_type',
        'endpoint',
        'tls_version',
        'access_point_arn'
    ]
)

# compile the regular expression for parsing log entries
LOG_ENTRY_PATTERN = re.compile(
    r'(\S+) (\S+) \[(.+)\] (\S+) (\S+) (\S+) (\S+) (\S+) "(.*) HTTP\/(\d\.\d)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "(\S+)" "(.*)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)'
)

# open the access log file
with open(FILENAME, 'r') as f:
    # iterate over each line in the file
    for line in f:
        # ignore certain types of operations
        if 'BATCH.DELETE.OBJECT' not in line \
                and 'S3.TRANSITION_SIA.OBJECT' not in line \
                and 'REST.COPY.OBJECT_GET' not in line:
            # parse the log entry using the regular expression
            match = LOG_ENTRY_PATTERN.match(line)

            if match:
                # create a LogEntry named tuple from the parsed log entry
                log_entry = LogEntry(*match.groups())
                log_entry = dict(log_entry._asdict())

                for key in log_entry:
                    if log_entry[key] == '-':
                        log_entry[key] = None

                print(json.dumps(log_entry, indent=4, default=str))

I personally find it cleaner to work with a namedtuple, which I then cast to a dict so it can easily be inserted into a MySQL database, than to work with a list.

Ashley Kleynhans