Regex to use monthly.out for user accounting

Question

I'd like to use Python to analyse /var/log/monthly.out on OS X to export user accounting totals. The log file looks like this:

Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      688.31
    example   401.12
    _mbsetupuser   287.10
    root         0.05
    admin     0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --

I was able to extract the username / totals pairs with this regex:

\t(\w*)\W*(\d*\.\d{2})

In Python:

>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]

But I can't figure out how to extract the date line in such a way that it stays attached to the username / totals pairs for that month.

score 2 · Accepted Answer · edited May 23 '17 at 10:28

Use str.split().

import re

re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'

with open('/var/log/monthly.out', 'r') as f:
    content = f.read()
    sections = content.split('-- End of monthly output --')

    for section in sections:
        date = re.findall(re_date, section)
        matches = re.findall(re_user_amount, section)

        print(date, matches)

If you want to turn the date string into an actual datetime, check out the question "Converting string into datetime".
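As a rough sketch of that conversion: the helper below assumes every date line has the shape shown in the question (weekday, month, day, time, timezone name, year) and that a naive datetime without the timezone is good enough. The name parse_date_line is made up for this example, and %a / %b depend on the current locale being English.

```python
from datetime import datetime

def parse_date_line(date_line):
    """Parse a line such as 'Mon Feb  1 09:12:41 GMT 2016' into a naive datetime."""
    # split() collapses the double space that pads single-digit days
    weekday, month, day, clock, _tz, year = date_line.split()
    # Assumption: the timezone name is simply dropped
    return datetime.strptime(
        " ".join([weekday, month, day, clock, year]),
        "%a %b %d %H:%M:%S %Y",
    )

print(parse_date_line("Mon Feb  1 09:12:41 GMT 2016"))
```

If the timezone matters for your reports, you would need to handle the fifth token yourself rather than discard it.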

answered Mar 07 '16 at 17:20 by Tomalak (edited May 23 '17 at 10:28 by Community)

Ahh perfect thank you. I'd got fixated on doing it all in one regex for some reason :) – SillyWilly Mar 07 '16 at 17:29
Note that the date regex is quite a bit more specific than it needs to be for your input. You are free to dumb it down. – Tomalak Mar 07 '16 at 17:49
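
For illustration, a looser date pattern could look like the sketch below. It assumes that date lines are the only lines starting with two three-letter words followed by a number, which holds for the sample in the question but is not guaranteed for every log. The name re_date_loose and the shortened sample text are made up for this example.

```python
import re

# Looser pattern: weekday, month, day number, then the rest of the line.
# re.MULTILINE makes ^ and $ match at every line boundary.
re_date_loose = re.compile(r'^\w{3} \w{3} +\d+ .*$', re.MULTILINE)

section = (
    "Mon Feb  1 09:12:41 GMT 2016\n"
    "\n"
    "Rotating fax log files:\n"
    "\n"
    "Doing login accounting:\n"
    "    total      688.31\n"
    "    example   401.12\n"
)

print(re_date_loose.findall(section))
```

Literal spaces are used instead of \s+ so the pattern cannot run across a line break.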

score 2 · Answer 2 · answered Mar 07 '16 at 17:31

Well, there's rarely a magical regex cure for everything. Regexes are a great tool for simple string parsing, but they shouldn't replace good old programming!

So if you look at your data, you'll notice that each report always starts with a date and ends with the -- End of monthly output -- line. So a nice way to handle that is to split your data into one chunk per monthly output.

Let's start with your data:

>>> s = """\
... Mon Feb  1 09:12:41 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total      688.31
...     example   401.12
...     _mbsetupuser   287.10
...     root         0.05
...     admin     0.04
... 
... -- End of monthly output --
... 
... Tue Feb 16 14:27:21 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total        0.00
... 
... -- End of monthly output --
... 
... Thu Mar  3 09:37:31 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total      377.92
...     example   377.92
... 
... -- End of monthly output --"""

And let's split it on that end-of-month line:

>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb  1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total      688.31\n    example   401.12\n    _mbsetupuser   287.10\n    root         0.05\n    admin     0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total        0.00\n\n', '\n\nThu Mar  3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total      377.92\n    example   377.92\n\n', '']

Then you can separate the accounting data from the rest of the log:

>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')

Now let's extract the date line:

>>> date_line = head.strip().split('\n')[0]

And fill up a dict with those username/totals pairs:

>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))

The trick here is to use zip() to create pairs out of two slices of tail.split(). The left side of each pair comes from the slice starting at index 0 and taking every second item (the usernames); the right side comes from the slice starting at index 1 and taking every second item (the totals). Which gives:

{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}
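
To make the slicing step easier to follow, here is the same idea spelled out on a shortened tail. The float() conversion is an extra assumption for callers who want numbers instead of strings; it is not part of the one-liner above.

```python
tail = """
    total      688.31
    example   401.12
    root         0.05
"""

tokens = tail.split()   # whitespace-separated tokens, names and totals alternating
names = tokens[::2]     # every second token, starting at index 0
totals = tokens[1::2]   # every second token, starting at index 1

# zip() pairs the two lists element by element
accounting = {name: float(total) for name, total in zip(names, totals)}
print(accounting)
```

This only works as long as usernames contain no whitespace, since split() would then break the name/total alternation.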

So now that's done, you can do that in a for loop:

import datetime

def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
        for report in filter(lambda it: it, reports):
            head, tail = report.split('Doing login accounting:')
            date_line = head.strip().split('\n')[0]
            accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
            yield {
                'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
                'accounting': accounting
            }

>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
                    'admin': '0.04',
                    'example': '401.12',
                    'root': '0.05',
                    'total': '688.31'},
    'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
  { 'accounting': { 'total': '0.00'},
    'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
  { 'accounting': { 'example': '377.92', 'total': '377.92'},
    'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]

And there you go with a pythonic solution without a single regex.

N.B.: I did a little trick with the datetime: because the log pads the day number with a space rather than a zero, I used str.replace() to turn the double space into ' 0' within the date string before passing it to strptime().

N.B.: the filter() and the strip() used around the for report… loop are there to remove leading and trailing empty reports, depending on how the log file starts or ends.
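
A small sketch of why both steps matter, using made-up report text: splitting on the marker can leave a whitespace-only piece at the end, and a string such as '\n' is still truthy, so filtering on truthiness alone would keep it.

```python
marker = '-- End of monthly output --'
# Hypothetical log text that ends with a newline after the last marker
log_text = "report one\n" + marker + "\n\nreport two\n" + marker + "\n"

pieces = log_text.split(marker)
# The last piece is '\n': not empty, so filter(None, pieces) would keep it

# Strip each piece first, then drop the ones that end up empty
reports = [piece.strip() for piece in pieces if piece.strip()]
print(reports)
```

Stripping the whole file content before splitting, as the function above does, covers the same case when the file ends right after the last marker.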

well, even though I love doing regex, I consider that it's not the right tool for everything. Here you could do a huge unreadable regex to parse your document, or you can use a few basic python operations and you're set. And the advantage of pure python, is that it makes your code a bit easier to understand (`dict(zip(tail.split()[::2], tail.split()[1::2]))` will always be easier to understand than `dict(re.findall("\n\s+(\w+)\s+([\d\.]+)", sec))`) and might be even more flexible for future changes. — zmo, Mar 07 '16 at 18:05
BTW, I did a profiling of both versions, it takes exactly the same time to parse the file as pure python or with regex. — zmo, Mar 07 '16 at 18:06
Since batch jobs like this one are hardly ever performance-critical, I would not have bothered measuring at all. :) To be absolutely honest, I consider `dict(zip(tail.split()[::2], tail.split()[1::2]))` a lot less readable than `re.findall(re_user_amount, section)` - partly because regex has the charming property that you can give complex operations a name like `re_user_amount`, partly because the string-slicing-zipping-dict-converting makes my brain hurt, whereas even the raw `\s+(\w+)\s+(\d*\.\d{2})` is crystal-clear to me. — Tomalak, Mar 07 '16 at 18:15
well, TBH, both are crystal clear to me. I still prefer to use pure pythonic stuff when it's easy enough to do, and with the `zip()` + index magic, it's just piece of cake ☺. And the `zip()` trick is a good one to know, as it's very useful and flexible in territories where regex are almost impossible to work out. — zmo, Mar 07 '16 at 18:18

score 0 · Answer 3 · answered Mar 07 '16 at 17:30

Here's something shorter:

import re

with open("/var/log/monthly.out") as f:
    # Split the file into one stripped section per monthly report
    months = map(str.strip, f.read().split("-- End of monthly output --"))
    for sec in filter(None, months):  # skip empty sections
        date = sec.splitlines()[0]
        accs = re.findall(r"\n\s+(\w+)\s+([\d\.]+)", sec)
        print(date, accs)

This divides the file content into months, extracts the date of each month and searches for all accounts in each month.
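
If you would rather collect the results than print them, the same steps could be wrapped up roughly like this. The function name accounting_by_date and the float() conversion are assumptions added for this sketch, and the sample text is a shortened copy of the log from the question.

```python
import re

def accounting_by_date(text):
    """Map each report's date line to a {username: total} dict."""
    result = {}
    for section in text.split("-- End of monthly output --"):
        section = section.strip()
        if not section:
            continue  # skip the empty piece after the last marker
        date = section.splitlines()[0]
        pairs = re.findall(r"\n\s+(\w+)\s+([\d\.]+)", section)
        result[date] = {user: float(total) for user, total in pairs}
    return result

sample = """Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --
"""

print(accounting_by_date(sample))
```

Keying on the raw date line assumes no two reports share the same timestamp; a list of (date, accounts) tuples would avoid that assumption.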

score 0 · Answer 4 · answered Mar 07 '16 at 19:39

You may want to try the following regex, which is not so elegant though:

import re

string = """
Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      688.31
    example   401.12
    _mbsetupuser   287.10
    root         0.05
    admin     0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --
"""
pattern = r'(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'
print(re.findall(pattern, string))

Output:

[('Mon Feb  1 09:12:41 GMT 2016', 'total      688.31', 'example   401.12', '_mbsetupuser   287.10', 'root         0.05', 'admin     0.04'), 
('Tue Feb 16 14:27:21 GMT 2016', 'total        0.00', '', '', '', ''), 
('Thu Mar  3 09:37:31 GMT 2016', 'total      377.92', 'example   377.92', '', '', '')]
