I'd like to use Python to analyse /var/log/monthly.out on OS X to export user accounting totals. The log file looks like this:

Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      688.31
    example   401.12
    _mbsetupuser   287.10
    root         0.05
    admin     0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --

I was able to extract the username / totals pairs with this regex:

\t(\w*)\W*(\d*\.\d{2})

In Python:

>>> import re
>>> re.findall(r'\t(\w*)\W*(\d*\.\d{2})', open('/var/log/monthly.out', 'r').read())
[('total', '688.31'), ('example', '401.12'), ('_mbsetupuser', '287.10'), ('root', '0.05'), ('admin', '0.04'), ('total', '0.00'), ('total', '377.92'), ('example', '377.92')]

But I can't figure out how to extract the date line so that it stays attached to the username / totals pairs for that month.

SillyWilly

4 Answers

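Use str.split() to break the log into one section per monthly report, then run the regexes on each section separately.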

import re

re_user_amount = r'\s+(\w+)\s+(\d*\.\d{2})'
re_date = r'\w{3}\s+\w{3}\s+\d+\s+\d\d:\d\d:\d\d \w+ \d{4}'

with open('/var/log/monthly.out', 'r') as f:
    content = f.read()
    sections = content.split('-- End of monthly output --')

    for section in sections:
        date = re.findall(re_date, section)
        matches = re.findall(re_user_amount, section)

        print(date, matches)

If you want to turn the date string into an actual datetime, check out Converting string into datetime.
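A minimal sketch of that conversion, assuming every date line follows the layout shown in the question and that weekday and month names are in English. The helper name parse_log_date is hypothetical, not something from the answer above.

```python
from datetime import datetime

def parse_log_date(date_line):
    """Turn e.g. 'Mon Feb  1 09:12:41 GMT 2016' into a naive datetime."""
    # split() + join collapses the double space before single-digit days.
    normalized = ' '.join(date_line.split())
    # %Z accepts 'GMT'; the resulting datetime carries no tzinfo.
    return datetime.strptime(normalized, '%a %b %d %H:%M:%S %Z %Y')

print(parse_log_date('Mon Feb  1 09:12:41 GMT 2016'))
```

Since re.findall() returns a list, you would pass it the first element, e.g. parse_log_date(date[0]), after checking that the list is not empty.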

Tomalak
  • Ahh perfect thank you. I'd got fixated on doing it all in one regex for some reason :) – SillyWilly Mar 07 '16 at 17:29
  • Note that the date regex is quite a bit more specific than it needs to be for your input. You are free to dumb it down. – Tomalak Mar 07 '16 at 17:49

Well, regex is rarely a magical cure for everything. Regular expressions are a great tool for simple string parsing, but they should not replace good old programming!

If you look at your data, you'll notice that each report always starts with a date and ends with the -- End of monthly output -- line. So a nice way to handle it is to split your data into one chunk per monthly output.

Let's start with your data:

>>> s = """\
... Mon Feb  1 09:12:41 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total      688.31
...     example   401.12
...     _mbsetupuser   287.10
...     root         0.05
...     admin     0.04
... 
... -- End of monthly output --
... 
... Tue Feb 16 14:27:21 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total        0.00
... 
... -- End of monthly output --
... 
... Thu Mar  3 09:37:31 GMT 2016
... 
... Rotating fax log files:
... 
... Doing login accounting:
...     total      377.92
...     example   377.92
... 
... -- End of monthly output --"""

And let's split it on that end-of-month line:

>>> reports = s.split('-- End of monthly output --')
>>> reports
['Mon Feb  1 09:12:41 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total      688.31\n    example   401.12\n    _mbsetupuser   287.10\n    root         0.05\n    admin     0.04\n\n', '\n\nTue Feb 16 14:27:21 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total        0.00\n\n', '\n\nThu Mar  3 09:37:31 GMT 2016\n\nRotating fax log files:\n\nDoing login accounting:\n    total      377.92\n    example   377.92\n\n', '']

Then you can separate the accounting data from the rest of the log:

>>> report = reports[0]
>>> head, tail = report.split('Doing login accounting:')

Now let's extract the date line:

>>> date_line = head.strip().split('\n')[0]

And fill up a dict with those username/totals pairs:

>>> accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))

The trick here is to use zip() to create pairs out of two slices of tail.split(). The left side of each pair comes from the slice starting at index 0 and taking every second item; the right side comes from the slice starting at index 1 and taking every second item. Which gives:

{'admin': '0.04', 'root': '0.05', 'total': '688.31', '_mbsetupuser': '287.10', 'example': '401.12'}
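To make the slicing easier to follow, here is the same pairing spelled out step by step on a shortened sample. The variable names tokens, names and totals are only for illustration, and the sketch assumes usernames never contain whitespace.

```python
tail = """
    total      688.31
    example   401.12
    root         0.05
"""

tokens = tail.split()      # ['total', '688.31', 'example', '401.12', 'root', '0.05']
names = tokens[::2]        # every second token, starting at index 0
totals = tokens[1::2]      # every second token, starting at index 1

# zip() walks both lists in step and dict() consumes the (name, total) pairs.
accounting = dict(zip(names, totals))
print(accounting)
```

The totals stay as strings here, as in the answer; wrap them in float() if you need to do arithmetic on them.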

So now that's done, you can do that in a for loop:

import datetime

def parse_monthly_log(log_path='/var/log/monthly.out'):
    with open(log_path, 'r') as log:
        reports = log.read().strip('\n ').split('-- End of monthly output --')
        for report in filter(lambda it: it, reports):
            head, tail = report.split('Doing login accounting:')
            date_line = head.strip().split('\n')[0]
            accounting = dict(zip(tail.split()[::2], tail.split()[1::2]))
            yield {
                'date': datetime.datetime.strptime(date_line.replace('  ', ' 0'), '%a %b %d %H:%M:%S %Z %Y'),
                'accounting': accounting
            }

>>> import pprint
>>> pprint.pprint(list(parse_monthly_log()), indent=2)
[ { 'accounting': { '_mbsetupuser': '287.10',
                    'admin': '0.04',
                    'example': '401.12',
                    'root': '0.05',
                    'total': '688.31'},
    'date': datetime.datetime(2016, 2, 1, 9, 12, 41)},
  { 'accounting': { 'total': '0.00'},
    'date': datetime.datetime(2016, 2, 16, 14, 27, 21)},
  { 'accounting': { 'example': '377.92', 'total': '377.92'},
    'date': datetime.datetime(2016, 3, 3, 9, 37, 31)}]

And there you go with a pythonic solution without a single regex.

N.B.: I had to use a small trick for the datetime: the log pads single-digit day numbers with a space rather than a zero, so I used str.replace() to turn the double space into ' 0' before handing the string to strptime().

N.B.: the strip() and filter() calls around the for report… loop remove leading and trailing empty reports, depending on how the log file starts or ends.
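A small sketch of why that cleanup matters, using a made-up two-report string in place of the real log (the report bodies here are placeholders, not real log content):

```python
marker = '-- End of monthly output --'
content = 'first report\n' + marker + '\n\nsecond report\n' + marker + '\n'

# Without cleanup, the text after the last marker becomes a stray '\n' entry.
raw = content.split(marker)

# strip() trims the outer whitespace first, so that last entry becomes '',
# and filter() then drops every empty string.
reports = list(filter(lambda it: it, content.strip('\n ').split(marker)))

print(raw)
print(reports)
```

A piece made up only of whitespace would still pass the filter, which is why the outer strip() runs before the split.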

zmo
  • Wow, you are going out of your way to avoid regex. ;) – Tomalak Mar 07 '16 at 17:50
  • well, even though I love doing regex, I consider that it's not the right tool for everything. Here you could do a huge unreadable regex to parse your document, or you can use a few basic python operations and you're set. And the advantage of pure python, is that it makes your code a bit easier to understand (`dict(zip(tail.split()[::2], tail.split()[1::2]))` will always be easier to understand than `dict(re.findall("\n\s+(\w+)\s+([\d\.]+)", sec))`) and might be even more flexible for future changes. – zmo Mar 07 '16 at 18:05
  • BTW, I did a profiling of both versions, it takes exactly the same time to parse the file as pure python or with regex. – zmo Mar 07 '16 at 18:06
  • Since batch jobs like this one are hardly ever performance-critical, I would not have bothered measuring at all. :) To be absolutely honest, I consider `dict(zip(tail.split()[::2], tail.split()[1::2]))` a lot less readable than `re.findall(re_user_amount, section)` - partly because regex has the charming property that you can give complex operations a name like `re_user_amount`, partly because the string-slicing-zipping-dict-converting makes my brain hurt, whereas even the raw `\s+(\w+)\s+(\d*\.\d{2})` is crystal-clear to me. – Tomalak Mar 07 '16 at 18:15
  • well, TBH, both are crystal clear to me. I still prefer to use pure pythonic stuff when it's easy enough to do, and with the `zip()` + index magic, it's just piece of cake ☺. And the `zip()` trick is a good one to know, as it's very useful and flexible in territories where regex are almost impossible to work out. – zmo Mar 07 '16 at 18:18

Here's something shorter:

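import re

with open("/var/log/monthly.out") as f:
    # One stripped chunk per monthly report
    months = map(str.strip, f.read().split("-- End of monthly output --"))
    # filter(None, ...) drops the empty chunk left after the last marker
    for sec in filter(None, months):
        date = sec.splitlines()[0]
        accs = re.findall(r"\n\s+(\w+)\s+([\d\.]+)", sec)
        print(date, accs)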

This divides the file content into months, extracts the date of each month and searches for all accounts in each month.
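Building on the same split-then-search idea, here is a rough sketch of how the pairs could be flattened into date / user / total rows for export. The function name monthly_rows, the CSV column names, and the shortened sample text are all assumptions for illustration; it writes to an in-memory buffer rather than a real file.

```python
import csv
import io
import re

MARKER = '-- End of monthly output --'

def monthly_rows(text):
    """Yield (date_line, user, total) for every accounting line in text."""
    for sec in filter(None, map(str.strip, text.split(MARKER))):
        date = sec.splitlines()[0]
        for user, total in re.findall(r'\n\s+(\w+)\s+([\d.]+)', sec):
            yield date, user, total

sample = """Mon Feb  1 09:12:41 GMT 2016

Doing login accounting:
    total      688.31
    example   401.12

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Doing login accounting:
    total        0.00

-- End of monthly output --
"""

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(['date', 'user', 'total'])   # header row (assumed names)
writer.writerows(monthly_rows(sample))
print(out.getvalue())
```

Swapping io.StringIO() for open('totals.csv', 'w', newline='') would write the same rows to disk; the file name is just an example.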

Zach Gates

You may want to try the following regex, which is not so elegant though:

import re

string = """
Mon Feb  1 09:12:41 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      688.31
    example   401.12
    _mbsetupuser   287.10
    root         0.05
    admin     0.04

-- End of monthly output --

Tue Feb 16 14:27:21 GMT 2016

Rotating fax log files:

Doing login accounting:
    total        0.00

-- End of monthly output --

Thu Mar  3 09:37:31 GMT 2016

Rotating fax log files:

Doing login accounting:
    total      377.92
    example   377.92

-- End of monthly output --
"""
# Note: the pattern captures at most five "user total" entries per report.
pattern = r'(\w+\s+\w+\s+[\d:\s]+[A-Z]{3}\s+\d{4})[\s\S]+?((?:\w+)\s+(?:[0-9.]+))\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s+(?:((?:\w+)\s*(?:[0-9.]+)))?\s*(?:((?:\w+)\s+(?:[0-9.]+)))?\s*(?:((?:\w+)\s*(?:[0-9.]+)))?'
print(re.findall(pattern, string))

Output:

[('Mon Feb  1 09:12:41 GMT 2016', 'total      688.31', 'example   401.12', '_mbsetupuser   287.10', 'root         0.05', 'admin     0.04'), 
('Tue Feb 16 14:27:21 GMT 2016', 'total        0.00', '', '', '', ''), 
('Thu Mar  3 09:37:31 GMT 2016', 'total      377.92', 'example   377.92', '', '', '')]

REGEX DEMO.
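If you go this route, the tuples still need a little post-processing, because each entry is one combined "user total" string and unused optional groups come back empty. A possible sketch, with to_record as a hypothetical helper name and the first two tuples of the output above as input:

```python
matches = [
    ('Mon Feb  1 09:12:41 GMT 2016', 'total      688.31', 'example   401.12',
     '_mbsetupuser   287.10', 'root         0.05', 'admin     0.04'),
    ('Tue Feb 16 14:27:21 GMT 2016', 'total        0.00', '', '', '', ''),
]

def to_record(match):
    """Split one findall() tuple into (date, {user: total})."""
    date, *entries = match
    # Unused optional groups come back as '', so skip them.
    pairs = (entry.split() for entry in entries if entry)
    return date, {user: total for user, total in pairs}

for date, accounting in map(to_record, matches):
    print(date, accounting)
```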

Quinn