
I'm trying to switch over to using Python exclusively. Something that I have used pretty extensively in C# is LINQ. In this exercise the goal is to build a collection of key-value pairs, where each key is a month and each value is the number of messages logged in that month. How can I do something like this in Python, or is there a better way to do it?

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class MainClass
{
    public static void Main (string[] args)
    {
        string[] months = { "jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec" };

        var log = LineReader ();
        Dictionary<string, int> cumulativeMonths = new Dictionary<string, int> ();

        months.ToList ()
            .ForEach (f => {
                cumulativeMonths.Add(f, log.GroupBy(g => g.Split(' ').First().ToLower())
                    .Where(w => w.Key == f).ToList().Count());

            });                                         
    }
    public static IEnumerable<string> LineReader()
    {
        Console.WriteLine ("Hello World!");
        using (StreamReader sr = new StreamReader (File.OpenRead ("/var/log/messages"))) {

            while (!sr.EndOfStream) {

                yield return sr.ReadLine ();
            }
        }
    }
}

Test Input:

Feb 18 02:51:36 laptop rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="2952" x-info="http://www.rsyslog.com"] start
Feb 18 02:51:36 laptop kernel: Adaptec aacraid driver 1.2-0[30300]-ms
Feb 18 02:51:36 laptop kernel: megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
Feb 18 02:51:36 laptop kernel: megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
Feb 18 02:51:36 laptop kernel: megasas: 06.805.06.00-rc1 Thu. Sep. 4 17:00:00 PDT 2014
Feb 18 02:51:36 laptop kernel: qla2xxx [0000:00:00.0]-0005: : QLogic Fibre Channel HBA Driver: 8.07.00.16-k.
Feb 18 02:51:36 laptop kernel: Emulex LightPulse Fibre Channel SCSI driver 10.4.8000.0.
Feb 18 02:51:36 laptop kernel: Copyright(c) 2004-2014 Emulex.  All rights reserved.
Feb 18 02:51:36 laptop kernel: aic94xx: Adaptec aic94xx SAS/SATA driver version 1.0.3 loaded
Feb 18 02:51:36 laptop kernel: ACPI: bus type USB registered

The expected output is a dictionary, e.g. {Jan: 64562, Feb: 38762, ...}

4 Answers


This can be done more simply than your version, and it is very easy in Python:

with open('/var/log/messages', 'r') as f:
    cumulative_months = {}
    for line in f:
        key = line.split()[0].lower()
        cumulative_months[key] = cumulative_months.get(key, 0) + 1

with is similar to C#'s using: it closes the file when the block exits. A Python file object can be used as an iterator; it reads and returns one line at a time until it hits EOF. (Under the hood it reads in buffered chunks, so slightly more than one line may be held in memory; see the documentation.)
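
As a quick check of both points, here is a minimal sketch that uses a temporary file instead of /var/log/messages. The sample lines and the names first_words and file_closed are made up for illustration.

```python
import tempfile

# Hypothetical sample data standing in for a real log file.
sample = "Feb 18 02:51:36 laptop kernel: one\nFeb 18 02:51:37 laptop kernel: two\n"

with tempfile.TemporaryFile('w+') as f:
    f.write(sample)
    f.seek(0)  # rewind; only needed because we just wrote through the same handle
    # Iterating over the file object yields one line per step.
    first_words = [line.split()[0] for line in f]

# Leaving the with block closed the file, much like the end of a C# using block.
file_closed = f.closed
```

After this runs, file_closed should be True and first_words should hold the month token of each line.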

Alternatively, as noted by m.wasowski, you can use the collections.Counter class for this type of task to make things even easier and faster.

bj0
  • Nice, I'll try this. The only problem I can think of is that I need to read one line at a time, because readlines() would read the whole file, which could be a few gigs, into memory – Paige Thompson May 08 '15 at 23:18
  • @PadraicCunningham interesting, I hadn't used it like that before, does it read the whole file or just a bit at a time? – bj0 May 08 '15 at 23:20
  • @PaigeThompson, you never need readlines in Python unless you actually want a list. You can iterate over the file object to get one line at a time; a file object in Python returns its own iterator – Padraic Cunningham May 08 '15 at 23:20
  • To clarify, it iterates over one line at a time, but [there may be more than one line stored in the buffer](http://stackoverflow.com/questions/29133556/does-for-line-in-file-read-entire-file). – TigerhawkT3 May 08 '15 at 23:24

You can use a collections.Counter dict:

from collections import Counter
with open('yourfile') as f:
    count = Counter(line.split()[0] for line in f)

Sorry for any mistakes, it is written from mobile :)
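
One caveat with the snippet above: line.split()[0] raises IndexError on a blank line, and the keys keep their original case. A minimal sketch of a more defensive variant follows, assuming you want lowercased keys as in the question; the function name count_months and the sample lines are hypothetical.

```python
from collections import Counter

def count_months(lines):
    """Count lines by their first whitespace-separated token, lowercased."""
    # split() returns [] for a blank line, so those are filtered out below.
    tokens = (line.split(None, 1) for line in lines)
    return Counter(parts[0].lower() for parts in tokens if parts)

# Hypothetical sample lines; with a real log you would pass the open file object.
sample = [
    "Feb 18 02:51:36 laptop kernel: ACPI: bus type USB registered\n",
    "\n",
    "feb 18 02:51:37 laptop kernel: something else\n",
    "Jan 02 10:00:00 laptop kernel: older message\n",
]
counts = count_months(sample)
```

Because the function accepts any iterable of lines, it can be called as count_months(f) inside a with open(...) as f block and still reads one line at a time.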

Padraic Cunningham
m.wasowski

Yeah, that is essentially what I came up with myself. I guess I was wondering if there was a more elegant (one-liner-ish) approach to solving the problem:

fh = open("/var/log/messages", encoding = "ISO-8859-1")
fh.seek(0)
febMessages = [x for x in fh if x.split(' ')[0].lower() == 'feb']
len(febMessages)

Based on your own answer, you can simply use sum and a generator expression if all you want is the count for February:

with open("/var/log/messages") as f:
    febMessages = sum(x.split(None, 1)[0].lower() == 'feb' for x in f)

x.split(None, 1) splits once on whitespace, and [0] takes the first element. The generator expression produces one value at a time, so we don't build a full list of elements only to throw it away, and sum adds up the number of times x.split(None, 1)[0].lower() == 'feb' evaluates to True. You also don't need fh.seek(0); the file pointer is already at the start of the file when you open it.
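
To see why summing comparisons works, here is a small sketch on in-memory lines. The first line is taken from the test input in the question; the "Jan" line is a made-up example.

```python
line = "Feb 18 02:51:36 laptop kernel: ACPI: bus type USB registered"

# maxsplit=1 stops after the first split, leaving the rest of the line intact.
parts = line.split(None, 1)

# bool is a subclass of int, so True counts as 1 and False as 0 in sum().
lines = [line, "Jan 02 10:00:00 laptop kernel: hypothetical older message", line]
feb_count = sum(x.split(None, 1)[0].lower() == 'feb' for x in lines)
```

Here parts is a two-element list (the month token and the remainder of the line), and feb_count is the number of lines that start with "Feb".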

If you actually want a count of all months stored then use a Counter dict.
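
If the result should mirror the C# dictionary from the question, with one entry for every month even when its count is zero, one possible sketch is below. The names MONTHS and cumulative_months and the sample lines are assumptions for illustration, not part of any answer above.

```python
from collections import Counter

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def cumulative_months(lines):
    """Return a dict with one entry per month, zero for months with no lines."""
    # Skip blank lines so that indexing [0] cannot fail.
    counts = Counter(line.split(None, 1)[0].lower() for line in lines if line.strip())
    # A Counter returns 0 for missing keys, so absent months are filled in as 0.
    return {month: counts[month] for month in MONTHS}

# Hypothetical sample lines standing in for the log file.
sample = ["Feb 18 02:51:36 laptop kernel: a\n", "Feb 18 02:51:37 laptop kernel: b\n"]
result = cumulative_months(sample)
```

This makes a single pass over the lines, whereas the C# version regroups the whole log once per month.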

Padraic Cunningham