I'm trying to loop over a number of log files and need to sort the entries (lines) across all of them.

This is what I'm doing:

import fileinput
import glob

f = glob.glob('logs/')
for line in sorted(fileinput.input(f), key=stringsplit(line)):
  print line

So, I'm opening all files and then want to use the stringsplit function (which extracts a date from the file entry) as sorting criteria.

Problem is, doing this gives me an error saying:

name 'line' is not defined

Question:
Is it not possible to pass the line being looped over as a parameter to the sort key function? How can this be done?

Thanks!

frequent

2 Answers

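Try key=lambda line: stringsplit(line), or simply key=stringsplit.

sorted() runs to completion before the for loop starts iterating, so the name line does not exist yet when the key=stringsplit(line) argument is evaluated.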

shx2

The key keyword must be a callable. It is called for every entry in the input sequence.

A lambda is an easy way to create such a callable:

sorted(..., key=lambda line: stringsplit(line))
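As a minimal sketch of how the key callable gets applied, assuming Python 3: the question doesn't show stringsplit(), so the version below is a hypothetical stand-in that turns a timestamp such as '14/Nov/2012:13:12:23' into the sortable form '2012-11-14:13:12:23', and the sample lines are made up.

```python
import re

# Month abbreviation -> month number, to avoid locale-dependent parsing.
MONTHS = {name: number for number, name in enumerate(
    'Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec'.split(), start=1)}
TIMESTAMP = re.compile(r'(\d{2})/([A-Za-z]{3})/(\d{4}):(\d{2}:\d{2}:\d{2})')

def stringsplit(line):
    """Hypothetical stand-in: return the line's timestamp as a sortable string."""
    day, month, year, clock = TIMESTAMP.search(line).groups()
    return '%s-%02d-%s:%s' % (year, MONTHS[month], day, clock)

# Made-up sample lines, deliberately out of order.
lines = [
    'GET /b 14/Nov/2012:13:12:23\n',
    'GET /a 02/Jan/2012:08:00:00\n',
    'GET /c 14/Nov/2012:09:30:00\n',
]

# sorted() calls the key once per line. Passing the function object itself
# is equivalent to key=lambda line: stringsplit(line).
ordered = sorted(lines, key=stringsplit)
```

Note that the key is given as a function, not as the result of calling it; sorted() supplies each line itself.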

I would be extremely wary of sorting the output of fileinput with many, large files though. sorted() must read all lines into memory to be able to sort them. If your files are many and / or large, you'll use up all memory, eventually leading to a MemoryError exception.

Use a different method to pre-sort your logs first. You can use the UNIX tool sort, or an external sorting technique instead.
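One way an external sort could look, as a rough sketch rather than a tuned implementation: sort fixed-size chunks in memory, spill each chunk to a temporary file, then merge the sorted chunks lazily. The function name external_sort and the chunk_size default are illustrative choices, and the key argument to heapq.merge() requires Python 3.5 or later.

```python
import heapq
import itertools
import tempfile

def external_sort(lines, key, chunk_size=100000):
    """Yield lines in sorted order, sorting at most chunk_size lines at a time.

    Assumes every line ends with a newline, so that lines written to the
    temporary files read back unchanged.
    """
    lines = iter(lines)
    chunks = []
    while True:
        # Sort one chunk in memory and spill it to its own temporary file.
        chunk = sorted(itertools.islice(lines, chunk_size), key=key)
        if not chunk:
            break
        tmp = tempfile.TemporaryFile('w+')
        tmp.writelines(chunk)
        tmp.seek(0)
        chunks.append(tmp)
    try:
        # Merge the sorted chunk files lazily, one line at a time.
        for line in heapq.merge(*chunks, key=key):
            yield line
    finally:
        for tmp in chunks:
            tmp.close()
```

It could be fed fileinput.input(...) directly. Each chunk keeps one file handle open, so a very small chunk_size on very large input may run into the open-file limit.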

If your input files are already sorted, you can merge them using the same key:

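import operator

def mergeiter(*iterables, **kwargs):
    """Given a set of sorted iterables, yield the next value in merged order"""
    iterables = [iter(it) for it in iterables]
    # Map index -> [current value, index, iterator]; assumes no iterable is empty.
    iterables = {i: [next(it), i, it] for i, it in enumerate(iterables)}
    if 'key' not in kwargs:
        key = operator.itemgetter(0)
    else:
        key = lambda item, key=kwargs['key']: key(item[0])

    while True:
        # Pick the entry whose current value sorts first.
        value, i, it = min(iterables.values(), key=key)
        yield value
        try:
            iterables[i][0] = next(it)
        except StopIteration:
            # This iterable is exhausted; stop once all of them are.
            del iterables[i]
            if not iterables:
                # return, not raise: re-raising StopIteration inside a
                # generator is a RuntimeError on Python 3.7+.
                return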

then pass in your open file objects:

import glob

files = [open(f) for f in glob.glob('logs/*')]
for line in mergeiter(*files, key=lambda line: stringsplit(line)):
    # lines are looped over in merged order.
    print(line)

This only works if each input file is already sorted by the value that stringsplit() returns, so make certain the key orders lines the same way the log files do.
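On Python 3.5 or later, the standard library's heapq.merge() accepts a key argument and may be able to stand in for a hand-written merge. A small sketch, where the sample lines are made up and timestamp() is a hypothetical key playing the role of stringsplit():

```python
import heapq

# Two inputs that are each already sorted; stand-ins for open log files.
log_a = ['2012-11-14:09:30:00 a\n', '2012-11-14:13:12:23 a\n']
log_b = ['2012-01-02:08:00:00 b\n', '2012-11-14:10:00:00 b\n']

def timestamp(line):
    """Hypothetical key: the leading timestamp field of each line."""
    return line.split()[0]

# heapq.merge() is lazy: it only holds one pending line per input.
merged = list(heapq.merge(log_a, log_b, key=timestamp))
```

As with any merge, each input has to be sorted by that same key already; heapq.merge() does not check this.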

Martijn Pieters
  • good tip. I was going to check how many lines my memory can swallow ... :-) – frequent Mar 08 '13 at 15:49
  • mh. I need to sort by date-time. Separate files are sorted by date-time, but I can only do a global sort to get items A-Z across all files, so merging sorted files won't help, will it? – frequent Mar 08 '13 at 15:55
  • @frequent: if your separate files are already sorted on date, then you can merge them. See the question I linked to, you could adapt the technique shown there to merge the files. – Martijn Pieters Mar 08 '13 at 15:56
  • @frequent: Updated my answer with an adaptation of my `mergeiter()` function. – Martijn Pieters Mar 08 '13 at 16:02
  • the lines are strings which include dates like '14/Nov/2012:13:12:23', stringsplit extracts this string and converts it to the sortable `2012-11-14:13:12:23`. Let's see if this works – frequent Mar 08 '13 at 16:06
  • @frequent: Sorry, the code was written off-the-cuff. I've moved the keyword argument to the correct location in the function definition. – Martijn Pieters Mar 08 '13 at 16:12
  • ok. sorry to bother again. Now I'm receiving: `mergeiter() got multiple values for keyword argument 'key'` – frequent Mar 08 '13 at 16:16
  • @frequent: Okay, a tested version, this time. :-) Sorry again! – Martijn Pieters Mar 08 '13 at 16:20
  • I just tried wrapping `key=lambda line: stringsplit(line)` in `(...)`, which will pass `key`... I assume. let's see :-) – frequent Mar 08 '13 at 16:27
  • @frequent: Apologies, noticed there was a `*` missing on the `mergeiter()` example call. Corrected that too. Multitasking and SO don't mix that well... :-) – Martijn Pieters Mar 08 '13 at 16:29
  • :-) I also added a `*` in `glob.glob('logs/*')]` which I forgot in my original code. Script is running, let's see what happens :-) – frequent Mar 08 '13 at 16:33