4

I'm writing a program to parse through some log files. If an error code is in the line, I need to print the previous 25 lines for analysis. I'd like to be able to repeat this with more or fewer lines depending on the individual error code (e.g. 15 or 35 lines instead of 25).

with open(file, 'r') as input:
    for line in input:
        if "error code" in line:
            # print previous 25 lines
I know the equivalent command in Bash for what I need is grep "error code" -B 25 Filename | wc -l. I'm still new to Python and programming in general. I know I'm going to need a for loop, and I've tried using the range function to do this, but I haven't had much luck because I don't know how to apply range to files.

pjano1

2 Answers

7

This is a perfect use case for a length limited collections.deque:

from collections import deque

line_history = deque(maxlen=25)
with open(file) as input:
    for line in input:
        if "error code" in line: 
            print(*line_history, line, sep='')
            # Clear history so if two errors seen in close proximity, we don't
            # echo some lines twice
            line_history.clear()
        else:
            # When deque reaches 25 lines, will automatically evict oldest
            line_history.append(line)

Complete explanation of why I chose this approach (skip if you don't really care):

This isn't solvable in a good/safe way using for/range, because indexing only makes sense if you load the whole file into memory; the file on disk has no idea where lines begin and end, so you can't just ask for "line #357 of the file" without reading it from the beginning to find lines 1 through 356. You'd either end up repeatedly rereading the file, or slurping the whole file into an in-memory sequence (e.g. list/tuple) to have indexing make sense.

For a log file, you have to assume it could be quite large (I regularly deal with multi-gigabyte log files), to the point where loading it into memory would exhaust main memory, so slurping is a bad idea, and rereading the file from scratch each time you hit an error is almost as bad (it's slow, but it's reliably slow, I guess?). The deque-based approach means your peak memory usage is based on the 26 longest lines in the file (25 lines of history plus the current line), rather than the total file size.
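A minimal standalone sketch of the eviction behavior (using a small `maxlen` so it's easy to see; not part of the solution above):

```python
from collections import deque

history = deque(maxlen=3)  # small limit so the eviction is visible
for n in range(6):
    history.append(n)      # once full, each append evicts the oldest item

# Only the 3 most recent items survive; memory use is bounded by maxlen.
print(list(history))  # → [3, 4, 5]
```

This is why the main solution never holds more than 25 lines of history no matter how large the log file is.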

A naïve solution with nothing but built-ins could be as simple as:

with open(file) as input:
    lines = tuple(input)  # Slurps all lines from file
for i, line in enumerate(lines):
    if "error code" in line:
        print(*lines[max(i-25, 0):i], line, sep='')

but like I said, this requires enough memory to hold your entire log file in memory at once, which is a bad thing to count on. It also repeats lines when two errors occur in close proximity, because unlike deque, you don't get an easy way to empty your recent memory; you'd have to manually track the index of the last print to restrict your slice.
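For completeness, a sketch of that manual index tracking (variable names are my own; the inline sample data just stands in for a real file):

```python
# Stand-in for lines slurped from a file; newlines kept as tuple(input) would.
lines = ["ok\n", "ok\n", "error code 1\n", "ok\n", "error code 2\n"]

last_printed = 0  # index just past the last line we already echoed
for i, line in enumerate(lines):
    if "error code" in line:
        # Never reach back past what we've already printed.
        start = max(i - 25, last_printed)
        print(*lines[start:i], line, sep='', end='')
        last_printed = i + 1
```

It works, but you're reimplementing by hand what `deque.clear()` gives you for free.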

Note that even then, I didn't use range; range is a crutch a lot of people coming from C backgrounds rely on, but it's usually the wrong way to solve a problem in Python. In cases where an index is needed (it usually isn't), you usually need the value too, so enumerate based solutions are superior; most of the time, you don't need an index at all, so direct iteration (or paired iteration with zip or the like) is the correct solution.
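A quick illustration of the point: when you genuinely need an index, `enumerate` gives you the index and the value together, without the C-style `range(len(...))` dance:

```python
words = ["spam", "ham", "eggs"]

by_range = [(i, words[i]) for i in range(len(words))]   # C-style indexing
by_enumerate = [(i, w) for i, w in enumerate(words)]    # idiomatic Python

print(by_range == by_enumerate)  # → True
```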

ShadowRanger
  • can we use `queue` here? – Van Peer Aug 24 '18 at 18:22
  • 1
    @VanPeer: `queue.Queue` is intended for hand-off between threads/processes. When you length limit a `queue.Queue`, it *blocks* when the limit is hit, it doesn't silently discard the oldest value; in this case, silently discarding the oldest value is a feature we want. `queue.Queue` also adds a ton of overhead that gains you nothing here; it's actually built on a `collections.deque` under the hood, but where `collections.deque` is implemented in C and lock-free, `queue.Queue` is implemented in Python, using fairly complex synchronization code. – ShadowRanger Aug 24 '18 at 18:25
  • Note: If you *want* to print lines twice for errors in close proximity, you'd just delete the `line_history.clear()` and `else:` lines, then dedent `line_history.append(line)` (so it's executed unconditionally). I went with clearing because I was following a design closer to the behavior of `fgrep -B25 'error code'` which this code roughly replicates. – ShadowRanger Aug 24 '18 at 18:29
  • thanks for the explanation! I did give it a try using `queue`. We'd have to write separate logic to evict items when it reaches capacity. – Van Peer Aug 24 '18 at 18:49
  • @ShadowRanger, I know this is late, but while using your solution I was trying to find a way to remove the `newline` when using `print(*line_history, line, sep='')`. It would also be great if you could explain the significance of the star in `*line_history`; I tried to look it up but, being a newbie learner, couldn't work it out. – user2023 Feb 21 '20 at 06:00
  • @kulfi: [What does ** (double star/asterisk) and * (star/asterisk) do for parameters?](https://stackoverflow.com/q/36901/364696) – ShadowRanger Feb 21 '20 at 11:58
  • @ShadowRanger, many thanks for pointing and sharing the link. – user2023 Feb 21 '20 at 16:10
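A minimal sketch contrasting the two behaviors discussed in the comments above: a bounded `queue.Queue` refuses new items when full, while a `maxlen` deque silently evicts the oldest:

```python
from collections import deque
from queue import Queue, Full

q = Queue(maxsize=2)
q.put(1)
q.put(2)
try:
    q.put(3, block=False)  # a full Queue blocks or raises; it never evicts
except Full:
    print("queue.Queue is full")

d = deque([1, 2], maxlen=2)
d.append(3)            # silently evicts 1
print(list(d))  # → [2, 3]
```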
1

Here is a basic approach using a for loop and the range function, without any special libraries:

N = 25
with open(file, 'r') as f:
    lines = f.read().splitlines()  # reads the whole file into memory
    for i, line in enumerate(lines):
        if "error code" in line:
            j = i - N if i > N else 0
            for k in range(j, i):
                print(lines[k])

The above prints the previous 25 lines, or starts from the first line if fewer than 25 lines precede the match.

Also, it is better to avoid using input as a variable name, since it shadows Python's built-in input() function.
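A quick sketch of why that shadowing matters (a standalone illustration, not part of the solution):

```python
# Rebinding the name of a built-in hides the original function.
input = "some string"        # shadows the built-in input() function
try:
    input("prompt: ")        # 'input' is now a str, not callable
except TypeError as e:
    print("shadowed built-in:", e)
del input                    # removes the shadow; the built-in is visible again
```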

rnso
  • While this works for small log files, it has a peak memory requirement proportional to two times the total size of the log file (the line `lines = f.read().splitlines()` must, however briefly, hold the entire file in memory as a `str`, as well as a `list` containing a `str` for every line in the file). For a 1 GB log file, you'd better have 2.7-12 GB of RAM (on top of whatever your OS, the Python interpreter, and all your other programs are using, exact amount depending on whether the text is ASCII, latin-1, BMP or non-BMP) available to hold it all, or you'll be stuck in swap thrashing hell. – ShadowRanger Aug 24 '18 at 18:59
  • That is a very good point, but I thought a beginner would be more interested in simple code than an industry-standard approach. – rnso Aug 24 '18 at 19:04