7

I run performance tests against a few Java applications. The applications produce very large log files (7-10 GB) during a test, and I need to trim these logs down to a specific date and time range. Currently I use a Python script that parses the log timestamps into datetime objects and prints only the matching lines, but this solution is very slow: a 5 GB log takes about 25 minutes to process. Since the entries in the log file are sequential, I shouldn't need to read the whole file line by line. I thought about reading from the start and from the end of the file until the boundary conditions are matched, and then printing everything in between, but I don't know how to read a file backwards without loading it all into memory.

Can you suggest a suitable approach for this case?

here is part of python script:

from datetime import datetime

lfmt = '%Y-%m-%d %H:%M:%S'
# start and end are datetime objects bounding the range of interest
file = open(filename, 'rU')
normal_line = ''
for line in file:
    if line[0] == '[':
        ltimestamp = datetime.strptime(line[1:20], lfmt)

        if ltimestamp >= start and ltimestamp <= end:
            normal_line = 'True'
    else:
        normal_line = ''

    if normal_line:
        print line,
Nikolai Golub
  • Ignoring kev's joke, you should really look at your logfile rotate policy. It's bad practice to let any logfiles get this large. – TerryE Mar 11 '12 at 11:00
  • The `logrotate` utility is the way to go here - it exists to prevent this kind of thing from happening. – Daenyth Mar 11 '12 at 17:16
  • Hey, I updated my answer below to include code to do what you're after; it is extremely fast! It should require only very minor modification. Let me know how it goes / if it's any use, and please accept/upvote my answer if it is, cheers. – fraxel Mar 13 '12 at 10:57
  • OK, I want to compare different solutions, but I don't have enough time now. Thank you for your answer. – Nikolai Golub Mar 13 '12 at 13:15

4 Answers

5

As the data is sequential, if the region of interest is near the beginning of the file, then reading from the end of the file (to find the matching end point) is still a bad solution!

I've written some code that will quickly find the start and end points you require. This approach is called binary search and is similar to the classic children's "higher or lower" guessing game!

The script reads a trial line mid-way between lower_bound and upper_bound (initially the SOF and EOF), and checks the match criteria. If the sought line is earlier, it guesses again by reading a line half-way between the lower bound and the previous trial (if it's later, it splits between its guess and the upper bound). So you keep iterating between upper and lower bounds - this yields the fastest possible "on average" solution.

This should be a really quick solution (log to the base 2 of the number of lines!). For example, in the worst possible case (finding line 999 out of 1000 lines), binary search takes only about 10 line reads (finding a line in a billion-line file would take only about 30).

Assumptions for the below code:

  • Every line starts with time information.
  • The times are unique - If not, when a match is found you'll have to check backwards or forwards to include or exclude all entries with matching time, as appropriate (if required).
  • Amusingly, this is a recursive function, so (with Python's default recursion limit of 1000) the number of lines in your file is limited to 2**1000 (luckily this allows for quite a large file...)

Further:

  • This could be adapted to read in arbitrary blocks rather than by line, if preferred, as suggested by J.F. Sebastian.
  • In my original answer I suggested this approach using linecache.getline; while this is possible, it's inappropriate for large files as it reads the whole file into memory (thus file.seek() is superior). Thanks to TerryE and J.F. Sebastian for pointing that out.

import datetime

def match(line):
    lfmt = '%Y-%m-%d %H:%M:%S'
    if line[0] == '[':
        return datetime.datetime.strptime(line[1:20], lfmt)

def retrieve_test_line(position):
    file.seek(position,0)
    file.readline()  # avoids reading partial line, which will mess up match attempt
    new_position = file.tell() # gets start of line position
    return file.readline(), new_position

def check_lower_bound(position):
    file.seek(position,0)
    new_position = file.tell() # gets start of line position
    return file.readline(), new_position

def find_line(target, lower_bound, upper_bound):
    trial = int((lower_bound + upper_bound) /2)
    inspection_text, position = retrieve_test_line(trial)
    if position == upper_bound:
        text, position = check_lower_bound(lower_bound)
        if match(text) == target:
            return position
        return # no match for target within range
    matched_position = match(inspection_text)
    if matched_position == target:
        return position
    elif matched_position < target:
        return find_line(target, position, upper_bound)
    elif matched_position > target:
        return find_line(target, lower_bound, position)
    else:
        return # no match for target within range

lfmt = '%Y-%m-%d %H:%M:%S'
# start_target =  # first line you are trying to find:
start_target =  datetime.datetime.strptime("2012-02-01 13:10:00", lfmt)
# end_target =  # last line you are trying to find:
end_target =  datetime.datetime.strptime("2012-02-01 13:39:00", lfmt)
file = open("log_file.txt","r")
lower_bound = 0
file.seek(0,2) # find upper bound
upper_bound = file.tell()

sequence_start = find_line(start_target, lower_bound, upper_bound)

if sequence_start or sequence_start == 0: #allow for starting at zero - corner case
    sequence_end = find_line(end_target, sequence_start, upper_bound)
    if not sequence_end:
        print "start_target match: ", sequence_start
        print "end match is not present in the current file"
else:
    print "start match is not present in the current file"

if (sequence_start or sequence_start == 0) and sequence_end:
    print "start_target match: ", sequence_start
    print "end_target match: ", sequence_end
    print
    print start_target, 'target'
    file.seek(sequence_start,0)
    print file.readline()
    print end_target, 'target'
    file.seek(sequence_end,0)
    print file.readline()
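
A possible follow-up, not part of the answer code above: once sequence_start and sequence_end have been found, the matching region can be streamed out to a new file without loading it into memory. A minimal sketch (dump_range and the output file name are illustrative, not from the original code):

def dump_range(in_path, out_path, start_pos, end_pos, bufsize=1024 * 1024):
    """Copy bytes from start_pos through the end of the line starting at end_pos."""
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        src.seek(end_pos)
        src.readline()              # include the whole final matching line
        stop = src.tell()
        src.seek(start_pos)
        remaining = stop - start_pos
        while remaining > 0:
            block = src.read(min(bufsize, remaining))
            if not block:
                break
            dst.write(block)
            remaining -= len(block)

# e.g. dump_range("log_file.txt", "trimmed.log", sequence_start, sequence_end)
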
fraxel
  • fraxel is almost correct. You *can't* use linecache since any approach which supports accurate line-number offsets in a variable line length file must do a (lazy) full scan internally. Do a linear interpolation reading 1024 byte blocks and then scan each for the \n+DTS pattern to find the current time. As fraxel says, this is O(log N) on the file size. – TerryE Mar 11 '12 at 11:06
  • @TerryE Could you explain your "linear interpolation" suggestion a little, or provide a link? – urschrei Mar 11 '12 at 14:15
  • Yup its a simple iterative method which works well *if* the logging rates are uniform. If not then use a binary search. Wikipedia has articles on both. You are using the fact that the file is in ascending DTS to find the correct start and end bytes by probing. This will take ~66 block reads for an 8 GB log, say ~6 sec. Of course you will still need to copy from byte M to N and/or compress on the fly, e.g. `head -c -1234567890 somelog.log | head -c 999999 | gzip -c3 > someextract.log.gz` use a modest compression factor if you do compress. You're really just saving I/O. – TerryE Mar 11 '12 at 14:28
  • [`linecache` reads the whole file into memory](http://hg.python.org/cpython/file/2.7/Lib/linecache.py#l127) so it is not suitable to do `O(log N)` searches in 5GB files (it would make sense on smaller files if searching criteria were very computationally expensive i.e., when we can discard a cost of reading a file) – jfs Mar 11 '12 at 15:26
  • Ok, fine points, due to the giant file size, better to use the same approach, but with `file.seek` [link](http://docs.python.org/library/stdtypes.html) to move around in the file (instead of `linecache`), should be pretty nippy. – fraxel Mar 11 '12 at 16:46
  • `file.seek()` will also allow you to find the end of the file very quickly by scanning through the file from the end - pass `2` as the second argument – fraxel Mar 11 '12 at 17:14
  • @TerryE and J.F. Sebastian - Thanks for your comments, I've adjusted my answer accordingly. – fraxel Mar 12 '12 at 14:10
  • I've scanned through your code, but I can't give it another +1. Out of interest, how long does it take to run? – TerryE Mar 12 '12 at 14:33
  • @TerryE, cheers, well the above code finds the line match in a worst case scenario in a million line file in 0.002 seconds, while a straight read through the file takes 58 seconds. powerful stuff! – fraxel Mar 12 '12 at 17:33
  • Not quite, 2 mSec equates to a few dozen file system ops to cached inodes / buffers. The whole thing about logN vs N is that you *really* start to notice the difference as N gets large ;-) Enjoy – TerryE Mar 12 '12 at 22:48
  • At first sight this code requires that the start and end targets be present in the file. I tried to run your script and it returns strings only in this case, so I think it should be modified a little. – Nikolai Golub Mar 13 '12 at 16:36
  • @Nikolay Golub - Would you like me to edit it to remove all the printing stuff at the end, and be useable as an external module? – fraxel Mar 13 '12 at 17:08
  • @fraxel, yes, of course. Also, in the meantime I've found a bug in your code: `match(inspection_text)` moves the pointer through the file, so find_line works unreliably if there are unparsable strings near the bounds. It's better to call match before checking the conditions. I've updated your answer with a fix. – Nikolai Golub Dec 25 '13 at 16:19
2

A 5 GB log takes about 25 minutes to process

That is ~3 MB/s. Even a sequential O(n) scan in Python can do much better (~500 MB/s for wc-l.py), i.e., the performance should be limited only by I/O.
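
For a rough sense of what a plain sequential pass costs (this is only a sketch in the same spirit, not the wc-l.py script mentioned above): counting newlines in large binary blocks keeps the loop I/O-bound rather than parsing-bound.

import sys

def count_lines(path, bufsize=1024 * 1024):
    """Count newlines by reading the file in large binary blocks."""
    lines = 0
    with open(path, 'rb') as f:
        while True:
            block = f.read(bufsize)
            if not block:
                break
            lines += block.count(b'\n')
    return lines

if __name__ == '__main__':
    print(count_lines(sys.argv[1]))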

To perform a binary search on the file you could adapt FileSearcher, which uses fixed-size records, to work on lines instead, using an approach similar to the tail -n implementation in Python (it is O(n) to scan for '\n').
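
Reading a file backwards without holding it in memory (the part the question asks about) boils down to seeking backwards in fixed-size blocks. A minimal sketch of that idea (not the linked tail implementation itself):

import os

def read_blocks_backwards(path, bufsize=1024 * 1024):
    """Yield successive blocks of the file, starting from the end."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        while pos > 0:
            read_size = min(bufsize, pos)
            pos -= read_size
            f.seek(pos)
            yield f.read(read_size)

# e.g. stop iterating as soon as a block contains a timestamp
# earlier than the end of the requested range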

To avoid O(n) (if the date range selects only a small portion of the log) you could use an approximate search that works on large fixed-size chunks and allows some records to be missed because they lie on a chunk boundary, e.g., use the unmodified FileSearcher with record_size=1MB and a custom Query class:

class Query(object):

    def __init__(self, query):
        self.query = query # e.g., '2012-01-01'

    def __lt__(self, chunk):
        # assume line starts with a date; find the start of line
        i = chunk.find('\n')
        # assert '\n' in chunk and len(chunk) > (len(self.query) + i)
        # e.g., '2012-01-01' < '2012-03-01'
        return self.query < chunk[i+1:i+1+len(self.query)]

To take into account that the date range can span multiple chunks, you could modify FileSearcher.__getitem__ to return (filepos, chunk) and search twice (bisect_left(), bisect_right()) to find approximate filepos_mindate, filepos_maxdate. After that you could perform a linear search (e.g., using the tail -n approach) around the given file positions to find the exact first and last log records.
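
A self-contained sketch of this chunk-level idea (it does not use the actual FileSearcher class; instead __getitem__ returns the timestamp prefix of the first complete line in each 1 MB chunk, so plain bisect_left()/bisect_right() comparisons work). The chunk size, file name, and the assumption that every line starts with a 'YYYY-mm-dd HH:MM:SS' timestamp are illustrative:

import bisect
import os

CHUNK = 1024 * 1024  # 1 MB fixed-size records, as suggested above

class ChunkTimestamps(object):
    """Read-only sequence: item i is the timestamp of the first full line in chunk i."""
    def __init__(self, f):
        self.f = f
        self.size = os.fstat(f.fileno()).st_size

    def __len__(self):
        return (self.size + CHUNK - 1) // CHUNK

    def __getitem__(self, i):
        self.f.seek(i * CHUNK)
        if i > 0:
            self.f.readline()   # skip the partial line at the chunk start
        line = self.f.readline()
        # a chunk with no complete line sorts after any real date
        return line[:19] if line else b'9999-99-99 99:99:99'

with open('log_file.txt', 'rb') as f:
    chunks = ChunkTimestamps(f)
    lo = bisect.bisect_left(chunks, b'2012-02-01 13:10:00')
    hi = bisect.bisect_right(chunks, b'2012-02-01 13:39:00')
    # the date range lies roughly within chunks [lo - 1, hi]; refine the exact
    # first and last records with a short linear scan around these byte offsets
    start_hint = max(lo - 1, 0) * CHUNK
    end_hint = min(hi + 1, len(chunks)) * CHUNK
    print('approximate byte range: %d..%d' % (start_hint, end_hint))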

jfs
1

7 to 10 GB is a large amount of data. If I had to analyse that kind of data, I would either make the applications log to a database or upload the log files to a database. Then there is plenty of analysis you can do efficiently on the database. If you are using a standard logging tool like Log4J, logging to a database should be quite simple. Just suggesting an alternate solution.
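
To illustrate the kind of range query this enables, a hypothetical sketch using sqlite3 and a made-up log_entries table (not any particular Log4J appender schema):

import sqlite3

conn = sqlite3.connect('logs.db')
conn.execute("""CREATE TABLE IF NOT EXISTS log_entries (
                    ts TEXT,      -- 'YYYY-MM-DD HH:MM:SS'
                    message TEXT)""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_ts ON log_entries (ts)")

# once the application (or an import script) has filled the table,
# trimming to a time window is a single indexed range query
rows = conn.execute(
    "SELECT ts, message FROM log_entries WHERE ts BETWEEN ? AND ?",
    ('2012-02-01 13:10:00', '2012-02-01 13:39:00'))
for ts, message in rows:
    print('%s %s' % (ts, message))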

For more on database logging you can refer to this post:

A good database log appender for Java?

Shaunak
0

If you have access to a Windows environment, you can use MS LogParser to read the files and gather any information you might need. It uses a SQL syntax, which makes using this tool a joy. It also supports a large number of input types.

As an additional bonus, it also supports the iCheckPoint switch, which causes a checkpoint file to be created when working with sequential log files. For more info, look in the Log Parser help under "Advanced Features -> Parsing Input Incrementally".

BrotherOdin