I have lines in a log file, appended chronologically. For example, it could be data from the last 30 days, starting 30 days ago, then 29 days ago, then 28 days ago, etc.

I want to read the file in normal chronological order, but starting from a certain point (e.g., starting 7 days ago: read the data from 7 days ago, then from 6 days ago, then from 5 days ago, etc.).

One method is simply to read the file from the beginning. For speed reasons, however, I need to:
- seek backward from the end of the file, in exponentially growing steps, to find the right point to start at
- then, once I have found that point, read lines one by one, in forward order

I'm having trouble getting this to work. I started by modifying the answer here: Most efficient way to search the last x lines of a file in python

Can someone help, or provide guidance on a better way to do this?

Marvin K
  • I would consider splitting the log file into multiple files--each covering an appropriate duration to make seeking from the beginning feasible. (That is, if you have the option) – Joel Cornett Mar 25 '12 at 15:57
  • I don't think there's much point in reading backward *exponentially*, given that your plan is then to read the *entire* file from that point forward. Reading backward exponentially, plus binary search once you've passed the point you want, would help you find the first needed line in O(log N) time, but that's just pointless complexity for you, since it will take you O(N) time to read the lines from that point forward. – ruakh Mar 25 '12 at 15:58

2 Answers

If speed is a concern, that probably means you are doing it many times, or have to do it on-the-fly. Thus, you could build an index file showing the position you have to seek to for each day, something like:

Day 1: 0
Day 2: 1048576
Day 3: 2097152
Day 4: 6291456
....

This would allow fast lookup of any day once the index is built.

The algorithm for updating this index would be to start at the position of the last known day, read forward, and each time you reach a new day add it to the index.
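
The update step described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: every log line is assumed to begin with an ISO date (`YYYY-MM-DD`), the index is kept as a plain dict from day to byte offset, and the names `update_index` and `read_from_day` are made up for this example.

```python
def update_index(log_path, index):
    """Extend `index` (day -> byte offset of that day's first line) in place."""
    # Resume from the offset of the last known day instead of rescanning the file.
    start = max(index.values(), default=0)
    with open(log_path, "rb") as f:
        f.seek(start)
        while True:
            pos = f.tell()          # offset of the line about to be read
            line = f.readline()
            if not line:
                break
            day = line[:10].decode("ascii")  # assumed "YYYY-MM-DD" prefix
            if day not in index:
                index[day] = pos    # first line seen for this day
    return index


def read_from_day(log_path, index, day):
    """Yield lines starting at the first indexed day that is >= `day`."""
    # ISO dates compare correctly as strings, so min() picks the earliest match.
    candidates = [d for d in index if d >= day]
    if not candidates:
        return
    with open(log_path, "rb") as f:
        f.seek(index[min(candidates)])
        for line in f:
            yield line.decode("utf-8")
```

The dict would still need to be saved somewhere between runs (for example as a small text file like the one shown above); that part is left out here.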

David Robinson

As the lines are in chronological order, you can do a half-interval (binary) search to get to the start day of interest very quickly (on the order of log N reads), and then read forward from there. For example, if the log file had a billion lines, it would take about 30 reads to find the start day of interest...
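
A minimal sketch of that search, bisecting on byte offsets rather than line numbers (line numbers are not known without reading the file). It assumes, which the question does not state, that each line starts with an ISO date (`YYYY-MM-DD`) and that the file is opened in binary mode; `next_line_start`, `find_start` and `read_from` are hypothetical names.

```python
import os


def next_line_start(f, pos):
    """Return the offset of the first line that begins at or after byte `pos`."""
    if pos == 0:
        return 0
    f.seek(pos - 1)
    f.readline()  # consume the rest of the line containing byte pos - 1
    return f.tell()


def find_start(f, target):
    """Return the offset of the first line whose date prefix is >= `target` (bytes)."""
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(next_line_start(f, mid))
        line = f.readline()
        # End of file, or a line at/after the target day: the answer is at or before mid.
        if not line or line[:10] >= target:
            hi = mid
        else:
            lo = mid + 1
    return next_line_start(f, lo)


def read_from(path, target):
    """Yield decoded lines from the first line dated >= `target` to the end."""
    with open(path, "rb") as f:
        f.seek(find_start(f, target))
        for line in f:
            yield line.decode("utf-8")
```

For example, `for line in read_from("app.log", b"2012-03-18"): ...` would start at the first line dated 18 March 2012 or later. Because the bisection runs over byte offsets, the number of probes grows with the logarithm of the file size in bytes, which is slightly more than log N for N lines but still small.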

fraxel