
I'm working on a Python script to parse Squid (http://www.squid-cache.org/) log files. While the logs are rotated every day to stop them getting too big, they still reach 40-90 MB by the end of each day.

Essentially what I'm doing is reading the file line by line, parsing out the data I need (IP, requested URL, time) and adding it to an SQLite database. However, this seems to be taking a very long time (it's been running for over 20 minutes now).
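To give an idea, this is roughly equivalent to what I have (the field positions assume Squid's native access.log format, and the paths are just examples):

```python
import sqlite3

db = sqlite3.connect('squid_logs.db')
db.execute("CREATE TABLE IF NOT EXISTS requests (ip TEXT, url TEXT, time TEXT)")

with open('/var/log/squid/access.log') as logfile:
    for line in logfile:
        fields = line.split()
        # In Squid's native format: field 0 is the timestamp,
        # field 2 is the client IP, field 6 is the requested URL.
        db.execute("INSERT INTO requests (ip, url, time) VALUES (?, ?, ?)",
                   (fields[2], fields[6], fields[0]))
db.commit()
```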

So obviously, re-reading the whole file each time isn't an option. What I would like to do is read the file once and then detect any new lines as they are written. Or, even better, at the start of the day the script would simply read the data in real time as it is added, so there would never be any long processing times.

How would I go about doing this?

Peter-W

2 Answers


One way to achieve this is by emulating `tail -f`. The script would constantly monitor the file and process each new line as it appears.

For a discussion and some recipes, see tail -f in python with no time.sleep
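Roughly, the recipe works like this (a minimal sketch; the log path is just an example):

```python
import time

def follow(thefile):
    """Yield new lines as they are appended to the file."""
    thefile.seek(0, 2)  # seek to the end of the file, once
    while True:
        line = thefile.readline()
        if not line:
            time.sleep(0.1)  # no new data yet; avoid spinning
            continue
        yield line

# Example usage (adjust the path for your setup):
with open('/var/log/squid/access.log') as logfile:
    for line in follow(logfile):
        print(line, end='')  # parse IP, URL and time here instead
```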

NPE
  • I was thinking of using tail but thought it might miss things if they are added too fast. Could not the same thing happen in that example if things are added more often than every 0.1 seconds? – Peter-W Nov 29 '11 at 12:44
  • @Peter-W: I don't think so: `time.sleep()` is only called when there are no more lines to read. – NPE Nov 29 '11 at 12:54
  • So if two lines are added while we are reading the current last line, would it not then read the last line on the next loop, thus missing the first of the two lines? Since `readline()` gets called every loop and is pointing at the last line in the file. – Peter-W Nov 29 '11 at 15:31
  • @Peter-W: I don't follow. In essence, the loop just calls `readline()` repeatedly, possibly with some intervening `sleep` calls to avoid spinning. I don't see how a line could possibly be skipped. For this, the file pointer would have to be explicitly moved, and there's no code to do that. – NPE Nov 29 '11 at 15:47
  • Maybe I'm understanding wrong, however the way I see it `readline()` will always point to the last line in the file (right?), so if two lines are added before `readline()` gets called a second time then `readline()` will point to the second of these lines? Like this: http://pastebin.com/CNus9iAF – Peter-W Nov 29 '11 at 15:56
  • @Peter-W: Please see my previous comment - there's nothing in the code that would cause Line 4 to be skipped. Why not experiment with running the code and seeing for yourself? P.S. Note that there's only one loop, not two, and that `thefile.seek` only gets called once. If that's not obvious, it might be worthwhile reading up on generators [PEP 255]. – NPE Nov 29 '11 at 15:59

One way to do this is to use file system monitoring with pyinotify (http://pyinotify.sourceforge.net/) and set a callback function to be executed whenever the log file's size changes.
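As a rough sketch (the path is an example; `my_init` is pyinotify's hook for per-instance setup):

```python
import pyinotify

LOG_PATH = '/var/log/squid/access.log'  # example path; adjust as needed

class LogHandler(pyinotify.ProcessEvent):
    def my_init(self):
        self.logfile = open(LOG_PATH)
        self.logfile.seek(0, 2)  # start from the current end of the file

    def process_IN_MODIFY(self, event):
        # The file grew: process whatever was appended since the last event.
        for line in self.logfile:
            print(line, end='')  # parse and insert into sqlite here

wm = pyinotify.WatchManager()
notifier = pyinotify.Notifier(wm, LogHandler())
wm.add_watch(LOG_PATH, pyinotify.IN_MODIFY)
notifier.loop()
```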

Another way to do it, without requiring external modules, is to persist somewhere (possibly in your SQLite database itself) the offset of the end of the last line read from the log file, which you get with `file.tell()`, and then read just the newly added lines from that offset onwards, which is done with a simple call to `file.seek(offset)` before looping through the lines.
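For instance (a sketch only; the state table and paths are placeholders):

```python
import os
import sqlite3

def read_new_lines(db, log_path):
    """Return the lines appended since the offset recorded in the database."""
    db.execute("CREATE TABLE IF NOT EXISTS state (path TEXT PRIMARY KEY, offset INTEGER)")
    row = db.execute("SELECT offset FROM state WHERE path = ?", (log_path,)).fetchone()
    offset = row[0] if row else 0
    if offset > os.path.getsize(log_path):
        offset = 0  # the log was rotated; start over from the top
    with open(log_path) as f:
        f.seek(offset)
        lines = f.readlines()
        new_offset = f.tell()
    db.execute("INSERT OR REPLACE INTO state (path, offset) VALUES (?, ?)",
               (log_path, new_offset))
    db.commit()
    return lines

db = sqlite3.connect('squid_logs.db')
for line in read_new_lines(db, '/var/log/squid/access.log'):
    pass  # parse IP, URL and time here, as in the question
```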

The main difference between keeping track of the offset and the `tail` emulation described in the other answer is that this one allows your script to be run multiple times, i.e. there is no need for it to run continuously, and it can recover in case of a crash.

jsbueno