
I'm interested in building a Python script that can give me stats on how many lines per interval (maybe per minute) are being written to a file. I have files that are written as data comes in: a new line for each user that passes data through the external program. Knowing how many lines per X gives me a metric I can use for future expansion planning. The output file(s) consist of lines that are all roughly the same length, each ending with a line return.

I was thinking of writing a script that measures the length of the file at one point in time, measures it again at a later point, and subtracts the two to get my result. However, I don't know if this is ideal, since it takes time to measure the length of the file, and that may skew my results. Does anyone have any other ideas?
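As a rough sketch of that size-delta idea (the path and average line length below are placeholders, and it assumes line lengths really are roughly constant):

```python
import os

def estimate_line_count(path, avg_line_len):
    # os.path.getsize is a single stat() call, so it is effectively free
    # compared with reading the whole file; accuracy depends entirely on
    # how constant the average line length actually is.
    return os.path.getsize(path) // avg_line_len
```

Differencing two such estimates taken a minute apart would give an approximate lines-per-minute figure without ever reading the file contents.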

Based on what people are saying, I threw this together as a start:

import time
from daemon import runner

inputfilename = "/home/data/testdata.txt"

class App():
    def __init__(self):
        self.stdin_path = '/dev/null'
        self.stdout_path = '/dev/tty'
        self.stderr_path = '/dev/tty'
        self.pidfile_path = '/tmp/count.pid'
        self.pidfile_timeout = 5

    def run(self):
        while True:
            count = 0
            # count newlines by reading the file in large chunks
            filein = open(inputfilename, 'rb')
            while True:
                buffer = filein.read(8192 * 1024)
                if not buffer:
                    break
                count += buffer.count('\n')
            filein.close()
            print count
            # set the sleep time for repeated action here:
            time.sleep(60)

app = App()
daemon_runner = runner.DaemonRunner(app)
daemon_runner.do_action()

It does the job of getting the count every 60 seconds and printing it to the screen; my next step is the math, I guess.

One more edit: I've added output of the count in one-minute intervals:

import time
from daemon import runner

inputfilename = "/home/data/testdata.txt"

class App():
    def __init__(self):
        self.stdin_path = '/dev/null'
        self.stdout_path = '/dev/tty'
        self.stderr_path = '/dev/tty'
        self.pidfile_path = '/tmp/twitter_counter.pid'
        self.pidfile_timeout = 5

    def run(self):
        previous_count = 0
        while True:
            count = 0
            # count newlines by reading the file in large chunks
            filein = open(inputfilename, 'rb')
            while True:
                buffer = filein.read(8192 * 1024)
                if not buffer:
                    break
                count += buffer.count('\n')
            filein.close()

            # lines added since the previous pass
            print count - previous_count

            previous_count = count
            # set the sleep time for repeated action here:
            time.sleep(60)

app = App()
daemon_runner = runner.DaemonRunner(app)
daemon_runner.do_action()
secumind
  • Are the lines the same or very similar sizes? While counting the lines in a very large file takes a long time, finding the size of a file is much faster. – David Robinson May 17 '12 at 19:08
  • Are you able to edit the program that is writing the data? You could modify it so it periodically reports how many lines it has written. – Kevin May 17 '12 at 19:11
  • The lines are somewhat similar (name, address, etc.), but the answers to some survey questions are longer if the person writes more. However, I agree that finding an average line size and doing some division on the file size might be the way to go. I don't have the ability to modify the input program, though I can talk to the person who wrote it and ask. – secumind May 17 '12 at 19:32
  • Whatever solution you end up with, I suggest timing against specialized software (e.g. wc). – TryPyPy May 17 '12 at 19:37

1 Answer


To comment on your idea (which seems pretty sound to me): how accurate do you need the measurement to be?

I'd suggest measuring the measurement time first. Then, given the relative accuracy you want to achieve, you can calculate the time interval between consecutive measurements: if a measurement takes t milliseconds and you want 1% accuracy, don't measure more often than once every 100t ms.

Keep in mind, though, that the measurement time will grow as the file grows.
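Expressed as a trivial helper (the function name is mine, purely for illustration):

```python
def min_interval(measure_seconds, relative_accuracy):
    # If one measurement costs t seconds and you want the measurement
    # overhead to stay within a fraction p of the interval, sample no
    # more often than every t / p seconds (1% accuracy -> p = 0.01).
    return measure_seconds / relative_accuracy
```

For example, a measurement that takes 0.1 s at 1% accuracy means sampling no more than once every 10 s.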

Hint on how to count the lines in a file: see "is there a built-in python analog to unix 'wc' for sniffing a file?"

Hint on how to measure time: time module.
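Putting both hints together, a sketch (Python 3 here, hence the bytes literal and `time.perf_counter`; the chunked counter mirrors the one in the question, and the path is whatever file you are watching):

```python
import time

def count_lines(path, bufsize=8192 * 1024):
    # Count newlines by reading the file in large binary chunks,
    # the same approach as the chunked reader in the question.
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(bufsize)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count

def timed_count(path):
    # Return (line_count, seconds_spent) so the caller can pick a
    # sampling interval that keeps measurement overhead acceptable.
    start = time.perf_counter()
    n = count_lines(path)
    return n, time.perf_counter() - start
```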

P.S. I just tried timing the line counter on a 245 MB file. The first run took about 10 seconds (I didn't time that one precisely), but after that it was always below 1 s. Probably some caching is going on there, but I'm not sure.

Lev Levitsky