I have a Python subprocess call that runs an executable and pipes its output back to me through the subprocess's stdout.
In cases where the stdout data is relatively small (~2k lines), reading line by line and reading the whole stream in one chunk (stdout.read()) perform comparably, with stdout.read() being slightly faster.
Once the data gets larger (say 30k+ lines), reading line by line is significantly faster.
This is my comparison script:
import subprocess
import time

# executable is the command under test, defined elsewhere.
# Line-by-line read: iterate until readline() returns empty bytes (EOF).
proc = subprocess.Popen(executable, stdout=subprocess.PIPE)
tmp = []
tic = time.clock()
for line in iter(proc.stdout.readline, b''):
    tmp.append(line)
print("line by line = %.2f" % (time.clock() - tic))

# Slurp: read the entire stream with a single read() call.
proc = subprocess.Popen(executable, stdout=subprocess.PIPE)
tic = time.clock()
fullFile = proc.stdout.read()
print("slurped = %.2f" % (time.clock() - tic))
And these are the results for a read of ~96k lines (about 50 MB on disk):
line by line = 5.48
slurped = 153.03
I am unclear on why the performance difference is so extreme. My expectation was that the read() version should be faster than appending the results line by line. Of course, I was expecting line by line to come out ahead in a practical case where there is significant per-line processing that could be done during the read.
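To be concrete about what I mean by per-line processing, a hypothetical version would do the work inside the read loop, with strip() standing in for the real work:

# Hypothetical per-line processing during the read; strip() is a stand-in for real work.
proc = subprocess.Popen(executable, stdout=subprocess.PIPE)
results = []
for line in iter(proc.stdout.readline, b''):
    results.append(line.strip())  # process each line as it arrives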
Can anyone give me insight into the read() performance cost?