I figured out a Cython solution that serves my needs. It uses the `%%cython` cell magic in a Jupyter notebook to handle compiling. I chose 2,000,000 for the array initialization because that's a reasonable upper limit for my data, and the function only returns the rows of the NumPy array that were actually populated. From there it's fairly cheap to pass that array to a pandas DataFrame (a quick sketch of that handoff follows the function below).
I'm not sure how much more optimization is possible, given that I'm also throwing out some garbage lines, which I think rules out memory-mapping. I could possibly use pointers, as in an answer to another question of mine, but it would be tricky to find the data in my file and detect bad lines (see below for more on my bigger problem of reading pages of data) if I were moving a pointer around instead of iterating over lines.
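For reference, this is roughly what I mean by the pointer approach. It's just a sketch with a made-up function name, and it assumes the buffer contains nothing but numbers; my real file has garbage text mixed in, which is exactly what makes this tricky:

%%cython
from libc.stdlib cimport strtod

def parse_all_floats(bytes buf):
    # strtod parses a number starting at p and sets `end` to the first
    # unconsumed character, so the pointer just walks the buffer.
    cdef char* p = buf
    cdef char* end
    values = []
    while True:
        val = strtod(p, &end)
        if end == p:  # nothing parseable remains
            break
        values.append(val)
        p = end
    return values

Anyway, here's the solution I settled on: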
%%cython
import numpy as np
cimport numpy as np
np.import_array()
from libc.stdlib cimport atof

def read_with_cython(filename):
    # 2,000,000 rows is a generous upper bound for my data
    cdef float[:, ::1] data = np.zeros((2000000, 13), np.float32)
    cdef int i = 0
    with open(filename, 'rb') as f:
        for line in f:
            # a valid data row is exactly 133 bytes (131 characters plus the line ending)
            if len(line) == 133:
                data[i, 0] = atof(line[0:5])
                data[i, 1] = atof(line[5:10])
                data[i, 2] = atof(line[12:21])
                data[i, 3] = atof(line[23:32])
                data[i, 4] = atof(line[34:43])
                data[i, 5] = atof(line[45:54])
                data[i, 6] = atof(line[56:65])
                data[i, 7] = atof(line[67:76])
                data[i, 8] = atof(line[78:87])
                data[i, 9] = atof(line[89:98])
                data[i, 10] = atof(line[100:109])
                data[i, 11] = atof(line[111:120])
                data[i, 12] = atof(line[122:131])
                i += 1
    # .base is the underlying ndarray; return only the populated rows
    return data.base[:i]
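The pandas step mentioned at the top then looks something like this (the filename and column names here are placeholders, not from my real data):

import pandas as pd

arr = read_with_cython('my_file.txt')  # placeholder filename
df = pd.DataFrame(arr, columns=['col%d' % k for k in range(13)])  # placeholder names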
With this I was able to run the following:
text = ''' 372 1 0.000E+00 0.000E+00 0.000E+00 9.150E+02 3.236E+03 0.000E+00 0.000E+00 0.000E+00 0.000E+00 0.000E+00 3.623E+02\n'''*300000
with open('demo_file.txt', 'w') as f:
    f.write(text)
%timeit result = read_with_cython('demo_file.txt')
and get this result:
473 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For comparison and completeness, I also wrote a quick pure-Python version:
def read_python(text):
    data = np.zeros((300000, 13), dtype=np.float64)
    for i, line in enumerate(text.splitlines()):
        data[i, 0] = float(line[:5])
        data[i, 1] = float(line[5:10])
        # the remaining 11 fields are 11 characters wide, starting at column 10
        for j in range(11):
            a = 10 + j * 11
            b = a + 11
            data[i, j + 2] = float(line[a:b])
    return data
That version ran in 1.15 s:
1.15 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I then attempted to adapt that to a very simple Cython example, which ran in 717 ms:
%%cython
def read_python_cy(text):
    # quick benchmark: parse every field but discard the results
    text = text.replace('\r\n', '')
    i = 0
    # each record is 131 characters once line endings are stripped
    while i + 131 <= len(text):
        float(text[i:i+5])
        float(text[i+5:i+10])
        for j in range(11):
            a = i + 10 + j * 11
            b = a + 11
            float(text[a:b])
        i += 131
    return 0
717 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Then I broke down and figured out the more optimized Cython version above.
It was at that point that I realized Cython could solve both this and a slow-ish regex problem more efficiently. I had been using regex to find and capture ~5,000 pages of data, which were then concatenated into the table I'm trying to read here. Something closer to my actual Cython function is shown below: it finds a page of data, captures a page-level detail (a time), then reads lines of actual data until a stop flag is detected (a line beginning with a 0 or 1). The regex alone was taking over 1 s just to extract the data I wanted, so this is saving me a lot of time overall.
%%cython
import numpy as np
cimport numpy as np
np.import_array()
from libc.stdlib cimport atof
from cpython cimport bool

def read_pages_cython(filename):
    cdef int n_pages = 0
    cdef bool reading_page = False
    # 13 data columns plus one for the page-level time
    cdef float[:, ::1] data = np.zeros((2000000, 14), np.float32)
    cdef int i = 0
    cdef float time
    with open(filename, 'rb') as f:
        for line in f:
            if not reading_page:
                # a page begins at its SUMMARY header, which carries the time
                if b'SUMMARY' in line:
                    time = atof(line[73:80])
                    reading_page = True
            else:
                # a valid data row is exactly 133 bytes
                if len(line) == 133:
                    data[i, 0] = atof(line[0:5])
                    # data[i, 1] = atof(line[5:10])
                    data[i, 2] = atof(line[12:21])
                    data[i, 3] = atof(line[23:32])
                    data[i, 4] = atof(line[34:43])
                    data[i, 5] = atof(line[45:54])
                    data[i, 6] = atof(line[56:65])
                    data[i, 7] = atof(line[67:76])
                    data[i, 8] = atof(line[78:87])
                    data[i, 9] = atof(line[89:98])
                    data[i, 10] = atof(line[100:109])
                    data[i, 11] = atof(line[111:120])
                    data[i, 12] = atof(line[122:131])
                    data[i, 13] = time
                # a line beginning with 0 or 1 flags the end of the page
                if len(line) > 6:
                    if line[:1] == b'1':
                        # a '1' line may itself start the next page
                        if b'SUMMARY' in line:
                            time = atof(line[73:80])
                            reading_page = True
                        else:
                            reading_page = False
                        i += 1
                        continue
                    elif line[:1] == b'0':
                        reading_page = False
                        i += 1
                        continue
                i += 1
    return data.base[:i]
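For completeness, this is roughly how I consume the result afterward (the filename is a placeholder, and the groupby is just illustrative; the one real detail is that column 13 carries the page time):

import pandas as pd

result = read_pages_cython('report.txt')  # placeholder filename
df = pd.DataFrame(result)
df = df.rename(columns={13: 'time'})  # last column holds the page-level time
pages = df.groupby('time')            # e.g., recover the per-page structure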