
I already have a table as a Python string with 300,000+ lines like the following:

  123    1  2.263E-04  2.024E+00  8.943E+03  9.030E+02  2.692E+03  5.448E+03  3.816E-01  1.232E-01  0.000E+00  4.389E+02  1.950E+02

If it helps, this table was generated with the following Fortran FORMAT statement:

FORMAT (2I5,1P11E11.3)
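
For reference, that format is two 5-character integer fields followed by eleven 11-character floats (the 1P scale factor only affects the printed mantissa, not the field widths), so each line is 131 characters plus the line ending and the column boundaries can be worked out up front, roughly like:

# Column layout implied by the FORMAT statement: 2 x I5, then 11 x E11.3
widths = [5, 5] + [11] * 11
starts = [sum(widths[:k]) for k in range(len(widths))]
slices = [(a, a + w) for a, w in zip(starts, widths)]
# slices == [(0, 5), (5, 10), (10, 21), (21, 32), ..., (120, 131)]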

I'd like to see if I can load this faster than pandas.read_csv(..., delim_whitespace=True) which is taking 540ms for me.

import pandas as pd
from io import StringIO

text = '''  372    1  0.000E+00  0.000E+00  0.000E+00  9.150E+02  3.236E+03  0.000E+00  0.000E+00  0.000E+00  0.000E+00  0.000E+00  3.623E+02\n'''*300000
%timeit df = pd.read_csv(StringIO(text), delim_whitespace=True, header=None)

yields:

549 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I thought that knowing the line length and column widths would make read_fwf faster, but it's apparently less optimized:

widths = [5]*2 + [11]*11
%timeit df = pd.read_fwf(StringIO(text), widths=widths, header=None)

yields:

2.95 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is this something that could be made faster with Cython? I have very little experience with C or Cython, so I unfortunately don't know where to even get started with an initial example. I'm also open to something like f2py, but only if it's worth the hassle over Cython. I already have some numba and Cython stuff in my dependencies, so I'm more open to Cython solutions. I realize numba doesn't deal with text, so it's not helpful here.

Thanks to anyone who can help!

flutefreak7
  • You're right that `read_fwf` is less optimized, you might consider the approach suggested here - https://github.com/pandas-dev/pandas/issues/22592 - if I understand correctly, directly mmap your data into a numpy structured array – chrisb Oct 02 '18 at 15:01
  • @chrisb you were right, thanks – cs95 Oct 02 '18 at 18:21
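
The approach in chrisb's comment could look roughly like the sketch below: map the fixed-width bytes onto a numpy structured array and convert the fields afterwards. The field names are placeholders, 'demo_file.txt' is just a placeholder path, and it assumes every row is exactly 132 bytes (131 data characters plus a single '\n'):

import numpy as np

# One record = two 5-byte integer fields, eleven 11-byte float fields, one newline.
row = np.dtype([('i0', 'S5'), ('i1', 'S5')]
               + [('e%d' % k, 'S11') for k in range(11)]
               + [('eol', 'S1')])
raw = np.memmap('demo_file.txt', dtype=row, mode='r')
# Byte-string fields convert directly; stack them into an (nrows, 13) float array.
values = np.column_stack([raw[name].astype(np.float64) for name in row.names[:-1]])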

1 Answer


I figured out a Cython solution that serves my needs. It uses the Cython cell magic in a Jupyter notebook to handle compilation. I chose 2,000,000 rows for the array initialization because that's a reasonable upper limit for my data, and the function returns only the rows of the numpy array that were actually filled. It's then fairly cheap to pass that array to a pandas DataFrame.

I'm not sure how much more optimization can be done, given that I'm also throwing out some garbage lines, which I think rules out memory-mapping. I could possibly use pointers, as in an answer to another question of mine, but it would be tricky to locate the data in my file and detect bad lines (see below for more on my bigger problem of reading pages of data) if I were moving a pointer around instead of iterating over lines.

%%cython
import numpy as np
cimport numpy as np
np.import_array()
from libc.stdlib cimport atof
from cpython cimport bool

def read_with_cython(filename):

    # Preallocate a generous upper bound of rows; the unused tail is sliced
    # off before returning.
    cdef float[:, ::1] data = np.zeros((2000000, 13), np.float32)
    cdef int i = 0
    with open(filename, 'rb') as f:
        for line in f:
            # A valid data line is 131 data characters plus the 2-byte line
            # ending, i.e. exactly 133 bytes.
            if len(line) == 133:
                # Fixed slices into the 5- and 11-character fields. Each float
                # slice skips the first two characters of its 11-wide field,
                # which assumes the values are non-negative; starting each
                # slice at 10 + 11*k would also keep a leading minus sign.
                data[i, 0] = atof(line[0:5])
                data[i, 1] = atof(line[5:10])
                data[i, 2] = atof(line[12:21])
                data[i, 3] = atof(line[23:32])
                data[i, 4] = atof(line[34:43])
                data[i, 5] = atof(line[45:54])
                data[i, 6] = atof(line[56:65])
                data[i, 7] = atof(line[67:76])
                data[i, 8] = atof(line[78:87])
                data[i, 9] = atof(line[89:98])
                data[i, 10] = atof(line[100:109])
                data[i, 11] = atof(line[111:120])
                data[i, 12] = atof(line[122:131])

            i += 1

    # Slice off the unused tail of the underlying numpy array.
    return data.base[:i]

With this I was able to run the following:

text = '''  372    1  0.000E+00  0.000E+00  0.000E+00  9.150E+02  3.236E+03  0.000E+00  0.000E+00  0.000E+00  0.000E+00  0.000E+00  3.623E+02\n'''*300000

with open('demo_file.txt', 'w') as f:
    f.write(text)

%timeit result = read_with_cython('demo_file.txt')

and get this result:

473 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
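
As mentioned, getting from that array to a DataFrame is cheap; the column names below are just placeholders:

import pandas as pd

result = read_with_cython('demo_file.txt')
df = pd.DataFrame(result, columns=['id1', 'id2'] + ['val%d' % k for k in range(11)])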

For comparison and completeness, I also wrote a quick pure python version:

def read_python(text):
    data = np.zeros((300000, 13), dtype=np.float64)
    for i, line in enumerate(text.splitlines()):
        data[i, 0] = float(line[:5])
        data[i, 1] = float(line[5:10])
        for j in range(11):
            a = 10+j*11
            b = a + 11
            data[i, j+2] = float(line[a:b])

    return data

Which ran in 1.15s:

1.15 s ± 8.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I then attempted to adapt that to a very simple Cython example that ran in 717ms:

%%cython
def read_python_cy(text):
    # Strip the line endings so the rows form one contiguous run of
    # 131-character records, then walk through it with a fixed stride.
    # The converted values are thrown away; this only measures parsing cost.
    text = text.replace('\r\n', '\n').replace('\n', '')
    i = 0
    n = len(text)

    while i + 131 <= n:
        float(text[i:i+5])
        float(text[i+5:i+10])
        for j in range(11):
            a = i + 10 + j*11
            b = a + 11
            float(text[a:b])
        i += 131

    return 0

717 ms ± 5.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then I broke down and figured out the more optimized Cython version above.

It was at that point that I realized Cython could more efficiently solve both this and a slow-ish regex problem. I had been using regex to find and capture ~5000 pages of data, which were then concatenated into the table I'm trying to read here. Something closer to my actual Cython function is shown below. It handles finding a page of data, capturing a page-level detail (the time), then reading lines of actual data until a stop flag is detected (a line beginning with a 0 or 1). The regex alone was taking over 1 s just to extract the data I wanted, so this is saving me a lot of time overall.

%%cython
import numpy as np
cimport numpy as np
np.import_array()
from libc.stdlib cimport atof
import cython
from cpython cimport bool

def read_pages_cython(filename):    

    cdef int n_pages = 0
    cdef bool reading_page = False
    cdef float[:, ::1]  data = np.zeros((2000000, 14), np.float32)
    cdef int i = 0
    cdef float time
    with open(filename, 'rb') as f:
        for line in f:
            if not reading_page:
                # A 'SUMMARY' header marks the start of a new page of data;
                # grab the page-level time from its fixed position.
                if b'SUMMARY' in line:
                    time = atof(line[73:80])
                    reading_page = True
            else:
                # Fixed-width data lines are exactly 133 bytes long.
                if len(line) == 133:
                    data[i, 0] = atof(line[0:5])
                    # data[i, 1] = atof(line[5:10])
                    data[i, 2] = atof(line[12:21])
                    data[i, 3] = atof(line[23:32])
                    data[i, 4] = atof(line[34:43])
                    data[i, 5] = atof(line[45:54])
                    data[i, 6] = atof(line[56:65])
                    data[i, 7] = atof(line[67:76])
                    data[i, 8] = atof(line[78:87])
                    data[i, 9] = atof(line[89:98])
                    data[i, 10] = atof(line[100:109])
                    data[i, 11] = atof(line[111:120])
                    data[i, 12] = atof(line[122:131])
                    data[i, 13] = time

                # Lines starting with '1' or '0' are page flags: a '1' line
                # either starts a new SUMMARY page or ends the current one,
                # and a '0' line ends it.
                if len(line) > 6:
                    if line[:1] == b'1':
                        if b'SUMMARY' in line:
                            time = atof(line[73:80])
                            reading_page = True
                        else:
                            reading_page = False
                            i += 1
                            continue

                    elif line[:1] == b'0':
                        reading_page = False
                        i += 1
                        continue

            i += 1

    return data.base[:i]
flutefreak7
  • using `atof` might not be a great idea in the long run: if the input is rubbish it happily returns some values without raising an error. You will not be notified that your input is corrupt - one day this will bite you. – ead Oct 04 '18 at 09:24
  • What's the alternative? I could theoretically run a slower version for validation after I've initialized the UI plot for the user. I don't expect invalid data from a Fortran-based analysis program that will never be recompiled. – flutefreak7 Oct 04 '18 at 11:49
  • As time goes by, things change... There is for example `strtod`: using `endptr` you could check whether the conversion was ok. But the real question is: is the speed-up of 10-20% worth rolling out a hand-made and error-prone version? – ead Oct 04 '18 at 12:06
  • Ok, thanks! Looks like this discusses what you're talking about: https://stackoverflow.com/questions/4308536/converting-a-string-into-a-double – flutefreak7 Oct 04 '18 at 14:27
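
For reference, the strtod/endptr check ead describes could be wrapped in a small helper along these lines (a rough sketch; the name checked_atof is made up):

%%cython
from libc.stdlib cimport strtod

cdef double checked_atof(const char* s) except? -1.0:
    # Like atof, but raise if strtod could not consume any characters
    # instead of silently returning 0.0.
    cdef char* end = NULL
    cdef double value = strtod(s, &end)
    if end == s:
        raise ValueError("could not parse a float from the field")
    return value

# usage inside the parsing loop, e.g.:
# data[i, 2] = checked_atof(line[12:21])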