
So I have some data files that look like this:

      47
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   325   240   2   340   225   2   370   225   2   385   240   2
   385   315   2   475   240   3   460   240   2   460   255   2   475   255   2
   475   240   2   595   315   3   580   330   2   550   330   2   535   315   2
   535   240   2   550   225   2   580   225   2   595   240   2   595   315   2
   700   315   3   685   330   2   655   330   2   640   315   2   640   240   2
   655   225   2   685   225   2   700   240   2   700   315   2   700   315   3
  9076   456   2  9102   449   2  9127   443   2  9152   437   2  9178   433   2
  9203   430   2  9229   428   2  9254   427   2  9280   425   2  9305   425   2
     0     0 999  6865    259999
      20
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   325   240   2   340   225   2   370   225   2   385   240   2
   385   315   2   475   240   3   460   240   2   460   255   2   475   255   2
   475   240   2   595   315   3   580   330   2   550   330   2   535   315   2

The first number is the number of points in the block that follows, and the block then contains that many points, with up to 5 points per line. Each point has 3 components (I'll call them x, y, z). x and y each get 6 characters, while z gets 4, so each point takes 16 characters. Occasionally z is 9999, leaving no space between y and z, so using split() will mess up parsing those lines. Also, all the numbers are integers (no decimals), but some are negative.

In the actual file the blocks are generally 1000 points long, with some blocks being smaller (at the end of a "page", where page breaks are denoted by z=9999).
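To illustrate the split() problem, take the page-break line in the sample above (the one ending in 259999): its second point has y=25 and z=9999, so whitespace splitting fuses them, while fixed-width slicing still recovers all three values.

chunk = "  6865    259999"    # one 16-character point whose z is 9999
print(chunk.split())          # ['6865', '259999'] -- 2 tokens instead of 3
print(int(chunk[0:6]), int(chunk[6:12]), int(chunk[12:16]))   # 6865 25 9999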

My initial solution was to use regex:

import re
def get_points_regex(filename):
    with open(filename, 'r') as f:
        text = f.read()
    points = []
    for m in re.finditer(r'([ \d-]{6})([ \d-]{6})([ \d\-]{4})', text):
        point = tuple(int(i) for i in m.groups())
        points.append(point)
    return points

My test file is 55283 lines long (4.4 MB) and contains 274761 points.

Using timeit on get_points_regex I get 560 ms.

I then figured that while finditer is memory efficient, generating thousands of match objects is slow when I don't need any of their features, so I made a version using re.findall:

def get_points_regex2(filename):
    with open(filename, 'r') as f:
        text = f.read()
    points = re.findall(r'([ \d-]{6})([ \d-]{6})([ \d\-]{4})', text)
    points = [tuple(map(int, point)) for point in points]
    return points

This version runs in 414 ms, 1.35x faster than finditer.

Then I was thinking that for such simple patterns regex might be overkill, so I made a version using pure Python:

def get_points_simple(filename):
    points = []
    with open(filename, 'r') as f:
        for line in f:
            n_chunks = int(len(line)/16)
            for i in range(n_chunks):
                chunk = line[16*i:16*(i+1)]
                x = int(chunk[0:6])
                y = int(chunk[6:12])
                z = int(chunk[12:16])
                points.append((x, y, z))
    return points

This runs in 386 ms, 1.07x faster than regex.

Then I broke down and tried Cython for the first time. I'm just running it with the %%cython cell magic in a Jupyter notebook. I came up with this:

%%cython
def get_points_cython(filename):
    cdef int i, x, y, z
    points = []
    f = open(filename, 'r')
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points.append((x, y, z))

    f.close()
    return points

The Cython function runs in 196 ms (2x faster than pure Python).

I tried to simplify some expressions, like not using a context manager for file opening. While I declared the integers, I wasn't sure what else to do, so I left the rest alone. I made a couple of attempts at using a 2D integer array instead of a list of tuples for the points, but Python segfaulted (I'm assuming that's what happened; the IPython kernel died). I had `cdef int points[1000000][3]` and then assigned with statements like `points[j][1] = x` while incrementing j. From some light reading and very little C background, I think that might be a rather large array (1,000,000 × 3 ints is about 12 MB, which I gather is too big for the stack)? Stack vs. heap (I don't really know what these are)? Do I need things like malloc? I'm a bit lost on that stuff.

Next, I had read that maybe I should just use NumPy, since Cython is good at that. Following that, I was able to create this function:

%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t

def get_points_cython_numpy(filename):
    cdef int i, j, x, y, z
    cdef np.ndarray points = np.zeros([1000000, 3], dtype=DTYPE)
    f = open(filename, 'r')
    j = 0
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points[j, 0] = x
            points[j, 1] = y
            points[j, 2] = z
            j = j + 1

    f.close()
    return points

Unfortunately this takes 263 ms, so it's a little slower.

Am I missing something obvious in Cython or the Python standard library that would make parsing this any faster, or is this about as fast as it gets for a file of this size?

I thought about pandas and NumPy loading functions, but I figured the chunk-size rows would complicate things too much. At one point I almost had something working with pandas read_fwf followed by DataFrame.values.reshape(-1, 3) and then dropping rows with NaNs, but I knew that had to be slower by that point.
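For reference, a rough sketch of what that read_fwf approach would look like (a hypothetical helper, not what I'm using; it assumes 5 points per line at widths 6/6/4 and relies on the NaN padding produced by count lines and short lines):

import numpy as np
import pandas as pd

def get_points_pandas(filename):
    widths = [6, 6, 4] * 5                      # 5 points per line, 6+6+4 characters each
    df = pd.read_fwf(filename, widths=widths, header=None)
    arr = df.values.reshape(-1, 3)              # one (x, y, z) row per point
    arr = arr[~np.isnan(arr).any(axis=1)]       # drop rows padded with NaN
    return arr.astype(int)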

Any ideas to speed this up would be very appreciated!

I'd love to get this below 100 ms so that a GUI can be updated rapidly from reading these files as they get generated. (Move slider > run background analysis > load data > plot results in real time).

flutefreak7
  • Do the data files have to be in text format? If you didn't have to convert from strings to integers, you should be able to read the data faster. – Håken Lid May 13 '16 at 22:56
  • Thanks for the idea, but the files are produced by an analysis code that I have no control over. – flutefreak7 May 13 '16 at 23:25
  • Now that I have a ~15ms solution to this problem, I've added a [follow-on question about grouping and categorizing this data for plotting](http://stackoverflow.com/questions/37330104/efficiently-grouping-and-sorting-data-with-cython). – flutefreak7 May 19 '16 at 18:32

2 Answers


Here is a faster example. It uses fast_atoi() to convert strings to ints, and it's about 2x faster than get_points_cython() on my PC. If the number-of-points lines all have the same width (8 chars), then I think I can speed it up further (to about 12x faster than get_points_cython()).

%%cython
import numpy as np
cimport numpy as np
import cython

cdef int fast_atoi(char *buff):
    # Minimal atoi for a null-terminated buffer: note a '-' sign,
    # accumulate digits, and skip everything else (spaces).
    cdef int c = 0, sign = 0, x = 0
    cdef char *p = buff
    while True:
        c = p[0]
        if c == 0:               # end of string
            break
        if c == 45:              # '-'
            sign = 1
        elif c > 47 and c < 58:  # '0'-'9'
            x = x * 10 + c - 48
        p += 1
    return -x if sign else x

@cython.boundscheck(False)
@cython.wraparound(False)
def get_points_cython_numpy(filename):
    cdef int i, j, x, y, z, n_chunks
    cdef bytes line, chunk
    cdef int[:, ::1] points = np.zeros([500000, 3], np.int32)
    f = open(filename, 'rb')
    j = 0
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = fast_atoi(chunk[0:6])
            y = fast_atoi(chunk[6:12])
            z = fast_atoi(chunk[12:16])
            points[j, 0] = x
            points[j, 1] = y
            points[j, 2] = z
            j = j + 1

    f.close()
    return points.base[:j]

Here is the fastest method. The idea is to read the whole file content into a bytes object and get the points data directly from it.

%%cython
import numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline int fast_atoi(char *buf, int size):
    # Same parser as above, but reads exactly `size` characters instead of
    # stopping at a null terminator.
    cdef int i = 0, c = 0, sign = 0, x = 0
    for i in range(size):
        c = buf[i]
        if c == 0:
            break
        if c == 45:              # '-'
            sign = 1
        elif c > 47 and c < 58:  # '0'-'9'
            x = x * 10 + c - 48
    return -x if sign else x

@cython.boundscheck(False)
@cython.wraparound(False)
def fastest_read_points(fn):
    cdef bytes buf
    with open(fn, "rb") as f:
        buf = f.read().replace(b"\n", b"")  # strip newlines; adjust for your line endings (e.g. b"\r\n")

    cdef char * p = buf
    cdef int length = len(buf)
    cdef char * buf_end = p + length
    cdef int count = length // 16 * 2  # length // 16 already bounds the point count; double it to be safe
    cdef int[:, ::1] res = np.zeros((count, 3), np.int32)
    cdef int i, j, block_count
    i = 0
    while p < buf_end:
        block_count = fast_atoi(p, 10)
        p += 10
        for j in range(block_count):
            res[i, 0] = fast_atoi(p, 6)
            res[i, 1] = fast_atoi(p+6, 6)
            res[i, 2] = fast_atoi(p+12, 4)
            p += 16
            i += 1
    return res.base[:i]
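For reference, calling it is just the following (the filename is only an example; note that the code assumes the number-of-points lines are 10 characters wide, matching fast_atoi(p, 10)):

points = fastest_read_points("data.txt")   # int32 array with one (x, y, z) row per point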
HYRY
  • This is working fantastically! Of the 9 different methods I'm benchmarking now, this is 93.6 ms, which is the fastest. I tried using the number-of-points line to read each ~1000 pt. chunk of data at a time, instead of line by line (similar to the other answer), and then using fast_atoi, but it's a bit slower at 101 ms. The points-per-chunk lines are each 10 characters wide and contain an integer between 1 and 1000, if that helps. – flutefreak7 May 17 '16 at 15:59
  • @flutefreak7 I added a faster method, please check it. – HYRY May 19 '16 at 13:58
  • looks like you're using a new fast_atoi that takes 2 arguments... edit: do I just change `while True` to `for _ in range(n):` perhaps? – flutefreak7 May 19 '16 at 15:03
  • Whoa! It's now under 15 ms! Nice! I edited fast_atoi to work with the n argument as mentioned in my previous comment. Now my bottleneck is sorting the points into pages and lines (z denotes new line segment, new page, change color, change line style, etc.), which I'm doing in Cython, but I'm using Python lists and dicts because I don't know the C way to do it. I'll make a new question if I decide I need to optimize that further. – flutefreak7 May 19 '16 at 15:13
  • I've added a [new question](http://stackoverflow.com/questions/37330104/efficiently-grouping-and-sorting-data-with-cython) about grouping and categorizing this data in cython after it has been read. – flutefreak7 May 19 '16 at 18:31
  • @HYRY `fast_atoi` in the first code block takes 5x longer than int() type casting on my machine. Any possible reason why? I'm using macOS and a Jupyter notebook. – Fenil Mar 31 '19 at 03:58

Files that are fixed-format and well-behaved can be read efficiently with NumPy. The idea is to read the file into an array of strings and then convert to integers in one go. The tricky bit is the handling of variable-width fields and the placement of newline characters. One way to do it for your file is:

def read_chunk_numpy(fh, n_points):
    # 16 chars per point, plus one newline character for every 5 points
    n_bytes = n_points * 16 + (n_points + 1) // 5

    txt_arr = np.fromfile(fh, 'S1', n_bytes)
    txt_arr = txt_arr[txt_arr != b'\n']    
    xyz = txt_arr.view('S6,S6,S4').astype('i,i,i')
    xyz.dtype.names = 'x', 'y', 'z'
    return xyz
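To make the view/astype step concrete, here is a tiny illustration on a hand-built buffer of two 16-character points (np.frombuffer stands in for the np.fromfile call above):

import numpy as np

raw = b"   425   425  -3  6865    259999"   # two points, 16 bytes each, newlines already removed
arr = np.frombuffer(raw, dtype='S1')        # one byte per array element
xyz = arr.view('S6,S6,S4').astype('i,i,i')  # regroup into 6+6+4 byte fields, then cast to ints
print(xyz)                                  # -> records (425, 425, -3) and (6865, 25, 9999)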

Note that \n newline characters are assumed, so some more effort is needed for portability. This gave me a huge speedup compared to the plain Python loop. Test code:

import numpy as np

def write_testfile(fname, n_points):
    # Write bytes explicitly so the file always uses plain \n newlines,
    # matching the byte counting in read_chunk_numpy.
    with open(fname, 'wb') as fh:
        for _ in range(n_points // 1000):
            n_chunk = np.random.randint(900, 1100)
            fh.write((str(n_chunk).rjust(8) + '\n').encode())
            xyz = np.random.randint(10**4, size=(n_chunk, 3))
            for i in range(0, n_chunk, 5):
                for row in xyz[i:i+5]:
                    fh.write(('%6i%6i%4i' % tuple(row)).encode())
                fh.write(b'\n')

def read_chunk_plain(fh, n_points):
    points = []
    count = 0
    # Use while-loop because `for line in fh` would mess with file pointer
    while True:
        line = fh.readline()
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points.append((x, y, z))

            count += 1
            if count == n_points:
                return points

def test(fname, read_chunk):
    with open(fname, 'rb') as fh:
        line = fh.readline().strip()
        while line:
            n = int(line)
            read_chunk(fh, n)
            line = fh.readline().strip()

fname = 'test.txt'
write_testfile(fname, 10**5)
%timeit test(fname, read_chunk_numpy)
%timeit test(fname, read_chunk_plain)
  • So I've worked with this for a while. I indeed have Windows newlines to contend with, so I must account for 2 characters for every line. Also, to account for the final newline when the chunk is not divisible by 5, I needed a `ceiling()`-like operation (`(n + 4) // 5` is the same as `math.ceil(n / 5)`) to round the line count up. So for me `n_bytes = n_points * 16 + 2*((n_points+4) // 5)` worked well (see the sketch after these comments). Then getting rid of newlines became: `chunk = chunk[(chunk != b'\n') & (chunk != b'\r')]`. Finally I needed a list and numpy.concatenate to combine the data. – flutefreak7 May 17 '16 at 14:40
  • In an effort to get more speed I moved the data conversion after the concatenation, but that didn't help much. Hmm, it looks like the concatenate is slowing it down... trying without that now... no, it's the snazzy `view().astype()` that's a bit slower apparently. I'm getting 387ms which is comparable to my simple pure python method. – flutefreak7 May 17 '16 at 14:46
  • OP is specifically asking about reading with Cython – Feline Sep 02 '21 at 17:50
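Putting the adjustments from flutefreak7's comments together, a Windows-newline variant of read_chunk_numpy might look like the sketch below (not benchmarked here; it assumes \r\n line endings and uses the rounded-up byte count from the comment above):

import numpy as np

def read_chunk_numpy_crlf(fh, n_points):
    # 16 chars per point, plus 2 bytes (\r\n) per line of up to 5 points,
    # rounding the line count up so a short final line is still covered
    n_bytes = n_points * 16 + 2 * ((n_points + 4) // 5)

    txt_arr = np.fromfile(fh, 'S1', n_bytes)
    txt_arr = txt_arr[(txt_arr != b'\n') & (txt_arr != b'\r')]
    xyz = txt_arr.view('S6,S6,S4').astype('i,i,i')
    xyz.dtype.names = 'x', 'y', 'z'
    return xyz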