
I was tired of waiting while loading a simple distance matrix from a CSV file using numpy.genfromtxt. Following another SO question, I ran a perfplot benchmark, including some additional methods. The results (source code at the end):

[perfplot benchmark plot]

The result for the largest input size shows that the fastest method is load_read_csv, shown below:

import csv
import numpy as np


def load_read_csv(path: str):
    with open(path, 'r') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        first_row = True
        for row_index, row in enumerate(reader):
            if first_row:
                size = len(row)
                matrix = np.zeros((size, size), dtype=int)
                first_row = False
            matrix[row_index] = row

    return matrix

Now I doubt that reading the file line by line, converting each line to a list of strings, then calling int() on each item in the list and adding it to a NumPy matrix is the best possible way.

Can this function be optimized further, or is there some fast library for CSV loading (like the Univocity parser in Java), or maybe just a dedicated NumPy function?

The source code of the test:

import perfplot
import csv
import numpy as np
import pandas as pd


def load_read_csv(path: str):
    with open(path, 'r') as csv_file:
        reader = csv.reader(csv_file)
        matrix = None
        first_row = True
        for row_index, row in enumerate(reader):
            if first_row:
                size = len(row)
                matrix = np.zeros((size, size), dtype=int)
                first_row = False
            # matrix[row_index] = [int(item) for item in row]
            matrix[row_index] = row

    return matrix


def load_loadtxt(path: str):
    matrix = np.loadtxt(path, dtype=int, comments=None, delimiter=",", encoding="utf-8")
    return matrix


def load_genfromtxt(path: str):
    matrix = np.genfromtxt(path, dtype=int, comments=None, delimiter=",", deletechars=None, replace_space=None, encoding="utf-8")
    return matrix


def load_pandas(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32)
    return df.values


def load_pandas_engine_pyarrow(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32, engine='pyarrow')
    return df.values


def load_pandas_engine_python(path: str):
    df = pd.read_csv(path, header=None, dtype=np.int32, engine='python')
    return df.values


def setup(n):
    matrix = np.random.randint(0, 10000, size=(n, n), dtype=int)
    filename = f"square_matrix_of_size_{n}.csv"
    np.savetxt(filename, matrix, fmt="%d", delimiter=",")
    return filename


b = perfplot.bench(
    setup=setup,
    kernels=[
        load_read_csv,
        load_loadtxt,
        load_genfromtxt,
        load_pandas,
        load_pandas_engine_pyarrow,
        load_pandas_engine_python
    ],
    n_range=[2 ** k for k in range(15)]
)
b.save("out.png")
b.show()

  • I think `load_read_csv` is cheating a little bit because it assumes the number of rows to be `size` and preallocates the required array. The other methods don't do that. Maybe repeat this test without the assumption in `load_read_csv`. – mkrieger1 Feb 03 '22 at 14:05
  • But dm is always a square matrix, so I need the opposite: I need to tell other methods to allocate the array upfront :) – Fido Feb 03 '22 at 14:07
  • I see. I rephrased the question a bit to make this more clear. – mkrieger1 Feb 03 '22 at 14:11
  • If there was a "dedicated numpy function", don't you think the previous links would have mentioned it :( I recently explored a single column case, https://stackoverflow.com/questions/70602007/convert-txt-with-300-million-rows-to-numpy-array – hpaulj Feb 03 '22 at 17:22

1 Answer


Parsing CSV files correctly while supporting several data types (e.g. floating-point numbers, integers, strings) and possibly ill-formed input files is clearly not easy, and doing so efficiently is actually pretty hard. Moreover, decoding UTF-8 strings is much slower than reading plain ASCII. This is the reason why most CSV libraries are pretty slow. Not to mention that wrapping a library in Python can introduce a pretty big overhead depending on the input types (especially strings).

Fortunately, if you need to read a CSV file containing a square matrix of integers that is assumed to be correctly formed, you can write much faster code dedicated to your specific needs (which does not care about floating-point numbers, strings, UTF-8, header decoding, error handling, etc.).

That being said, any call to a basic CPython function tends to introduce a huge overhead. Even a simple call to open+read is relatively slow (binary mode is significantly faster than text mode, but unfortunately still not fast enough). The trick is to use NumPy to load the whole binary file into RAM with np.fromfile. This function is extremely fast: it just reads the whole file at once, puts its binary content in a raw memory buffer and returns a view on it. When the file is in the operating system cache or on a high-throughput NVMe SSD storage device, it can be loaded at a speed of several GiB/s.
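For example, here is a minimal sketch (with a hypothetical file name) of what np.fromfile returns in this mode:

import numpy as np

# np.fromfile in binary mode returns one uint8 per byte of the file,
# in a single contiguous buffer (no decoding, no line splitting).
raw = np.fromfile("matrix.csv", dtype=np.uint8)  # hypothetical file name
print(raw.dtype, raw.shape)  # uint8, (file size in bytes,)
print(bytes(raw[:16]))       # the first raw bytes of the CSV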

Once the file is loaded, you can decode it with Numba (or Cython) so the decoding is nearly as fast as native code. Note that Numba does not support strings/bytes well or efficiently. Fortunately, np.fromfile produces a contiguous byte array and Numba can process it very quickly. You can determine the size of the matrix by just reading the first line and counting the number of commas. Then you can fill the matrix very efficiently by decoding integers on the fly, packing them into a flattened matrix and simply treating end-of-line characters as regular separators. Note that \r and \n can both appear in the file since the file is read in binary mode.

Here is the resulting implementation:

import numba as nb
import numpy as np

@nb.njit('int32[:,:](uint8[::1],)', cache=True)
def decode_csv_buffer(rawData):
    COMMA = np.uint8(ord(','))
    CR = np.uint8(ord('\r'))
    LF = np.uint8(ord('\n'))
    ZERO = np.uint8(ord('0'))

    # Find the size of the matrix (`n`)

    n = 0
    lineSize = 0

    for i in range(rawData.size):
        c = rawData[i]
        if c == CR or c == LF:
            break
        n += rawData[i] == COMMA
        lineSize += 1
    
    n += 1

    # Empty matrix
    if lineSize == 0:
        return np.empty((0, 0), dtype=np.int32)

    # Initialization

    res = np.empty(n * n, dtype=np.int32)

    # Fill the matrix

    curInt = 0
    curPos = 0
    lastCharIsDigit = True

    for i in range(len(rawData)):
        c = rawData[i]
        if c == CR or c == LF or c == COMMA:
            if lastCharIsDigit:
# Write the last int in the flattened matrix
                res[curPos] = curInt
                curPos += 1
                curInt = 0
            lastCharIsDigit = False
        else:
            curInt = curInt * 10 + (c - ZERO)
            lastCharIsDigit = True

    return res.reshape(n, n)

def load_numba(filename):
    # Load fully the file in a raw memory buffer
    rawData = np.fromfile(filename, dtype=np.uint8)

    # Decode the buffer using the Numba JIT
    # This method only works for this specific format and
    # can simply crash if the file content is invalid.
    return decode_csv_buffer(rawData)

Be aware that the code is not robust (any bad input results in undefined behaviour, including a crash), but it is very fast.
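As a quick sanity check (assuming the setup and load_pandas functions from the question are available), the result can be compared against one of the slower loaders on a small matrix:

# Sanity check: the fast decoder should produce exactly the same matrix as pandas
# (assumes setup() and load_pandas() from the question are defined).
filename = setup(100)
assert np.array_equal(load_numba(filename), load_pandas(filename))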

Here are the results on my machine:

[perfplot benchmark results including load_numba]

As you can see, the above Numba implementation is at least one order of magnitude faster than all the others. Note that you can write even faster code using multiple threads during the decoding, but this makes the code significantly more complex.
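For reference, here is a minimal sketch of that multi-threaded idea (assuming the same well-formed square integer CSV as above, and kept deliberately simple rather than tuned): a sequential pass records where each row starts, then the rows are decoded independently with numba.prange.

import numba as nb
import numpy as np

# Multi-threaded sketch: a sequential pass records where each row starts,
# then the rows are decoded independently in parallel with prange.
# Same assumptions as above: well-formed square matrix of non-negative
# integers, rows terminated by '\n' (or '\r\n'), no error handling.
@nb.njit('int32[:,:](uint8[::1])', parallel=True)
def decode_csv_buffer_parallel(rawData):
    COMMA = np.uint8(ord(','))
    CR = np.uint8(ord('\r'))
    LF = np.uint8(ord('\n'))
    ZERO = np.uint8(ord('0'))

    if rawData.size == 0:
        return np.empty((0, 0), dtype=np.int32)

    # Size of the matrix: count the commas of the first line
    n = 1
    for i in range(rawData.size):
        c = rawData[i]
        if c == CR or c == LF:
            break
        n += c == COMMA

    # Sequential pass: start offset of each row (a '\n' ends every row)
    starts = np.empty(n, dtype=np.int64)
    starts[0] = 0
    row = 1
    for i in range(rawData.size):
        if rawData[i] == LF and row < n:
            starts[row] = i + 1
            row += 1

    res = np.empty((n, n), dtype=np.int32)

    # Parallel pass: decode each row independently
    for r in nb.prange(n):
        pos = starts[r]
        for col in range(n):
            val = 0
            while pos < rawData.size:
                c = rawData[pos]
                if c == COMMA or c == CR or c == LF:
                    break
                val = val * 10 + (c - ZERO)
                pos += 1
            res[r, col] = val
            pos += 1  # skip the ',' (line endings only follow the last column)

    return res

Note that the sequential offset scan still touches every byte, so the achievable speedup is limited by that pass and by memory bandwidth.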

Jérôme Richard