
I have an awkward CSV file with multiple delimiters: the delimiter for the non-numeric part is ',', and for the numeric part it is ';'. I want to construct a dataframe out of only the numeric part, as efficiently as possible.

I have made 5 attempts, among them: utilising the converters argument of pd.read_csv, using regex with engine='python', and using str.replace. They are all more than 2x slower than reading the entire CSV file with no conversions, which is prohibitively slow for my use case.

I understand the comparison isn't like-for-like, but it does demonstrate that the overall poor performance is not driven by I/O. Is there a more efficient way to read the data into a numeric Pandas dataframe, or the equivalent NumPy array?

The string below can be used for benchmarking.

# Python 3.7.0, Pandas 0.23.4

from io import StringIO
import pandas as pd
import numpy as np
import csv

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def csv_reader_1(x):
    # parse only the 4th column, split it on ';' in a converter,
    # then expand the resulting lists into float columns
    df = pd.read_csv(x, usecols=[3], header=None, delimiter=',',
                     converters={3: lambda x: x.split(';')})
    return df.join(pd.DataFrame(df.pop(3).values.tolist(), dtype=float))

def csv_reader_2(x):
    # split on ';' so the string prefix lands in column 0, strip it with a
    # converter, then cast everything to float (specifying dtype in read_csv
    # would conflict with the converter)
    df = pd.read_csv(x, header=None, delimiter=';',
                     converters={0: lambda x: x.rsplit(',')[-1]})
    return df.astype(float)

def csv_reader_3(x):
    return pd.read_csv(x, usecols=[3, 4, 5], header=None, sep=',|;', engine='python')

def csv_reader_4(x):
    with x as fin:
        reader = csv.reader(fin, delimiter=',')
        L = [i[-1].split(';') for i in reader]
        return pd.DataFrame(L, dtype=float)

def csv_reader_5(x):
    with x as fin:
        return pd.read_csv(StringIO(fin.getvalue().replace(';',',')),
                           sep=',', header=None, usecols=[3, 4, 5])

Checks:

res1 = csv_reader_1(StringIO(x))
res2 = csv_reader_2(StringIO(x))
res3 = csv_reader_3(StringIO(x))
res4 = csv_reader_4(StringIO(x))
res5 = csv_reader_5(StringIO(x))

print(res1.head(3))
#        0       1         2
# 0  34.23  562.45  213.5432
# 1  56.23   63.45  625.2340
# 2  34.23  562.45  213.5432

assert all(np.array_equal(res1.values, i.values) for i in (res2, res3, res4, res5))

Benchmarking results:

%timeit csv_reader_1(StringIO(x))  # 5.31 s per loop
%timeit csv_reader_2(StringIO(x))  # 6.69 s per loop
%timeit csv_reader_3(StringIO(x))  # 18.6 s per loop
%timeit csv_reader_4(StringIO(x))  # 5.68 s per loop
%timeit csv_reader_5(StringIO(x))  # 7.01 s per loop
%timeit pd.read_csv(StringIO(x))   # 1.65 s per loop

Update

I'm open to using command-line tools as a last resort. To that end, I have included such an answer. My hope is that there is a pure-Python or Pandas solution with comparable efficiency.

  • Have you considered using regular expressions for multiple delimiters? For example: [link 1](https://stackoverflow.com/questions/26551662/import-text-to-pandas-with-multiple-delimiters), [link 2](https://stackoverflow.com/questions/31194669/use-multiple-character-delimiter-in-python-pandas-read-csv). Not sure if it'd be any faster. – chris Jan 04 '19 at 18:39
  • @chris, Now I have (see edit), regex with `engine='python'` is ~8x slower than `pd.read_csv` with no converters. – jpp Jan 04 '19 at 18:45
  • @jpp, what if you use `engine='c'`? The docs suggest: "The C engine is faster while the Python engine is currently more feature-complete." – Karn Kumar Jan 04 '19 at 19:28
  • @pygo, the docs explain regex only works with `engine='python'`. No go. – jpp Jan 04 '19 at 20:12
  • What is stopping you from just replacing all the ; with , in the CSV file and importing it normally? – DrMcCleod Jan 07 '19 at 12:37
  • `replacing all the ; for , in the CSV file`: that's the problem. Possibly the *best* answer is to do this via some command-line tool, i.e. Python isn't good for this purpose. In which case, that's the answer, together with a suitable rationale. Except now the next question is what's the most efficient tool to do this (probably out of scope for SO). – jpp Jan 07 '19 at 12:42
  • Elements in the first three columns are always 4 characters long. Is that always the case for your data? – ead Jan 08 '19 at 15:10
  • @ead, No, that isn't the case, they are arbitrary strings, I'll edit to clarify that. – jpp Jan 08 '19 at 15:10

6 Answers


Use a command-line tool

By far the most efficient solution I've found is to use a specialist command-line tool to replace ";" with "," and then read into Pandas. Pandas or pure Python solutions do not come close in terms of efficiency.

Essentially, using Cython or a tool written in C / C++ is likely to outperform Python-level string manipulation.

For example, using Find And Replace Text:

import os

os.chdir(r'C:\temp')                       # change directory location
os.system('fart.exe -c file.csv ";" ","')  # run FART with character to replace

df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None)  # read file into Pandas
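
On Linux or macOS the same trick works with standard tools. A minimal sketch, assuming GNU sed is on the PATH (BSD sed on macOS needs `-i ''`) and the same hypothetical file.csv:

import subprocess

import pandas as pd

# replace every ';' with ',' in place, then parse normally with the C engine
subprocess.run(['sed', '-i', 's/;/,/g', 'file.csv'], check=True)

df = pd.read_csv('file.csv', usecols=[3, 4, 5], header=None)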

How about using a generator to do the replacement, combined with a suitable adapter to get a file-like object that Pandas can read?

import io
import pandas as pd

# strings in first 3 columns are of arbitrary length
x = '''ABCD,EFGH,IJKL,34.23;562.45;213.5432
MNOP,QRST,UVWX,56.23;63.45;625.234
'''*10**6

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)

def replacementgenerator(haystack, needle, replace):
    # walk the input character by character, swapping needle for replace
    for s in haystack:
        if s == needle:
            yield str.encode(replace)
        else:
            yield str.encode(s)

df = pd.read_csv(iterstream(replacementgenerator(x, ";", ",")),
                 header=None, usecols=[3, 4, 5])

Note that we convert the string (or its constituent characters) to bytes through str.encode, as this is required for use by Pandas.

This approach is functionally identical to the answer by Daniele, except that we replace values on the fly, as they are requested, instead of all in one go. A chunked variation is sketched below.
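
One caveat of this design is that yielding a single encoded character per iteration keeps the loop in Python for every byte. A hedged sketch of my own (not part of the original answer) that yields larger chunks, pushing the replacement into str.replace, which runs in C:

def replacementgenerator_chunked(haystack, needle, replace, chunksize=1 << 16):
    # Safe for single-character needles: no match can straddle a chunk
    # boundary, so each slice can be replaced independently.
    for i in range(0, len(haystack), chunksize):
        yield haystack[i:i + chunksize].replace(needle, replace).encode()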


If this is an option, substituting the character ; with , in the string is faster. I have written the string x to a file test.dat.

def csv_reader_4(x):
    with open(x, 'r') as f:
        a = f.read()
    # unicode() is only needed on Python 2; on Python 3, pass the replaced
    # string to StringIO directly
    return pd.read_csv(StringIO(unicode(a.replace(';', ','))), usecols=[3, 4, 5])

The unicode() function was necessary to avoid a TypeError in Python 2.

Benchmarking:

%timeit csv_reader_2('test.dat')  # 1.6 s per loop
%timeit csv_reader_4('test.dat')  # 1.2 s per loop
  • This is causing `MemoryError` for me, presumably because it requires reading everything in, effectively, twice? Once into `a` and then into `pd.DataFrame`. – jpp Jan 07 '19 at 11:07
  • I think `a.replace` creates a copy. Unfortunately, I do not see an easy way to avoid this without using more sophisticated tools like `cython` – Daniele Jan 07 '19 at 12:32

A very fast option (3.51 s per loop in my benchmark): simply redefine csv_reader_4 as below. It converts the StringIO to str via getvalue(), replaces ; with ,, and reads the dataframe with sep=',':

def csv_reader_4(x):
    with x as fin:
        reader = pd.read_csv(StringIO(fin.getvalue().replace(';', ',')),
                             sep=',', header=None)
    return reader

The benchmark:

%timeit csv_reader_4(StringIO(x)) # 3.51 s per loop
  • Have you tested *relative* performance on consistent hardware / setup? I see this as one of the slower solutions; I have updated my question with benchmarking for this. – jpp Jan 13 '19 at 15:51
  • @jpp Ugh, your timings are different from mine; I am on Windows. – U13-Forward Jan 14 '19 at 03:00
  • @U9-Forward I improved your approach by doing the replacement during `read()` operations: https://stackoverflow.com/a/54176770/6394138 – Leon Jan 14 '19 at 06:47

In my environment (Ubuntu 16.04, 4GB RAM, Python 3.5.2) the fastest method was (the prototypical¹) csv_reader_5 (taken from U9-Forward's answer), which ran less than 25% slower than reading the entire CSV file with no conversions. I improved that approach by implementing a filter/wrapper that replaces the character inside the read() method:

class SingleCharReplacingFilter:

    def __init__(self, reader, oldchar, newchar):
        def proxy(obj, attr):
            a = getattr(obj, attr)
            if attr == 'read':
                # wrap read() so every chunk returned to the parser has
                # oldchar replaced with newchar
                def f(*args):
                    return a(*args).replace(oldchar, newchar)
                return f
            else:
                return a

        for a in dir(reader):
            if not a.startswith("_") or a == '__iter__':
                setattr(self, a, proxy(reader, a))

def csv_reader_6(x):
    with x as fin:
        return pd.read_csv(SingleCharReplacingFilter(fin, ";", ","),
                           sep=',', header=None, usecols=[3, 4, 5])

The result is slightly better performance than even reading the entire CSV file with no conversions:

In [3]: %timeit pd.read_csv(StringIO(x))
605 ms ± 3.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit csv_reader_5(StringIO(x))
733 ms ± 3.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit csv_reader_6(StringIO(x))
568 ms ± 2.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

¹ I call it prototypical because it assumes that the input stream is of StringIO type (since it calls .getvalue() on it).
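
For reference, a usage sketch with a real file instead of StringIO (test.dat is an assumed filename holding the benchmark data): the filter only needs an object exposing read(), so an ordinary text file handle should work as well:

with open('test.dat') as fin:
    df = pd.read_csv(SingleCharReplacingFilter(fin, ';', ','),
                     sep=',', header=None, usecols=[3, 4, 5])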

  • On Python 3.7, Pandas 0.23.4, I'm getting `ValueError: Invalid file path or buffer object type` on the `pd.read_csv` line. Any ideas? – jpp Jan 15 '19 at 23:26
  • @jpp Pandas 0.23.4 has an extra requirement for an object to be considered file-like: it must have an `__iter__` method. I updated my answer to reflect that. – Leon Jan 16 '19 at 08:52
  • Sorry for the delay. I timed this, it took 1s longer than `csv_reader_1` on my setup (4.28s for `_1` vs 5.28s for `_6`). I'm using input `x = """..."""*10**6` as per my question, Python 3.7.0, Pandas 0.23.4, Windows. I understand this is going to be platform / setup dependent. – jpp Jan 20 '19 at 04:47

Python has powerful features to manipulate data, but don't expect performance from Python itself. When performance is needed, C and C++ are your friends: any fast library in Python is written in C/C++. It is quite easy to use C/C++ code in Python; have a look at the SWIG utility (http://www.swig.org/tutorial.html). You can write a C++ class containing fast utilities that you call from your Python code when needed.
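
A full SWIG example is beyond a short sketch, but the underlying principle (push the per-byte work down into C) is already visible inside the standard library, since bytes.translate is implemented in C. A minimal sketch, reusing the benchmark string x from the question:

import io

import pandas as pd

# build a 256-byte translation table mapping ';' to ','; the translation
# itself then runs at native speed
table = bytes.maketrans(b';', b',')
data = x.encode().translate(table)
df = pd.read_csv(io.BytesIO(data), sep=',', header=None, usecols=[3, 4, 5])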
