Efficient way to read 15 M lines csv files in python

Question

For my application, I need to read multiple files with 15 M lines each, store them in a DataFrame, and save the DataFrame in HDFS5 format.

I've already tried different approaches, notably pandas.read_csv with chunksize and dtype specifications, and dask.dataframe. They both take around 90 seconds to treat 1 file, and so I'd like to know if there's a way to efficiently treat these files in the described way. In the following, I show some code of the tests I've done.

import pandas as pd
import dask.dataframe as dd
import numpy as np
import re 

# First approach
store = pd.HDFStore('files_DFs.h5')

chunk_size = 1e6

df_chunk = pd.read_csv(file,
                sep="\t",
                chunksize=chunk_size,
                usecols=['a', 'b'],
                converters={"a": lambda x: np.float32(re.sub(r"[^\d.]", "", x)),\
                            "b": lambda x: np.float32(re.sub(r"[^\d.]", "", x))},
                skiprows=15
           )              
chunk_list = [] 


for chunk in df_chunk:
      chunk_list.append(chunk)


df = pd.concat(chunk_list, ignore_index=True)

store[dfname] = df
store.close()

# Second approach

df = dd.read_csv(
        file,
        sep="\t",
        usecols=['a', 'b'],
        converters={"a": lambda x: np.float32(re.sub(r"[^\d.]", "", x)),\
                    "b": lambda x: np.float32(re.sub(r"[^\d.]", "", x))},
        skiprows=15
     )
store.put(dfname, df.compute())
store.close()

Here is what the files look like (whitespace consists of a literal tab):

a   b
599.998413  14.142895
599.998413  20.105534
599.998413  6.553850
599.998474  27.116098
599.998474  13.060312
599.998474  13.766775
599.998596  1.826706
599.998596  18.275938
599.998718  20.797491
599.998718  6.132450)
599.998718  41.646194
599.998779  19.145775

I got `size is too big (>30 MB)` error. You may add 5-10 lines right in the question body. — , Jul 02 '19 at 06:33
why do you read an XML with the read_csv method ? And if this code works (it doesn't on my computer) just remove those regexes and compile them before (or even better, use str.replace instead) — Thibault D., Jul 02 '19 at 08:40
In fact, I read .txt files. I just inserted some values as an example in this format. The regexes are used because the files may contain some values like "10.042)", and so I do not want to read the ")". — Gabriel Dante, Jul 02 '19 at 08:49
Is it correct that file you read has header `Row num. a b` and lines like `15359970 599.998413 14.142895`? Spaces between values stand for TABs here. — , Jul 02 '19 at 09:20
I guess you may get huge speed boost with `decimal='.'` instead of `converters` argument. — , Jul 02 '19 at 09:38
Sorry, corrected as it is now. The file has the headers "a" and "b". The row numbers appear when I import them as DataFrames. As for the values, the first column corresponds to the time and the second to a sensor aquisition. — Gabriel Dante, Jul 02 '19 at 09:38
As for the decimal, I've read in the read_csv documentation that decimal is already by default set to '.', so I'm not sure it would change the speed. I see the problem may be in the conversion, but I have to do it one way or anoter. — Gabriel Dante, Jul 02 '19 at 09:41
If the file is a text file (what it should as you use `read_csv` please show its content as text, meaning as it appears in a text editor like notepad, notepad++ or vi. — Serge Ballesta, Jul 03 '19 at 08:23
I attempted to revert your crazy XML formatting experiment. Please check the result. In particular, if your actual CSV data contains quoting around the values, please add it back. We really need your sample data to be *exactly* like the actual data you are attempting to process. — tripleee, Jul 03 '19 at 08:34
So no trailing parentheses or anything like that, in spite of what you said in an earlier comment? — tripleee, Jul 03 '19 at 08:40
I have no problems reading the entire file in one go: `pd.read_csv("data2.txt", sep="\t")`. It takes 3.5 seconds to read one file of 15M lines. — Nils Werner, Jul 03 '19 at 14:14
I see, the big problem here is the converter that applies the operation for each one of the 15 M lines. That's what makes it slow... — Gabriel Dante, Jul 03 '19 at 14:36
@GabrielDante Please mark my answer as accepted, or if it did not answer the question, tell me what to improve. — polvoazul, Jul 17 '19 at 04:52

polvoazul · Accepted Answer · 2019-09-23T14:25:41.473

First, lets answer the title of the question

1- How to eficiently read 15M lines of a csv containing floats

I suggest you use modin:

Generating sample data:

import modin.pandas as mpd
import pandas as pd
import numpy as np

frame_data = np.random.randint(0, 10_000_000, size=(15_000_000, 2)) 
pd.DataFrame(frame_data*0.0001).to_csv('15mil.csv', header=False)

!wc 15mil*.csv ; du -h 15mil*.csv

    15000000   15000000  480696661 15mil.csv
    459M    15mil.csv

Now to the benchmarks:

%%timeit -r 3 -n 1 -t
global df1
df1 = pd.read_csv('15mil.csv', header=None)
    9.7 s ± 95.1 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

%%timeit -r 3 -n 1 -t
global df2
df2 = mpd.read_csv('15mil.csv', header=None)
    3.07 s ± 685 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

(df2.values == df1.values).all()
    True

So as we can see modin was approximatly 3 times faster on my setup.

Now to answer your specific problem

2- Cleaning a csv file that contains non numeric characters, and then reading it

As people have noted, your bottleneck is probably the converter. You are calling those lambdas 30 Million times. Even the function call overhead becomes non-trivial at that scale.

Let's attack this problem.

Generating dirty dataset:

!sed 's/.\{4\}/&)/g' 15mil.csv > 15mil_dirty.csv

Approaches

First, I tried using modin with the converters argument. Then, I tried a different approach that calls the regexp less times:

First I will create a File-like object that filters everything through your regexp:

class FilterFile():
    def __init__(self, file):
        self.file = file
    def read(self, n):
        return re.sub(r"[^\d.,\n]", "", self.file.read(n))
    def write(self, *a): return self.file.write(*a) # needed to trick pandas
    def __iter__(self, *a): return self.file.__iter__(*a) # needed

Then we pass it to pandas as the first argument in read_csv:

with open('15mil_dirty.csv') as file:
    df2 = pd.read_csv(FilterFile(file))

Benchmarks:

%%timeit -r 1 -n 1 -t
global df1
df1 = pd.read_csv('15mil_dirty.csv', header=None,
        converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                    1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))}
           )
    2min 28s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -r 1 -n 1 -t
global df2
df2 = mpd.read_csv('15mil_dirty.csv', header=None,
        converters={0: lambda x: np.float32(re.sub(r"[^\d.]", "", x)),
                    1: lambda x: np.float32(re.sub(r"[^\d.]", "", x))}
           )
    38.8 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -r 1 -n 1 -t
global df3
df3 = pd.read_csv(FilterFile(open('15mil_dirty.csv')), header=None,)
    1min ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

Seems like modin wins again! Unfortunatly modin has not implemented reading from buffers yet so I devised the ultimate approach.

The Ultimate Approach:

%%timeit -r 1 -n 1 -t
with open('15mil_dirty.csv') as f, open('/dev/shm/tmp_file', 'w') as tmp:
    tmp.write(f.read().translate({ord(i):None for i in '()'}))
df4 = mpd.read_csv('/dev/shm/tmp_file', header=None)
    5.68 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

This uses translate which is considerably faster than re.sub, and also uses /dev/shm which is in-memory filesystem that ubuntu (and other linuxes) usually provide. Any file written there will never go to disk, so it is fast. Finally, it uses modin to read the file, working around modin's buffer limitation. This approach is about 30 times faster than your approach, and it is pretty simple, also.

score 2 · Answer 2 · answered Jul 03 '19 at 13:42

Well my findings are not much related to pandas, but rather some common pitfalls.

Your code: 
(genel_deneme) ➜  derp time python a.py
python a.py  38.62s user 0.69s system 100% cpu 39.008 total

precompile your regex

Replace re.sub(r"[^\d.]", "", x) with precompiled version and use it in your lambdas
Result : 
(genel_deneme) ➜  derp time python a.py 
python a.py  26.42s user 0.69s system 100% cpu 26.843 total

Try to find a better way then directly using np.float32, since it's 6-10 times slower than i think you expect it to be. Following is not what you want, but i just want to show the issue here.

replace np.float32 with float and run your code. 
My Result:  
(genel_deneme) ➜  derp time python a.py
python a.py  14.79s user 0.60s system 102% cpu 15.066 total

Find another way to achieve the result with the floats. More on this issue https://stackoverflow.com/a/6053175/37491

Divide your file and the work to subprocesses if you can. You already work on separate chunks of constant size. So basically you can divide the file and handle the job in separate processes using multiprocessing or threads.