
I want to read a large file (4 GB) as a Pandas dataframe. Since using Dask directly still consumes maximum CPU, I read the file as a pandas dataframe, then convert it with dask_cudf, and then convert back to a pandas dataframe.

However, my code is still using maximum CPU on Kaggle. GPU accelerator is switched on.

import pandas as pd
from dask import dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(df, npartitions=2)
meth_sub_nt = ddf.infer_objects()
melolilili
  • Reading a 4GB CSV file is going to consume a lot of CPU time. If your goal is to consume less CPU time, can you change to using binary files like Parquet or those produced by `np.save()`? https://numpy.org/doc/stable/reference/generated/numpy.save.html – John Zwinck Jul 31 '22 at 14:50
  • I want to reduce both CPU and RAM. Will `np.save()` achieve both? Thanks. – melolilili Jul 31 '22 at 14:55
  • Yes, `np.save()` is far more efficient than CSV. – John Zwinck Jul 31 '22 at 15:09
  • Your data is large. No way around that. Because of this, you *must* make some trade-offs. There is no library that will allow you to bring 4GB into memory without using at least that much memory. The dask local approach is to split up your *downstream* workflow into chunks which can be processed bit by bit, so you never have the whole thing in memory. You’ll have to restructure your workflow - there’s no point at which you’ll be able to magically load all the data. But yes, saving your data in a binary format (I’d use parquet) is much more memory and time efficient regardless of your approach (see the sketch after these comments). – Michael Delgado Jul 31 '22 at 16:48
  • Also see this answer for general tips on working with csvs: https://stackoverflow.com/a/69153327/3888719 – Michael Delgado Jul 31 '22 at 16:51
  • Your goal is to have the dataframe in memory on the GPU? Do you know about https://docs.rapids.ai/api/cudf/nightly/api_docs/api/cudf.read_csv.html or dask-cudf's version? – mdurant Aug 01 '22 at 00:58
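
A minimal sketch of the comments' suggestion above (a one-time, chunked conversion of the CSV to Parquet, followed by lazy reading with Dask); the chunk size, output directory, and use of a Parquet engine such as pyarrow are assumptions for illustration, not part of the original question:

    import os
    import pandas as pd
    from dask import dataframe as dd

    # One-time conversion: stream the CSV in chunks so the full 4GB file is
    # never held in memory, writing each chunk out as a Parquet part file.
    # The chunk size and output directory are illustrative assumptions.
    os.makedirs("meth_parquet", exist_ok=True)
    reader = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv",
                         sep="\t", index_col=0, chunksize=100_000)
    for i, chunk in enumerate(reader):
        chunk.to_parquet(f"meth_parquet/part_{i:04d}.parquet")

    # Downstream work can then read the Parquet dataset lazily and compute
    # only reduced results, keeping memory use bounded.
    ddf = dd.read_parquet("meth_parquet/")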

2 Answers


I have had a similar problem. With some research, I came across Vaex.

You can read about its performance here and here.

Essentially this is what you can try to do:

  1. Read the CSV file using Vaex and convert it to an HDF5 file (the file format Vaex is most optimised for)

    import vaex

    vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)
    
  2. Open the HDF5 file using Vaex. Vaex will memory-map the file and thus will not load the data into RAM.

    vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
    

Now you can perform operations on your Vaex dataframe just like you would with Pandas. It will be fast, and you should notice significantly lower CPU and memory usage.
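
For illustration, a hedged example of the kind of lazy, out-of-core operations that can follow; the column names `beta_value` and `subtype` are hypothetical stand-ins, not columns from the original file:

    import vaex

    vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')

    # Filtering and aggregation are evaluated lazily over the memory-mapped file,
    # so the full dataset never needs to fit in RAM.
    # 'beta_value' and 'subtype' are hypothetical column names used only here.
    subset = vaex_df[vaex_df['beta_value'] > 0.5]
    print(subset.groupby('subtype', agg={'mean_beta': vaex.agg.mean('beta_value')}))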

You can also try to read your CSV file directly into a Vaex dataframe without converting it to HDF5. I have read that Vaex works fastest with HDF5 files, which is why I suggested the approach above.

vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', chunk_size=5_000)
Abhinav
  • Unfortunately, the code still exceeds RAM allocation. – melolilili Jul 31 '22 at 16:19
  • At which point does the above solution exceed RAM? You can control the RAM usage of `vaex.from_csv` via the `chunk_size` parameter. After that, as the solution states, you do memory-mapped operations, so you should be fine. I am curious, since I've done this many times and have never had any problems. – Joco Aug 01 '22 at 19:31

Right now your code suggests that you first attempt to load the data using pandas and then convert it to a dask-cuDF dataframe. That is not optimal (and might not even be feasible). Instead, you can use the dask_cudf.read_csv function (see docs):

from dask_cudf import read_csv

ddf = read_csv('example_output/foo_dask.csv')
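
As a hedged sketch, the same call applied to the file from the question; the tab separator is taken from the question's pandas call, and the follow-up step is only one assumed way to inspect the result:

    import dask_cudf

    # Read the CSV straight into a GPU-backed Dask dataframe, keeping the
    # parsing on the GPU instead of going through pandas first.
    ddf = dask_cudf.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t")

    # .head() returns a small cuDF DataFrame; convert to pandas only for
    # small, reduced results rather than the whole dataset.
    print(ddf.head().to_pandas())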
SultanOrazbayev