
I have a CSV that I am reading into a pandas DataFrame, but it takes about 35 minutes to read. The CSV is approximately 120 GB. I found a module called cuDF that provides a GPU DataFrame, however it is only for Linux. Is there something similar for Windows?

import pandas as pd
from tqdm import tqdm

# Read the CSV in chunks and concatenate them into a single DataFrame
chunk_list = []
for chunk in tqdm(pd.read_csv('\\large_array.csv', header=None,
                              low_memory=False, error_bad_lines=False,
                              chunksize=10000)):
    chunk_list.append(chunk)

array = pd.concat(chunk_list)
print(array)
rzaratx
  • You should consider an alternative serialization format, csv is not really designed for performance. What is the nature of your data? – juanpa.arrivillaga Nov 13 '19 at 00:05
  • As @juanpa.arrivillaga mentioned, you should save your data in a `.hd5` file. It will be way faster to then load it into a dataframe. – Teddy Nov 13 '19 at 00:12
  • My data consists of tens of thousands of 1-line arrays. – rzaratx Nov 13 '19 at 01:09
  • I've been trying to save as `.h5` but the code keeps telling me that there is no module named `tables` even though I pip installed it. – rzaratx Nov 13 '19 at 01:11
  • @RicardoZaragoza what does that mean? 1 line arrays? CSV is *text*, what exactly do you mean by arrays? Also, if you are trying to use hd5, show the code that is failing. Did you try using `pandas.DataFrame.to_hdf`? – juanpa.arrivillaga Nov 13 '19 at 01:17
  • They are just numbers separated by commas. I've been trying to use `array.to_hdf('large_array.h5', key='df', mode='w')`. I have checked the pytables.org prerequisites and installed everything it recommended. – rzaratx Nov 13 '19 at 01:27
  • I `import tables` at the beginning of my code but I get the error `ModuleNotFoundError: No module named 'tables'` – rzaratx Nov 13 '19 at 01:32
  • This sounds like an entirely different problem, but anyway, likely the speedup from HDF will be significant. However, do you actually need a pandas DataFrame, or can you just use a `numpy.ndarray`? Then you could save it using the native NumPy serialization format, which would be quite fast as well. – juanpa.arrivillaga Nov 13 '19 at 01:35
  • I figured it out. I `import h5py` and save using `array = h5py.File('large_array.h5', mode='w')`. I'll see if that improves loading time. – rzaratx Nov 13 '19 at 01:46
  • Read it via Python datatable and convert to a pandas frame. – jangorecki Nov 14 '19 at 04:16
  • Does this answer your question? [Fastest way to parse large CSV files in Pandas](https://stackoverflow.com/questions/25508510/fastest-way-to-parse-large-csv-files-in-pandas) – Michael Delgado Sep 12 '21 at 18:08
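
Following up on the HDF5 suggestion in the comments above, here is a minimal sketch of a one-time CSV-to-HDF5 conversion with `DataFrame.to_hdf` (the file path and key below are illustrative; it requires PyTables, installed as the `tables` package):

import pandas as pd

# `df` is the DataFrame built from the CSV, as in the question's code
df = pd.read_csv('large_array.csv', header=None)

# One-time conversion to HDF5; requires PyTables (`pip install tables`)
df.to_hdf('large_array.h5', key='df', mode='w')

# Later runs can load the binary HDF5 file instead of re-parsing the CSV
df = pd.read_hdf('large_array.h5', key='df')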

2 Answers


You can also look at dask.dataframe if you really want to read it into a DataFrame with a pandas-like API.

For reading CSVs, this will parallelize your I/O task across multiple cores and, if needed, multiple nodes. Scaling out across nodes should also alleviate memory pressure, since with a 120 GB CSV you will probably be memory-bound too.
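
A minimal sketch of that approach (the file path and blocksize below are illustrative), assuming Dask is installed:

import dask.dataframe as dd

# Build a lazily-evaluated, partitioned DataFrame; partitions are read in parallel
ddf = dd.read_csv('large_array.csv', header=None, blocksize='256MB')

# Keep working with the Dask DataFrame lazily, or materialize it into pandas
# (materializing requires enough RAM to hold the full result)
df = ddf.compute()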

Another good alternative might be using Apache Arrow.
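
A minimal sketch using pyarrow's multithreaded CSV reader (the file path is illustrative), assuming pyarrow is installed:

import pyarrow.csv as pv

# Parse the CSV with Arrow's multithreaded reader into an Arrow Table;
# autogenerate_column_names handles the headerless file from the question
table = pv.read_csv(
    'large_array.csv',
    read_options=pv.ReadOptions(autogenerate_column_names=True),
)

# Convert the Arrow Table to a pandas DataFrame
df = table.to_pandas()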

Vibhu Jawa

Do you have a GPU? If so, please look at BlazingSQL, a GPU SQL engine in a Python package.

The article "Querying a Terabyte with BlazingSQL" describes this approach, and BlazingSQL supports reading from CSV.

After you get the GPU DataFrame, convert it to a pandas DataFrame with:

# from cuDF DataFrame to pandas DataFrame
df = gdf.to_pandas()
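
For context, a minimal sketch of the full flow (the table name and file path are illustrative), assuming BlazingSQL and cuDF are installed:

from blazingsql import BlazingContext

# Create a BlazingSQL context and register the CSV as a SQL table
bc = BlazingContext()
bc.create_table('large_array', 'large_array.csv')

# Run the query on the GPU; the result is a cuDF DataFrame
gdf = bc.sql('SELECT * FROM large_array')

# Convert to pandas if downstream code expects a pandas DataFrame
df = gdf.to_pandas()
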
Pamungkas Jayuda