
I have a very large dataframe. It was originally a TAD file; someone saved it with a .csv extension.

I was trying to read it in pandas, but it takes hours, even with the chunksize parameter.

import time
import pandas as pd

start = time.time()
# read the data in chunks of 1 million rows at a time;
# read_csv with chunksize only returns a lazy reader, so the
# actual parsing happens when pd.concat consumes the chunks
chunks = pd.read_csv(
  '/.../estrapola_articoli.csv',
  sep='\t',
  lineterminator='\r',
  chunksize=1000000) # <-- here
articoli = pd.concat(chunks)
end = time.time()
print("Read csv with chunks: ", (end - start), "sec")

I've read about Dask and I've tried the following:

import dask
import dask.dataframe as dd

df = dd.read_csv(
  '/.../estrapola_articoli.csv',
  sep='\t',
  lineterminator='\r') 

Unfortunately, I got this error:

ValueError: Sample is not large enough to include at least one row of data. Please increase the number of bytes in sample in the call to read_csv/read_table

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/dask/backends.py in wrapper(*args, **kwargs)
    125             return func(*args, **kwargs)
    126         except Exception as e:
--> 127             raise type(e)(
    128                 f"An error occurred while calling the {funcname(func)} "
    129                 f"method registered to the {self.backend} backend.\n"

ValueError: An error occurred while calling the read_csv method registered to the pandas backend. Original Message: Sample is not large enough to include at least one row of data. Please increase the number of bytes in sample in the call to read_csv/read_table

So I increased sample:

import dask.dataframe as dd

df = dd.read_csv(
  '/.../estrapola_articoli.csv',
  sep='\t',
  lineterminator='\r',
  sample=1000000)  # 1MB

It gives me the same error. I could try increasing the sample size much further, but a sample that is too large could make the computation inefficient.
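For context, here is a minimal check I can run (the path is the same placeholder as above, and the 1 MB read size is an arbitrary choice). If '\r' never occurs in the sampled bytes, Dask would find no complete row in its sample, which would match the error above:

# count candidate line terminators in the first 1 MB of the file
# (note: the CR count also includes any CRLF pairs)
with open('/.../estrapola_articoli.csv', 'rb') as f:
    head = f.read(1000000)

print("CRLF:", head.count(b'\r\n'))
print("CR  :", head.count(b'\r'))
print("LF  :", head.count(b'\n'))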

Any help to read this file?

    Reading in big data from disk is never going to be efficient. However, I recently tried `polars` library, and am very impressed so far. It works similarly to `pandas`, but leverages parallel operations amongst other efficiency and performance gains. I tested with a ~4gb csv, and `polars` read it about 4.5x faster than `pandas`. You could give it a try with `scan_csv` to work through the data without reading it all into memory at once – Pep_8_Guardiola May 12 '23 at 10:04
  • I gave it a try, but I cannot specify parameter like , sep='\t', lineterminator='\r'. Maybe I should open a new question – coelidonum May 12 '23 at 10:18
  • Check the documentation here: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html . There are `separator` and `eol_char` arguments that can be used if needed (see the sketch after these comments) – Pep_8_Guardiola May 12 '23 at 10:22
  • Question: we don't know what TAD is; is it really possible that your lines are so long? – mdurant May 12 '23 at 13:19
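Following the scan_csv suggestion in the comments, here is a minimal polars sketch. separator and eol_char are the documented polars.scan_csv arguments; the path is the same placeholder used in the question, and eol_char='\r' assumes the terminator really is a bare carriage return:

import polars as pl

# lazily scan the file; nothing is loaded into memory at this point
lf = pl.scan_csv(
  '/.../estrapola_articoli.csv',
  separator='\t',
  eol_char='\r')

# materialise only a small preview to check that the file parses
print(lf.head(5).collect())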

0 Answers