
I have a very large dataframe. It was originally a TAD file; someone saved it with a .csv extension.

I was trying to read it in pandas, but it takes hours, even with the chunksize parameter.

import time
import pandas as pd

start = time.time()
# read the data in chunks of 1 million rows at a time;
# read_csv with chunksize only returns a lazy reader, so the
# actual parsing happens when pd.concat consumes the chunks
chunks = pd.read_csv(
  '/.../estrapola_articoli.csv',
  sep='\t',
  lineterminator='\r',
  chunksize=1000000) # <-- here
articoli = pd.concat(chunks)
end = time.time()
print("Read csv with chunks: ", (end - start), "sec")

I've read about Dask and I've tried the following:

import dask
import dask.dataframe as dd

df = dd.read_csv(
  '/.../estrapola_articoli.csv',
  sep='\t',
  lineterminator='\r') 

Unfortunately, I got this error:

ValueError: Sample is not large enough to include at least one row of data. Please increase the number of bytes in sample in the call to read_csv/read_table

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/dask/backends.py in wrapper(*args, **kwargs)
    125             return func(*args, **kwargs)
    126         except Exception as e:
--> 127             raise type(e)(
    128                 f"An error occurred while calling the {funcname(func)} "
    129                 f"method registered to the {self.backend} backend.\n"

ValueError: An error occurred while calling the read_csv method registered to the pandas backend. Original Message: Sample is not large enough to include at least one row of data. Please increase the number of bytes in sample in the call to read_csv/read_table

So I increased sample:

import dask.dataframe as dd

df = dd.read_csv(
  '/.../estrapola_articoli.csv',
  sep='\t',
  lineterminator='\r',
  sample=1000000)  # 1MB

It gives me the same error. I could try increasing the sample size much further, but a sample that is too large could make the computation inefficient.
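For context, here is a minimal check I can run (the path is the same placeholder as above, and the 1 MB read size is an arbitrary choice). If '\r' never occurs in the sampled bytes, Dask would find no complete row in its sample, which would match the error above:

# count candidate line terminators in the first 1 MB of the file
# (note: the CR count also includes any CRLF pairs)
with open('/.../estrapola_articoli.csv', 'rb') as f:
    head = f.read(1000000)

print("CRLF:", head.count(b'\r\n'))
print("CR  :", head.count(b'\r'))
print("LF  :", head.count(b'\n'))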

Any help to read this file?

    Reading in big data from disk is never going to be efficient. However, I recently tried `polars` library, and am very impressed so far. It works similarly to `pandas`, but leverages parallel operations amongst other efficiency and performance gains. I tested with a ~4gb csv, and `polars` read it about 4.5x faster than `pandas`. You could give it a try with `scan_csv` to work through the data without reading it all into memory at once – Pep_8_Guardiola May 12 '23 at 10:04
  • I gave it a try, but I cannot specify parameter like , sep='\t', lineterminator='\r'. Maybe I should open a new question – coelidonum May 12 '23 at 10:18
  • Check the documentation here: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html . There are `separator` and `eol_char` arguments that can be used if needed (see the sketch after these comments) – Pep_8_Guardiola May 12 '23 at 10:22
  • Question: we don't know what TAD is; is it really possible that your lines are so long? – mdurant May 12 '23 at 13:19
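Following the scan_csv suggestion in the comments, here is a minimal polars sketch. separator and eol_char are the documented polars.scan_csv arguments; the path is the same placeholder used in the question, and eol_char='\r' assumes the terminator really is a bare carriage return:

import polars as pl

# lazily scan the file; nothing is loaded into memory at this point
lf = pl.scan_csv(
  '/.../estrapola_articoli.csv',
  separator='\t',
  eol_char='\r')

# materialise only a small preview to check that the file parses
print(lf.head(5).collect())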

0 Answers