I have a very large dataset. It was originally a TAD file that someone saved with a CSV extension.
I've been trying to read it with pandas, but it takes hours, even with the chunksize parameter:
import time
import pandas as pd

start = time.time()
# read the data in chunks of 1 million rows at a time;
# with chunksize, read_csv returns an iterator of DataFrames
chunks = pd.read_csv(
    '/.../estrapola_articoli.csv',
    sep='\t',
    lineterminator='\r',
    chunksize=1000000)  # <-- here
articoli = pd.concat(chunks)  # the actual parsing happens here
end = time.time()
print("Read csv with chunks:", end - start, "sec")
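For context, concatenating all the chunks rebuilds the whole table in memory anyway, so chunking only really pays off when each piece is processed inside the loop. A rough sketch of that pattern, using the same path and separator as above (the per-chunk work here is just a placeholder row count):

import pandas as pd

row_count = 0
# iterate over the chunks instead of concatenating them all
for part in pd.read_csv('/.../estrapola_articoli.csv',
                        sep='\t',
                        lineterminator='\r',
                        chunksize=1000000):
    row_count += len(part)   # placeholder per-chunk work
print(row_count, 'rows')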
I've read about Dask and I've tried the following:
import dask.dataframe as dd

df = dd.read_csv(
    '/.../estrapola_articoli.csv',
    sep='\t',
    lineterminator='\r')
Unfortunately, I got this error:
ValueError: Sample is not large enough to include at least one row of data.
Please increase the number of bytes in sample in the call to read_csv/read_table

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/dask/backends.py in wrapper(*args, **kwargs)
    125             return func(*args, **kwargs)
    126         except Exception as e:
--> 127             raise type(e)(
    128                 f"An error occurred while calling the {funcname(func)} "
    129                 f"method registered to the {self.backend} backend.\n"

ValueError: An error occurred while calling the read_csv method registered to the pandas backend.
Original Message: Sample is not large enough to include at least one row of data.
Please increase the number of bytes in sample in the call to read_csv/read_table
So I set the sample parameter explicitly:
import dask.dataframe as dd

df = dd.read_csv(
    '/.../estrapola_articoli.csv',
    sep='\t',
    lineterminator='\r',
    sample=1000000)  # 1 MB
It gives me the same error. I could keep increasing the sample size, but a sample that is too large could make the computation inefficient.
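Before going further, I could also check which line terminator the file actually uses, in case '\r' is wrong and the sample genuinely contains no complete row. A quick sketch (same path as above, first 64 KB is enough to count the endings):

# peek at the raw bytes to see which line terminator the file really uses
with open('/.../estrapola_articoli.csv', 'rb') as f:
    raw = f.read(65536)
crlf = raw.count(b'\r\n')
print('CRLF (\\r\\n):', crlf)
print('LF only    :', raw.count(b'\n') - crlf)
print('CR only    :', raw.count(b'\r') - crlf)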
Any help with reading this file would be appreciated.