
Is there a way to read a .csv file that is compressed via gz into a dask dataframe?

I've tried it directly with

import dask.dataframe as dd
df = dd.read_csv("Data.gz")

but I get a Unicode error (probably because it is interpreting the compressed bytes as text). There is a "compression" parameter, but compression = "gz" won't work, and I can't find any documentation for it so far.

With pandas I can read the file directly without a problem, other than the result blowing up my memory ;-) but if I restrict the number of rows it works fine.

import pandas as pd
df = pd.read_csv("Data.gz", nrows=100)
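As an aside, the same incremental reading can be done with only the standard library, which also shows why the file must be decompressed before the CSV layer sees it. This is a stdlib-only sketch; the file contents and the name Data.gz are made up here for illustration:

```python
import csv
import gzip
import os
import tempfile

# Create a small gzipped CSV to read back (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "Data.gz")
with gzip.open(path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b"])
    for i in range(1000):
        writer.writerow([i, i * 2])

# gzip.open in text mode decompresses transparently, so the csv
# module sees plain text; read only the first 100 data rows.
with gzip.open(path, "rt", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = [row for _, row in zip(range(100), reader)]

print(len(rows))  # 100
```

Opening the same file with plain `open()` instead of `gzip.open()` is exactly what produces decode errors like the one in the question.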
Magellan88
  • Well, the regular pandas (non-dask) reads is fine without any encoding set, so my guess would be that dask tries to read the compressed gz file directly as an ascii file and gets non-sense. – Magellan88 Oct 07 '16 at 20:06

3 Answers


Pandas' current documentation says:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’

Since 'infer' is the default, that would explain why it is working with pandas.

Dask's documentation on the compression argument:

String like ‘gzip’ or ‘xz’. Must support efficient random access. Filenames with extensions corresponding to known compression algorithms (gz, bz2) will be compressed accordingly automatically

That would suggest that it should also infer the compression, at least for gz. That it doesn't (and still does not in 0.15.3) may be a bug. However, it does work with compression='gzip'.

i.e.:

import dask.dataframe as dd
df = dd.read_csv("Data.gz", compression='gzip')
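For intuition, the extension-based inference the docs describe can be sketched as a simple lookup. This is a hypothetical illustration of the idea, not dask's or pandas' actual implementation:

```python
import os

# Map file extensions to the compression names read_csv accepts
# (note it's "gzip", not "gz", which is why compression="gz" fails).
EXTENSION_TO_COMPRESSION = {
    ".gz": "gzip",
    ".bz2": "bz2",
    ".xz": "xz",
    ".zip": "zip",
}

def infer_compression(path):
    # Return the compression name for a path, or None for plain files,
    # roughly like pandas' default compression='infer'.
    ext = os.path.splitext(path)[1]
    return EXTENSION_TO_COMPRESSION.get(ext)

print(infer_compression("Data.gz"))   # gzip
print(infer_compression("Data.csv"))  # None
```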
de1

It's actually a long-standing limitation of dask. Load the files with dask.delayed instead:

import pandas as pd
import dask.dataframe as dd
from dask.delayed import delayed

filenames = ...
dfs = [delayed(pd.read_csv)(fn) for fn in filenames]

df = dd.from_delayed(dfs) # df is a dask dataframe
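One common way to build the filenames list is with glob; the pattern data-*.csv.gz below is made up for illustration, and the temp directory only exists so the sketch is self-contained:

```python
import glob
import os
import tempfile

# Hypothetical layout: the data split across several gzipped CSV parts.
directory = tempfile.mkdtemp()
for i in range(3):
    open(os.path.join(directory, f"data-{i}.csv.gz"), "wb").close()

# Collect the parts in a deterministic order for dd.from_delayed.
filenames = sorted(glob.glob(os.path.join(directory, "data-*.csv.gz")))
print(len(filenames))  # 3
```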
ricoms
  • 952
  • 15
  • 22
Christian Alis
    I believe the question was about a single gz (which works) not zip file (zip was mentioned as a limitation in the linked GitHub issue). Is there still any advantage using delayed in this case? – de1 Oct 02 '18 at 21:47
  • Sorry, I missed that. I wanted to delete my answer but I couldn't because it's the accepted answer. – Christian Alis Oct 08 '18 at 16:38
    btw: zip will be supported as soon as https://github.com/dask/dask/pull/5064 goes in – mdurant Jul 18 '19 at 15:05

Without the file it's difficult to say. What if you set the encoding, e.g. # -*- coding: latin-1 -*-? Or, since read_csv is based on pandas, you could even try dd.read_csv('Data.gz', encoding='utf-8'). Here's the list of Python encodings: https://docs.python.org/3/library/codecs.html#standard-encodings

Dervin Thunk
  • well, good idea, but still get the error: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. When I decompress the file on disk and read it it almost works but for complaints about NaN types – Magellan88 Oct 07 '16 at 20:09
  • @Magellan88: how about adding `error_bad_lines=False` – Dervin Thunk Oct 07 '16 at 20:32
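The 0x8b in the comment above is telling: gzip files start with the magic bytes 1f 8b, so a decode error on 0x8b at position 1 means the file is being read without decompression, not that the encoding is wrong. A stdlib check (the file here is generated just for the sketch):

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

# Write a small gzip file to test against (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "Data.gz")
with gzip.open(path, "wb") as f:
    f.write(b"col1,col2\n1,2\n")

def looks_gzipped(path):
    # A file starting with 1f 8b is almost certainly gzip.
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC

print(looks_gzipped(path))  # True
```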