How to read a CSV into a Dask dataframe so that it does not have an “Unnamed: 0” column?

Question

Goal

I want to read a CSV into a Dask dataframe without getting an “Unnamed: 0” column.

CODE

import dask.dataframe as dd

mydtype = {'col1': 'object',
           'col2': 'object',
           'col3': 'object',
           'col4': 'float32',}


do = dd.read_csv('/folder/somecsvname.csv',
                 dtype = mydtype,
                 low_memory=False,
                 parse_dates=['col3'],
                )

Result Columns

Unnamed: 0
col1
col2
col3
col4

Tried solutions

1. Works with pandas but not with dask: pd.read_csv add column named "Unnamed: 0"
2. Works with pandas but not with dask: How to get rid of "Unnamed: 0" column in a pandas DataFrame?
Code added to the read call: index_col=False. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
Code added to the read call: index_col=0. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
Code recommended by the previous 2 error messages. Problem: this just sets a column as the index but still generates the 'Unnamed: 0' column:

do = dd.read_csv('/folder/somecsvname.csv', 
                 dtype = mydtype, 
                 low_memory=False,
                 parse_dates=['col3'],
                ).set_index('col3')

Code added to the read call: index_col=None. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
Code added to the read call: index_col=None, header=0. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead

score 4 · Accepted Answer · answered Feb 24 '21 at 12:13

The problem is that this column (Unnamed: 0) is present in the original csv file. It's best to address it upstream, at the time this file is generated. If that's not possible, then the best you can do with dask.dataframe is:

ddf = dd.read_csv(my_file)
ddf = ddf.drop('Unnamed: 0', axis=1)

Here's a reproducible example:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(range(5))
df.to_csv('abc.csv')

ddf = dd.read_csv('abc.csv')
ddf = ddf.drop('Unnamed: 0', axis=1)
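If the file is produced with pandas, the upstream fix mentioned above is probably just `index=False` in `to_csv`. The sketch below assumes that is how the file was generated. The column names and temporary file names are made up for illustration. It shows that the default `to_csv` writes the index as a first column with an empty header cell, which is what later gets named `Unnamed: 0` when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data; the column names are only for illustration.
df = pd.DataFrame({"col1": ["a", "b"], "col4": [1.0, 2.0]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = os.path.join(tmp, "with_index.csv")
    without_index = os.path.join(tmp, "without_index.csv")

    # Default behaviour: the index is written as an unnamed first column.
    df.to_csv(with_index)
    # Upstream fix: do not write the index at all.
    df.to_csv(without_index, index=False)

    # Look at the raw header line of each file.
    with open(with_index) as f:
        header_with_index = f.readline().strip()
    with open(without_index) as f:
        header_without_index = f.readline().strip()

    # Read both files back and compare the resulting columns.
    columns_with_index = list(pd.read_csv(with_index).columns)
    columns_without_index = list(pd.read_csv(without_index).columns)

print(header_with_index)      # leading comma: the first header cell is empty
print(columns_with_index)     # includes 'Unnamed: 0'
print(columns_without_index)  # only the real columns
```

A file written this way has no extra column to drop, so `dd.read_csv` can read it without the `drop` step.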

answered Feb 24 '21 at 12:13

SultanOrazbayev


Yep so currently this is how I solve it, but it would be great if it wouldn't even create the column and that is the goal of the post. – sogu Feb 24 '21 at 15:05
Yes, that's not possible with `dask.dataframe`. You could achieve something similar with `delayed`, but it's really not worth it, because of the introduced code complexity. – SultanOrazbayev Feb 24 '21 at 15:21
Add `df = df.reset_index(drop=True)` – Darren Weber Oct 19 '22 at 02:20
@DarrenWeber: AFAIU, `.reset_index` would work if `Unnamed: 0` was used as the index column by `dd.read_csv`. However, the [current API doesn't allow specifying an index column](https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html). – SultanOrazbayev Oct 19 '22 at 13:58
I mean, after the `ddf.drop(...)`, I found it useful to reset the index after computing the pandas DataFrame i.e.: `df = ddf.compute().reset_index(drop=True)` – Darren Weber Oct 19 '22 at 21:28
I see, that's definitely useful for getting a unique index for dfs that fit into memory; however, for large dfs this might not be feasible. – SultanOrazbayev Oct 20 '22 at 02:26

score 2 · Answer 2 · answered Feb 24 '21 at 11:37

Try adding these 2 combinations to the read_csv call:

index_col=None
index_col=None, header=0

answered Feb 24 '21 at 11:37

FedericoSala


It gave the same error message, but thanks. – sogu Feb 24 '21 at 15:05
