Goal

I want to read a CSV file into a Dask DataFrame without getting an “Unnamed: 0” column.

CODE

import dask.dataframe as dd

mydtype = {'col1': 'object',
           'col2': 'object',
           'col3': 'object',
           'col4': 'float32'}

do = dd.read_csv('/folder/somecsvname.csv',
                 dtype=mydtype,
                 low_memory=False,
                 parse_dates=['col3'],
                 )

Result Columns

  • Unnamed: 0
  • col1
  • col2
  • col3
  • col4

Tried solutions

  • 1. Works with pandas but not with Dask: pd.read_csv add column named "Unnamed: 0"
  • 2. Works with pandas but not with Dask: How to get rid of "Unnamed: 0" column in a pandas DataFrame?
  • Added index_col=False to the read call. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
  • Added index_col=0 to the read call. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
  • Used the code recommended by the previous 2 error messages. Problem: this only sets col3 as the index; the 'Unnamed: 0' column is still generated.
do = dd.read_csv('/folder/somecsvname.csv', 
                 dtype = mydtype, 
                 low_memory=False,
                 parse_dates=['col3'],
                ).set_index('col3')
  • Added index_col=None to the read call. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
  • Added index_col=None, header=0 to the read call. Error message: ValueError: Keywords 'index' and 'index_col' not supported. Use dd.read_csv(...).set_index('my-index') instead
sogu

2 Answers

4

The problem is that this column (Unnamed: 0) is present in the original csv file. It's best to address it upstream, at the time this file is generated. If that's not possible, then the best you can do with dask.dataframe is:

ddf = dd.read_csv(my_file)
ddf = ddf.drop('Unnamed: 0', axis=1)

Here's a reproducible example:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(range(5))
df.to_csv('abc.csv')

ddf = dd.read_csv('abc.csv')
ddf = ddf.drop('Unnamed: 0', axis=1)
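
The upstream fix mentioned above can be sketched as follows, assuming the CSV is produced by pandas `DataFrame.to_csv`. The file location and the column name are illustrative and do not come from the question. Passing `index=False` keeps the index out of the file, so there is no unnamed first column for a later read to pick up:

```python
import os
import tempfile

import pandas as pd

# Illustrative data; the column name is hypothetical.
df = pd.DataFrame({"col4": [0.0, 1.0, 2.0, 3.0, 4.0]})

# Hypothetical output location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "abc.csv")

# index=False stops pandas from writing the index as an unnamed first column.
df.to_csv(path, index=False)

# Reading the file back shows only the real columns.
columns = list(pd.read_csv(path).columns)
print(columns)
```

A file written this way has no unnamed header field, so a later `dd.read_csv` on it should not need the `drop` step.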
SultanOrazbayev
  • Yep so currently this is how I solve it, but it would be great if it wouldn't even create the column and that is the goal of the post. – sogu Feb 24 '21 at 15:05
  • Yes, that's not possible with `dask.dataframe`. You could achieve something similar with `delayed`, but it's really not worth it, because of the introduced code complexity. – SultanOrazbayev Feb 24 '21 at 15:21
  • Add `df = df.reset_index(drop=True)` – Darren Weber Oct 19 '22 at 02:20
  • @DarrenWeber: AFAIU, `.reset_index` would work if `Unnamed: 0` was used as the index column by `dd.read_csv`. However, the [current API doesn't allow specifying an index column](https://docs.dask.org/en/stable/generated/dask.dataframe.read_csv.html). – SultanOrazbayev Oct 19 '22 at 13:58
  • I mean, after the `ddf.drop(...)`, I found it useful to reset the index after computing the pandas DataFrame i.e.: `df = ddf.compute().reset_index(drop=True)` – Darren Weber Oct 19 '22 at 21:28
  • I see, that's definitely useful for getting a unique index for dfs that fit into memory; however, for large dfs this might not be feasible. – SultanOrazbayev Oct 20 '22 at 02:26
2

Try adding these 2 combinations to the read_csv call:

index_col=None
index_col=None, header=0
FedericoSala