I am trying to use dask instead of pandas because I have a 2.6 GB CSV file. I load it and want to drop a column, but it seems that neither the drop method, df.drop('column'), nor slicing, df[:, :-1], is implemented yet. Is this the case, or am I just missing something?

chrisfs

2 Answers

We implemented the drop method in this PR. This is available as of dask 0.7.0.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})

In [3]: import dask.dataframe as dd

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf.drop('y', axis=1).compute()
Out[5]: 
   x
0  1
1  2
2  3

Previously, one could also have selected the columns to keep by name, though this is less attractive if you have many columns.

In [6]: ddf[['x']].compute()
Out[6]: 
   x
0  1
1  2
2  3
MRocklin
  • Why ".compute()"? If your database is very large, doesn't this slow you down?? – FaCoffee Oct 28 '17 at 15:59
  • I only use compute above to show results of the computation. You're correct that calling compute prematurely can be suboptimal. – MRocklin Oct 29 '17 at 18:28
This should work, where columns is a column name or a list of column names:

print(ddf.shape)
ddf = ddf.drop(columns, axis=1)
print(ddf.shape)
Fares Sayah