How should I get the shape of a dask dataframe?

Question

Calling .shape on a dask dataframe gives me the following error:

AttributeError: 'DataFrame' object has no attribute 'shape'

How should I get the shape instead?

score 36 · Answer 1 · answered May 15 '18 at 17:12

You can get the number of columns directly:

len(df.columns)  # this is fast

You can also call len on the dataframe itself, though beware that this will trigger a computation.

len(df)  # this requires a full scan of the data

Dask.dataframe doesn't know how many records are in your data without first reading through all of it.

MRocklin


len(df) loads all of the records, and in my case running it on a table of 144M rows took more than a few minutes (Windows 10, 16 GB RAM, Intel i7). Is there any other way? – Rebin Mar 19 '19 at 21:03
It probably has to load all of the data to find out the length. No, there is no other way. You could consider using something like a database, which tracks this sort of information in metadata. – MRocklin Mar 27 '19 at 05:09
I've been using `df.index.size.compute()`, which is faster than running `len(df)`, but my data is stored in columnar parquet, so it depends on your underlying data architecture. – user108569 Aug 22 '19 at 19:41

score 27 · Answer 2 · answered Jul 15 '19 at 13:13

With shape you can do the following:

a = df.shape
a[0].compute(),a[1]

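This will show the shape as a (rows, columns) pair, just as pandas does. The row count is lazy, so it needs .compute(); the column count is already a plain integer.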

tinashe matambo


score 7 · Answer 3 · edited Aug 25 '20 at 14:58

I know this is quite an old question, but I had the same issue and found an unconventional solution that I want to record here.

I'm assuming your data was originally saved in a CSV-like file. In my situation, I just count the lines of that file (minus one for the header line). Inspired by another answer, this is the solution I'm using:

import dask.dataframe as dd
from itertools import repeat, takewhile

def rawincount(filename):
    # Count newline bytes by reading the file in 1 MiB chunks.
    # The with block closes the file once the count is done.
    with open(filename, 'rb') as f:
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

filename = 'myHugeDataframe.csv'
df = dd.read_csv(filename)
df_shape = (rawincount(filename) - 1, len(df.columns))
print(f"Shape: {df_shape}")

Hope this helps someone else as well.
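
One caveat with counting raw newlines: the count matches the number of records only if no field contains an embedded newline, and it comes out one short if the last line has no trailing newline. The sketch below is an illustration, not part of the original solution; it compares a chunked newline count with a record count taken from the standard csv module. The helper names and the sample file are made up for the example.

```python
import csv
import os
import tempfile


def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes by reading the file in fixed-size chunks."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            total += buf.count(b"\n")
    return total


def count_csv_records(path, has_header=True):
    """Count parsed CSV records, so a quoted multi-line field counts once."""
    with open(path, newline="", encoding="utf-8") as f:
        n = sum(1 for _ in csv.reader(f))
    return n - 1 if has_header and n > 0 else n


# Hypothetical sample: a header plus two records, one of which has a
# quoted field containing a newline.
sample = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

newline_estimate = count_newlines(sample_path) - 1  # subtract the header line
record_count = count_csv_records(sample_path)
os.remove(sample_path)

print(newline_estimate, record_count)  # prints "3 2"
```

Parsing with csv is much slower than counting bytes, so the raw count remains the better choice when the file is known to have one record per line.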

This approach is very fast and takes advantage of distributed processing in dask. – Apichart Thanomkiet Jan 14 '20 at 22:33
Thank you! This is faster than the other possible solution of loading a single column and obtaining its length. – Gabriel Jan 31 '21 at 14:25

score 3 · Answer 4 · answered Sep 02 '20 at 21:37

print('(',len(df),',',len(df.columns),')')

Omid Erfanmanesh


score 1 · Answer 5 · answered Nov 17 '18 at 10:36

To get the shape, you can try this:

dask_dataframe.describe().compute()

The "count" row of the result gives the number of non-null values in each numeric column, which equals the number of rows when there are no missing values.

len(dask_dataframe.columns)

This gives the number of columns in the dataframe.

Jyothish Arumugam


score -2 · Answer 6 · answered Apr 12 '21 at 10:01

You can get the number of columns with the code below.

import dask.dataframe as dd
dd1 = dd.read_csv("filename.txt")
dd1.info()  # call the method; it prints the summary shown below

#Output
<class 'dask.dataframe.core.DataFrame'>
Columns: 6 entries, CountryName to Value
dtypes: object(4), float64(1), int64(1)

sameer_nubia


In pandas, shape returns both the number of rows and the number of columns. I don't think showing only the number of columns answers the OP's question. – panc May 18 '21 at 21:21
What is "Columns: 6 entries" in the output, then? FYI, I am using dask. – sameer_nubia May 20 '21 at 07:45
