
Just out of curiosity, if dask enables both len() and size, why is there not shape as well?

jpp
Maria
  • This may be a design choice or an oversight. Unless you can catch the guy who wrote the library, you are likely to find only opinions here (not too helpful). – jpp Jun 28 '18 at 10:39
  • @jpp The guy who wrote it is really active on SO. – rpanai Jun 28 '18 at 12:21
  • @user32185, Yup, I know, he's really helpful too, which is why I commented :). – jpp Jun 28 '18 at 13:30
  • Related: https://stackoverflow.com/questions/50355598/how-should-i-get-the-shape-of-a-dask-dataframe – mdurant Jun 28 '18 at 14:17

1 Answer


This has been discussed within the dask project. First, note that in the Python data model, len() must always return a concrete integer. Dask respects this, so len(df) blocks while the row count is computed, unlike most operations on a dataframe. There is no such constraint on .size, which is therefore lazy.
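The integer constraint on len() can be seen in plain Python, without dask. Below is a minimal sketch; `LazyCount` and `Frame` are hypothetical names made up for illustration and are not part of dask or a description of its internals.

```python
class LazyCount:
    """Stand-in for a deferred value: nothing is counted until compute()."""

    def __init__(self, func):
        self._func = func

    def compute(self):
        return self._func()


class Frame:
    """Hypothetical toy container, not dask's implementation."""

    def __init__(self, rows):
        self._rows = rows

    def __len__(self):
        # Returning a lazy placeholder here is rejected by len().
        return LazyCount(lambda: len(self._rows))

    @property
    def size(self):
        # An ordinary attribute is free to return a lazy placeholder.
        return LazyCount(lambda: len(self._rows))


frame = Frame([1, 2, 3])

try:
    len(frame)
    len_error = None
except TypeError as exc:
    len_error = exc  # len() insists on a real integer

lazy_size = frame.size               # no counting has happened yet
concrete_size = lazy_size.compute()  # counting happens only here
```

Because len() refuses anything that is not an integer, a library that wants to support len(df) at all has to do the work immediately, whereas an attribute such as .size can hand back something deferred.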

The metadata of the dataframe, however, is immediately available: the number, names and types of the columns are known without computing any of the data. Therefore, .shape would be a combination of a known value (the number of columns) and either a lazy or a slowly-computed concrete value (the number of rows). This doesn't seem necessary.
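To make the "known value plus deferred value" point concrete, here is a rough sketch in plain Python. `ToyLazyTable`, `row_count` and `shape_parts` are hypothetical names invented for this illustration; this is not how dask is implemented, only a picture of why a shape would mix two kinds of values.

```python
from typing import Callable, List, Tuple


class ToyLazyTable:
    """Hypothetical illustration only; not dask's API or internals."""

    def __init__(self, columns: List[str], load_rows: Callable[[], List[tuple]]):
        self.columns = columns       # metadata: known up front
        self._load_rows = load_rows  # data: only read on demand
        self.loads = 0               # counts how often the data was read

    def row_count(self) -> int:
        # Counting rows requires actually reading the data.
        self.loads += 1
        return len(self._load_rows())

    def shape_parts(self) -> Tuple[Callable[[], int], int]:
        # Row count stays deferred; column count comes from metadata alone.
        return self.row_count, len(self.columns)


table = ToyLazyTable(["a", "b"], lambda: [(1, 2), (3, 4), (5, 6)])

deferred_rows, n_cols = table.shape_parts()
loads_before = table.loads  # the data has not been touched yet
n_rows = deferred_rows()    # reading the data happens only now
```

The column count is answered for free, while the row count either stays as a deferred callable or forces a read of the data, which is the awkward mix described above.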

mdurant