Dask Dataframe shape attribute is giving wrong shape

Question

I'm trying to find the shape of a subset dataframe taken from a larger dask dataframe, but instead of the right shape (number of rows), I'm getting a wrong value.

In the example, I stored the first 3 rows in a new dataframe. When I check shape[0], the output is 4 rather than 3. Is there any way to solve this issue?

import pandas as pd
import dask.dataframe as dd

data = {'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'], 'Age': [20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=5)
print(ddf.shape[0].compute())  # --> Outputs 6

# Only selecting 3 rows
only_3 = ddf.loc[:3, :]
print(only_3.shape[0].compute())  # --> Outputs 4 (instead of 3)

EDIT:

How did I miss that? Apologies about the bad example.

I was working with real data of about 24700000 rows, stored in a dask dataframe (23 partitions) read from a csv file. I created a sample dask dataframe by applying .loc[:100,:] to the original dask dataframe, but when I tried to find its shape, I got 2323 as the number of rows.

Can I know how this was calculated? How is the data distributed among all the partitions?

Hi there, and welcome to stack overflow! [Please do not post images of code, data, or errors when asking a question](//meta.stackoverflow.com/questions/285551). Instead, copy it in as [formatted code](/help/formatting), and whenever possible try to create a [mre] so we can reproduce the issue. Thanks! — Michael Delgado, Mar 23 '22 at 21:10

SultanOrazbayev · Answer 1 · 2022-03-31T04:23:50.480

The reason you observe a different number of rows is that .loc will select up to and including the index provided. So this line

only_3 = ddf.loc[:3,:] # this will select 4 rows

is selecting 4 rows, those with index 0,1,2, and 3.

This is based on the pandas API:

A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)

Hence, your code is correct in principle. Just take note of this pandas-specific behaviour: label-based slices include both endpoints.
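To make the difference concrete, here is a minimal sketch in plain pandas (dask follows the same label-slicing rule). It reuses the data from the question and compares a label-based slice with a position-based one.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "nick", "nick", "krish", "jack", "jack"],
    "Age": [20, 21, 21, 19, 18, 18],
})

# Label-based slice: both endpoints are included, so labels 0, 1, 2, 3.
by_label = df.loc[:3]

# Position-based slice: the stop is excluded, so positions 0, 1, 2.
by_position = df.iloc[:3]

print(len(by_label))     # 4
print(len(by_position))  # 3
```

With the default 0-based index, asking for labels up to 2 (that is, .loc[:2]) is what returns exactly three rows.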

Update: if the dask dataframe is constructed by reading a csv file (or in another way that does not generate a unique index), then each partition will have its own index, starting again from 0.

This means that calling .loc[:3] will yield at most 4 rows from every partition. For example, if there are 5 partitions and each has 10 rows, then calling .loc[:4].compute() will yield a dataframe with 25 rows (thanks to @darthbith for the correction).
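The arithmetic can be checked without dask. The sketch below only simulates partitions as a list of plain pandas dataframes, each with its own 0-based index (the names partitions and selected are illustrative); it is not how dask executes the selection internally.

```python
import pandas as pd

# Five stand-in "partitions" of 10 rows each; every one restarts its index at 0.
partitions = [pd.DataFrame({"value": range(10)}) for _ in range(5)]

# Apply the same inclusive label slice to each partition, then combine.
selected = pd.concat([part.loc[:4] for part in partitions])

print(len(selected))                  # 25 (5 partitions x 5 rows)
print(selected.index.tolist()[:6])    # [0, 1, 2, 3, 4, 0]
```

The same reasoning matches the number in the question: 23 partitions times 101 rows each (labels 0 through 100) gives 2323.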

If this is not desirable, there is a way to generate a unique index for every row in the dask dataframe; see this answer.

Thanks for answering my question. Apologies for the bad example? I made an edit to the post about the actual question, would you be able to answer that? — jhanv, Mar 24 '22 at 14:27
I think this is a typo: "then calling .loc[:4].compute() will yield a dataframe with 20 rows." Based on your explanation, `.loc[:4]` will give 5 rows per partition, so either 25 total rows, or it should be `.loc[:3]`, if I understand correctly :-D — darthbith, Mar 31 '22 at 01:27
Thank you, @darthbith, you are correct! I updated the answer. — SultanOrazbayev, Mar 31 '22 at 04:23
