I'm trying to find the shape of a subset of a larger dask dataframe, but instead of the correct number of rows I'm getting a wrong value.
In the example, I store the first 3 rows in a new dataframe, but when I check shape[0], the output is 4 rather than 3. Is there any way to solve this issue?
import pandas as pd
import dask.dataframe as dd

data = {'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'], 'Age': [20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=5)
print(ddf.shape[0].compute())  # --> Outputs 6
# Only selecting 3 rows
only_3 = ddf.loc[:3, :]
print(only_3.shape[0].compute())  # --> Outputs 4 (instead of 3)
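For reference, the 4 seems to come from .loc being label-based with an inclusive end label, so :3 covers the labels 0, 1, 2 and 3. A minimal sketch in plain pandas (used here only to keep the example small; the names by_label and by_position are just illustrative) shows the difference from positional slicing:

```python
import pandas as pd

data = {'Name': ['Tom', 'nick', 'nick', 'krish', 'jack', 'jack'],
        'Age': [20, 21, 21, 19, 18, 18]}
df = pd.DataFrame(data)

# .loc slices by index label and includes the end label: labels 0, 1, 2, 3
by_label = df.loc[:3, :]

# .iloc slices by position and excludes the end position: positions 0, 1, 2
by_position = df.iloc[:3, :]

print(by_label.shape[0])     # 4
print(by_position.shape[0])  # 3
```

With the default integer index, .loc[:2, :] would therefore be the label-based way to get the first 3 rows. On the dask side, something like ddf.head(3, npartitions=-1) may be an alternative, though it returns a pandas dataframe rather than a lazy dask one.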
EDIT:
How did I miss that? Apologies about the bad example.
I was working with the real data, about 24,700,000 rows read from a csv file into a dask dataframe (23 partitions). I created a sample dask dataframe by applying .loc[:100,:] to the original dask dataframe, but when I checked its shape, I got 2323 as the number of rows.
How was this number calculated? How is the data distributed among the partitions?
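One plausible explanation, assuming the dataframe came from dd.read_csv without a global index being set: each partition then carries its own index restarting at 0, so .loc[:100,:] matches labels 0 through 100 in every partition, giving 23 × 101 = 2323 rows. The sketch below only simulates this with plain pandas frames standing in for partitions; rows_per_partition and the value column are hypothetical, not taken from the real data.

```python
import pandas as pd

n_partitions = 23
rows_per_partition = 500  # hypothetical size; anything >= 101 gives the same result

# Each stand-in "partition" gets its own default index 0..rows_per_partition-1,
# mimicking an index that restarts at 0 in every partition.
partitions = [
    pd.DataFrame({'value': range(rows_per_partition)})
    for _ in range(n_partitions)
]

# A label-based slice applied per partition keeps labels 0..100 (101 rows each).
selected = [part.loc[:100, :] for part in partitions]
total_rows = sum(len(part) for part in selected)

print(total_rows)  # 2323
```

If that is what is happening, checking ddf.known_divisions and the index of a single partition (for example ddf.get_partition(0).index.compute()) should confirm it.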