Let's generate a smaller example consisting of the columns time, module_name, and latency. Let time run from 0 to 4, module_name from "module_0" to "module_9", and latency be a random variable.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
"time": np.tile(np.arange(0,5),10),
"module_n": np.array([[i]*5 for i in np.arange(10)]).flatten(),
}
).assign(
module_name=lambda x: "module_" + x.module_n.astype(str),
latency=np.random.random(50)
).drop(columns="module_n")
df1
Here's a preview of the output:
time module_name latency
0 0 module_0 0.650732
1 1 module_0 0.184202
2 2 module_0 0.741331
3 3 module_0 0.903374
4 4 module_0 0.440044
.. ... ... ...
45 0 module_9 0.024248
46 1 module_9 0.468306
47 2 module_9 0.763958
48 3 module_9 0.556926
49 4 module_9 0.696217
[50 rows x 3 columns]
Now you want to apply a .groupby operation, which takes each subset of df1 grouped by the values in the module_name column and performs an operation on it.
Let's take a look at one of these subsets:
df1_group_module_0 = df1.loc[df1.module_name=="module_0"]
df1_group_module_0
time module_name latency
0 0 module_0 0.650732
1 1 module_0 0.184202
2 2 module_0 0.741331
3 3 module_0 0.903374
4 4 module_0 0.440044
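To see that each group really is such a subset, you can iterate over the groupby object directly. This is only a rough sketch: the seed, the np.repeat shortcut, and the names rng and groups are my own additions, not part of your code.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(
    {
        "time": np.tile(np.arange(0, 5), 10),
        "module_name": np.repeat([f"module_{i}" for i in range(10)], 5),
        "latency": rng.random(50),
    }
)

# Iterating over a groupby yields (group key, sub-DataFrame) pairs.
groups = {name: group for name, group in df1.groupby("module_name")}

# Each sub-DataFrame matches the subset selected by hand with .loc.
manual_subset = df1.loc[df1.module_name == "module_0"]
print(groups["module_0"].equals(manual_subset))
print(len(groups))
```

Each of these sub-DataFrames is what gets passed as x to the lambda in .apply.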
The operation that you applied with the .groupby is lambda x: x.iloc[0:999]. For this smaller example, let's say I want only the first two rows, so I will use x.iloc[0:2]. Let's see what happens when we apply this operation to one group, the one we selected above:
df1_group_module_0.iloc[0:2]
time module_name latency
0 0 module_0 0.650732
1 1 module_0 0.184202
What you get is a new dataframe with the columns time, module_name, and latency, and it corresponds to the first two rows of module_0.
The .groupby.apply(lambda x: x.iloc[0:2]) combines the results of applying the above operation to each group and prepends the group key, module_name, to the index of the combined result. This is why you get an extra module_name level next to the existing module_name column, plus a second level that contains the original row indices of the combined results (you can remove both using .reset_index(drop=True)).
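Here is a small sketch of that behaviour on the example data. The seed and the names rng, result, and flat are my own; group_keys=True is already the default and is spelled out only to make the point visible. Depending on your pandas version, the call may warn about, or leave out, the module_name column in the result, so the sketch only inspects the index.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(
    {
        "time": np.tile(np.arange(0, 5), 10),
        "module_name": np.repeat([f"module_{i}" for i in range(10)], 5),
        "latency": rng.random(50),
    }
)

# Take the first two rows of every group; the group key is prepended to the index.
result = df1.groupby("module_name", group_keys=True).apply(lambda x: x.iloc[0:2])
print(result.index.nlevels)   # number of index levels in the combined result
print(result.index.names[0])  # name of the prepended level

# Dropping the index gives a plain 0..n-1 index again.
flat = result.reset_index(drop=True)
print(flat.head())
```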
It looks like what you are trying to do here is a simpler operation which doesn't require the groupby:
tr2 = df1.loc[df1.time<=1] # in your case time<=999
time module_name latency
0 0 module_0 0.650732
1 1 module_0 0.184202
5 0 module_1 0.122834
6 1 module_1 0.843534
10 0 module_2 0.108903
.. ... ... ...
36 1 module_7 0.720628
40 0 module_8 0.694778
41 1 module_8 0.649239
45 0 module_9 0.024248
46 1 module_9 0.468306
[20 rows x 3 columns]
Another option, which simply takes the first N observations, is to use .head(N), as in this answer: https://stackoverflow.com/a/20069379/3828592
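As a quick sketch of the .head(N) route, assuming it is called on the groupby object so that it takes the first N rows of each group (the seed and the names rng, first_two, and by_time are mine):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; the seed is an assumption added for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(
    {
        "time": np.tile(np.arange(0, 5), 10),
        "module_name": np.repeat([f"module_{i}" for i in range(10)], 5),
        "latency": rng.random(50),
    }
)

# First two rows of each module; the original index and columns are kept.
first_two = df1.groupby("module_name").head(2)

# In this example the rows are ordered by time within each module,
# so the result matches the boolean filter on time.
by_time = df1.loc[df1.time <= 1]
print(first_two.equals(by_time))
```

Note that the two approaches only agree when each group is already sorted by time; .head(N) picks rows by position, while the filter picks them by value.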