
I have a dataframe of latency data gathered per kernel module. Each module has a different number of time samples (between 1000 and 3000). I want to slice my data so that every module has an equal number of time samples, specifically time 0 to 1000. Below is my original dataframe:

time, module_name, latency
0, module1, 268
1, module1, 300
...
999, module1, 300
0, module2, 234
1, module2, 345
...
3000, module2, 345

I sliced my data to an even size of 1000 rows per module using the iloc function:

trace1000 = df1.groupby('module_name').apply(lambda x: x.iloc[0:999])

As a result I get the dataframe below. As expected, I get an even number of traces, 1000 per module.

module_name, , module_name, latency
module1, 2000, module1, 268
module1, 2001, module1, 300
...
module2, 9085, module2, 234
module2, 9086, module2, 345
...

But I don't know why I am getting a duplicate 'module_name' column and a strange second column without a name. I tried to drop it, or to select only the columns I need, but failed:

heat_df = trace1000[["module_name","latency"]]

My goal is to draw a seaborn heatmap (x-axis: time, range 1-1000; y-axis: module_name, 30 modules; heat: latency, range 100~900). I'm expecting a graph like the picture below.

[expected heatmap image]

mangosrk

1 Answer


Let's generate a smaller example, consisting of columns time, module_name, and latency. Let time run from 0 to 4, module_name range from "module_0" to "module_9", and let latency be a random variable.

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "time": np.tile(np.arange(0, 5), 10),     # 0-4 repeated for each of 10 modules
        "module_n": np.repeat(np.arange(10), 5),  # module number, 5 rows per module
    }
).assign(
    module_name=lambda x: "module_" + x.module_n.astype(str),
    latency=np.random.random(50)
).drop(columns="module_n")
df1

Here's a preview of the output:

    time module_name   latency
0      0    module_0  0.650732
1      1    module_0  0.184202
2      2    module_0  0.741331
3      3    module_0  0.903374
4      4    module_0  0.440044
..   ...         ...       ...
45     0    module_9  0.024248
46     1    module_9  0.468306
47     2    module_9  0.763958
48     3    module_9  0.556926
49     4    module_9  0.696217

[50 rows x 3 columns]

Now you want to apply a .groupby operation: it splits df1 into subsets by the values in the module_name column and performs an operation on each subset.
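You can see these subsets directly by iterating over the groupby object (a quick sketch using the df1 built above):

for name, group in df1.groupby("module_name"):
    print(name, group.shape)  # each module has 5 rows in this example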

Let's take a look at one of these subsets:

df1_group_module_0 = df1.loc[df1.module_name=="module_0"]
df1_group_module_0
   time module_name   latency
0     0    module_0  0.650732
1     1    module_0  0.184202
2     2    module_0  0.741331
3     3    module_0  0.903374
4     4    module_0  0.440044

The operation that you assigned to be applied with the .groupby is lambda x: x.iloc[0:999]. For this smaller example, let's say I want to get only the first two rows, so I will use x.iloc[0:2]. Let's see what happens when we apply this operation to the group we selected above:

df1_group_module_0.iloc[0:2]
   time module_name   latency
0     0    module_0  0.650732
1     1    module_0  0.184202

What you get is a new dataframe with columns time, module_name, and latency, and it corresponds to the first two rows of module_0.

The .groupby.apply(lambda x: x.iloc[0:2]) will combine the results of applying the above operation to each group, prepending a "module_name" index level to the result. This is why you get an extra module_name column, plus an extra unnamed column that contains the original row indices of the combined results (both of which you can remove using .reset_index(drop=True)).
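Concretely, with the example df1 (a minimal sketch; the printed values assume the same random draw as the preview above):

tr1 = df1.groupby("module_name").apply(lambda x: x.iloc[0:2])
print(tr1.head(4))
#                 time module_name   latency
# module_name
# module_0    0      0    module_0  0.650732
#             1      1    module_0  0.184202
# module_1    5      0    module_1  0.122834
#             6      1    module_1  0.843534

# Drop the prepended MultiIndex to get a flat frame back:
tr1 = tr1.reset_index(drop=True)

# Or avoid the extra index level entirely with group_keys=False:
tr1 = df1.groupby("module_name", group_keys=False).apply(lambda x: x.iloc[0:2])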

It looks like what you are trying to do here is a simpler operation which doesn't require the groupby:

tr2 = df1.loc[df1.time <= 1]  # in your case, time <= 999
tr2
    time module_name   latency
0      0    module_0  0.650732
1      1    module_0  0.184202
5      0    module_1  0.122834
6      1    module_1  0.843534
10     0    module_2  0.108903
..   ...         ...       ...
36     1    module_7  0.720628
40     0    module_8  0.694778
41     1    module_8  0.649239
45     0    module_9  0.024248
46     1    module_9  0.468306

[20 rows x 3 columns]

Another option that simply takes the first N observations per group, without the extra index level, is .groupby('module_name').head(N), as in this answer: https://stackoverflow.com/a/20069379/3828592
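Applied to the small example, that would look like this (a sketch; use head(1000) for your real data):

tr2 = df1.groupby("module_name").head(2)  # first 2 rows per module, original flat index kept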
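Finally, to get from there to the heatmap you describe, one way (a sketch, assuming exactly one latency value per (module_name, time) pair) is to pivot so modules become rows and time steps become columns, then hand the result to sns.heatmap:

import seaborn as sns
import matplotlib.pyplot as plt

# Rows: module_name, columns: time, cell values: latency
heat_df = tr2.pivot(index="module_name", columns="time", values="latency")

sns.heatmap(heat_df)
plt.show()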

Oliver Lopez