create new dataframe based upon max value in one column and corresponding value in a second column

Question

I have a dataframe created by extracting data from a source (network wireless controller).

Dataframe is created off of a dictionary I build. This is basically what I am doing (a sample to show structure - not the actual dataframe):

df = pd.DataFrame({'AP-1': [30, 32, 34, 31, 33, 35, 36, 38, 37],
                   'AP-2': [30, 32, 34, 80, 33, 35, 36, 38, 37],
                   'AP-3': [30, 32, 81, 31, 33, 101, 36, 38, 37],
                   'AP-4': [30, 32, 34, 95, 33, 35, 103, 38, 121],
                   'AP-5': [30, 32, 34, 31, 33, 144, 36, 38, 37],
                   'AP-6': [30, 32, 34, 31, 33, 35, 36, 110, 37],
                   'AP-7': [30, 87, 34, 31, 111, 35, 36, 38, 122],
                   'AP-8': [30, 32, 99, 31, 33, 35, 36, 38, 37],
                   'AP-9': [30, 32, 34, 31, 33, 99, 88, 38, 37]}, index=['1', '2', '3', '4', '5', '6', '7', '8', '9'])


df1 = df.transpose()

This works fine.

Note about the data. Columns 1,2,3 are 'related'. They go together. Same for columns 4,5,6 and 7,8,9. I will explain more shortly.

Columns 1, 4, 7 are client count. Columns 2, 5, 8 are channel util on the 5 Ghz spectrum. Columns 3, 6, 9 are channel util on the 2.4 Ghz spectrum.

Basically I take a reading at 5 minute intervals. The above would represent three readings at 5 minute intervals.

What I want is two new dataframes, two columns each, constructed as follows:

Examine the 5 Ghz columns (here it is 2, 5, 8). Which ever has the highest value becomes column 1 in the new dataframe. Column 2 would be the value of the client count column related to the 5 Ghz column with the highest value. In other words, if column 2 were the highest out of columns 2, 5, 8, then I want the value in column 1 to be the value in the new dataframe for the second column. If the value in column 8 were highest, then I want to also pull the value in column 7. I want the index to be same in the new dataframes as the original -- AP name.

I want to do this for all rows in the 'main' dataframe. I want two new dataframes -- so I will repeat this exact procedure for the 5 Ghz columns and the 2.4 (columns 3, 6, 9 -- also grabbing the corresponding highest client count value for the second column in the new dataframe.

What I have tried:

First I broke the main dataframe into three: df1 has all the client count columns, df2 has the 5 Ghz, and df3 has the 2.4 info, using this:

            # create client count only dataframe
            df_cc = df[df.columns[::3]]
            print(df_cc)
            print()

            # create 5Ghz channel utilization only dataframe
            df_5Ghz = df[df.columns[1::3]]
            print(df_5Ghz)
            print()

            # create 2.4Ghz channel utilization only dataframe
            df_24Ghz = df[df.columns[2::3]]
            print(df_24Ghz)
            print()

This works.

I thought I could then reference the main dataframe, but I don't know how.

Then I found this:

extract column value based on another column pandas dataframe

The query option looked great, but I don't know the value. I need to first discover the max value of the 2.4 and 5 Ghz columns respectively, then grab the corresponding client count value. That is why I first created dataframes containing the 2.4 and 5 Ghz values only, thinking I could first get the max value of each row, then do a lookup on the main dataframe (or use the client count onlydataframe I created), but I just do not know how to realize this idea.

Any assistance would be greatly appreciated.

score 0 · Accepted Answer · answered Oct 14 '18 at 18:54

You can get what you want in 3 steps:

# connection between columns
mapping = {'2': '1', '5': '4', '8': '7'}

# 1. column with highest value among 5GHz values (pandas series)
df2 = df1.loc[:, ['2', '5', '8']].idxmax(axis=1)
df2.name = 'highest value'

# 2. column with client count corresponding to the highest value (pandas series)
df3 = df2.apply(lambda x: mapping[x])
df3.name = 'client count'

# 3. build result using 2 lists of columns (pandas dataframe)
df4 = pd.DataFrame(
    {df.name: [
         df1.loc[idx, col]
         for idx, col in zip(df.index, df.values)]
     for df in [df2, df3]},
    index=df1.index)
print(df4)

Output:

      highest value  client count
AP-1             38            36
AP-2             38            36
AP-3             38            36
AP-4             38           103
AP-5             38            36
AP-6            110            36
AP-7            111            31
AP-8             38            36
AP-9             38            88

I guess while not sure it would be easier to solve the issue (and faster to compute) without pandas using just built-in python data types - dictionaries and lists.

create new dataframe based upon max value in one column and corresponding value in a second column

1 Answers1