2

I have a dataframe which contains these 4 numeric columns: ['ID', 'A', 'B', 'C']

I want to filter the data in order to obtain a dataframe where, for each unique value in column ID, i get the rows, not duplicated, which correspond the maximum and minimum value of columns A,B,C

The image below shows the input dataframe and the desired output dataframe.

I reported also the df #2 to highlight in blue the rows that are different from a simply max/min searching. Because some of them are duplicated and then should be replaced with the second/third.. maximum/minimum row.

For example, third row of df2 is replaced with the row which contains the second maximum value in column B (63), which is the third row of df1. In the same way, the fourth row of df2 is replaced with the fourth row of df1, because it contains the second minimum of column B (-75)

Moreover:

  1. The number of columns can change, meaning in a larger problem, I could have more columns than just ['A'],['B'], and ['C']

  2. The number of rows for ID can change

  3. The total rows of df3 should be UniqueID*Columns*2

Currently i'm only able to get df2, using idxmax() / idxmin() and then reindex the dataframe

df1 = pd.DataFrame({'ID': pd.Series([1. ,1. , 1. , 1  , 2 , 2, 2,2,2,2,2]),
   'A': pd.Series([100. , -97. , -56. , 69  , 150 , -120, 30,92,35,-41,-75]),
   'B': pd.Series([99., -96., 63., -75., 140, -110, 91,-62,76,10,2]),
   'C': pd.Series([98., -95., -45., 39., 130, -100,90,-50,70,-17,33])})



max = df1.groupby('ID')['A', 'B','C'].idxmax().as_matrix()
min = df1.groupby('ID')['A', 'B','C'].idxmin().as_matrix()

index = []
for i in range(len(max)):
    for j in range(len(max[0])):
        index.append(max[i][j])
        index.append(min[i][j])

df2 = df1.reindex(index)

How can i get df3? The dataframe is large (>1M rows), so I not only need a solution that works, but I also need a solution that is efficient.

TuoCuggino
  • 365
  • 1
  • 4
  • 13
  • 4
    I can't seem to understand what you are trying to do. – Allen Qin Mar 03 '18 at 21:29
  • basically i need to find, for each unique values in columns ID, the rows which has the maximum value in column P, D and C. And the recreate the dataframe with only the maximum rows. But i need to replace the duplicates rows with the second maximum value – TuoCuggino Mar 03 '18 at 21:38
  • so for each unique ID, you want 3 rows, ie, row with max P, max D and max C? – Allen Qin Mar 03 '18 at 21:45
  • yes, i'm already able to do that using idmax(). But this give me some duplicate rows. For example when the row with the maximum value for D is the same of P – TuoCuggino Mar 03 '18 at 21:48
  • so ID 0 also has duplicate rows, why do you keep them as is then? – Allen Qin Mar 03 '18 at 21:57
  • because there is only one row with ID =1 in input dataframe, and i'm searching the maximum of two columns, so at the end there should be two rows with index 0 (ID =1) – TuoCuggino Mar 03 '18 at 22:06
  • please read https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples, provide code would create your desired output dataframe so that people on stackoverflow can test their solutions. screenshots of tables of data are not very good. – Haleemur Ali Mar 05 '18 at 22:02
  • Do either answers accurately answer your question, if so mark it so I can award the bounty – DJK Mar 12 '18 at 21:37
  • hi, technically no, no one answered completely my question – TuoCuggino Mar 12 '18 at 22:30

2 Answers2

3

There is a fast way to keep only the unique rows: df3 = df1.reindex(set(index)). This will keep only the 1st maximum values. Now you can remove the rows with 1st maximum values from df1 by df1 = df1.drop(df3.index), and repeat the whole procedure as many times as you need (e.g. 3 times)

import pandas as pd
df1 = pd.DataFrame({'ID': pd.Series([1. ,1. , 1. , 1  , 2 , 2, 2,2,2,2,2]),
   'A': pd.Series([100. , -97. , -56. , 69  , 150 , -120, 30,92,35,-41,-75]),
   'B': pd.Series([99., -96., 63., -75., 140, -110, 91,-62,76,10,2]),
   'C': pd.Series([98., -95., -45., 39., 130, -100,90,-50,70,-17,33])})

def keep_minmax(df1):
    df_max = df1.groupby('ID')['A', 'B','C'].idxmax().as_matrix()
    df_min = df1.groupby('ID')['A', 'B','C'].idxmin().as_matrix()
    index = []
    for i in range(len(df_max)):
        for j in range(len(df_max[0])):
            index.append(df_max[i][j])
            index.append(df_min[i][j])
    return df1.reindex(set(index))

df = df1.copy()
results = []
for i in range(3):
    result = keep_minmax(df)
    result['order'] = i + 1
    results.append(result)
    df = df.drop(result.index)
df3 = pd.concat(results).sort_values(['ID', 'order'])
print(df3)

it will output

        A      B      C   ID  order
0   100.0   99.0   98.0  1.0      1
1   -97.0  -96.0  -95.0  1.0      1
2   -56.0   63.0  -45.0  1.0      2
3    69.0  -75.0   39.0  1.0      2
4   150.0  140.0  130.0  2.0      1
5  -120.0 -110.0 -100.0  2.0      1
6    30.0   91.0   90.0  2.0      2
7    92.0  -62.0  -50.0  2.0      2
10  -75.0    2.0   33.0  2.0      2
8    35.0   76.0   70.0  2.0      3
9   -41.0   10.0  -17.0  2.0      3

You can see that for ID=1, there is no 3rd order, because all the rows in df1 are already exhausted, and you would have to include duplicate rows (as in your example df3). Do you really want it?

I am asking this, because from your post it is not clear what to do in ambiguous cases: if different rows correspond to k'th best value in different columns, or if this k itself is different for different columns. For example, what kind of df3 would you produce from such df, and why? For simplicity, let's extract only the max values:

   A  B   ID
0  2  1  1.0
1  3  2  1.0
2  1  0  1.0
3  0  3  1.0

My algorithm (looking only for max) would return

   A  B   ID  order
1  3  2  1.0      1
3  0  3  1.0      1
0  2  1  1.0      2
2  1  0  1.0      3

Notice that the row (2, 1) which is 2nd by A, and 3rd by B, is included into the 2'nd order, because it is higher.

Do you have any alternative suggestion how to handle such ambiguities?

David Dale
  • 10,958
  • 44
  • 73
  • Hi, thanks for your answer. Yes I really need to include duplicate rows, if alternative rows are ended. The rows of output dataframe should be 'uniqueID*column*2'. Your example at the moment give a total of 11 rows, 4 rows for 'ID=1' and 7 for 'ID=2' instead of 6+6. – TuoCuggino Mar 06 '18 at 22:08
  • Regarding the ambiguous cases, the algorithm should search for the maximum values inside of each column indipendently. The result on your second example should be like this: 'pd.DataFrame({'ID': pd.Series([1., 1.]), 'A': pd.Series([3., 0.]), 'B': pd.Series([2., 3.])})'. I don't mind if there are two '3' in the resulting dataframe, the important thing is that the rows are different – TuoCuggino Mar 06 '18 at 22:12
  • Why this result? I really don't understand the logic – David Dale Mar 07 '18 at 04:39
  • Because, the algorithm should search for the maximum in column A, which is `3` in row 1, save the row `(3,2)` as the first row of the result dataframe. Then search for the maximum in column `B` which is `3` in row 3 and save the row `(0,3)` as the second row of the result dataframe. – TuoCuggino Mar 07 '18 at 12:43
2

Use an auxiliary function:

def filter_min_and_max(x):
    y = pd.DataFrame()
    for col in x.columns:
        if col != "ID":
            y[col] = [max(x[col]), min(x[col])]
            # for OP's comment
            y[col] = [val_1 for val in zip(x[col].nlargest(3).tolist(), x[col].nsmallest(3).tolist()) for val_1 in val]
    return y

df1.groupby("ID").apply(lambda x: filter_min_and_max(x)).reset_index().drop(["level_1"], axis=1)
Lambda
  • 1,392
  • 1
  • 9
  • 11
  • hi, this solution give an "TypeError: 'numpy.ndarray' object is not callable" – TuoCuggino Mar 12 '18 at 22:28
  • @tuocuggino, firstly, sorry I wrote a "1" at the end of my answer by mistake, now I remove it. And did you init `df1` like your question? I didn't got any error (when remove the last character `1`). – Lambda Mar 12 '18 at 23:15
  • i solved the error, i'm using pandas 0.20.2 and this is the correct syntax `df1.groupby("ID").apply(lambda x: filter_min_and_max(x)).reset_index().drop(["level_1"], axis=1)` Anyway, with your solution i'm getting df3 but without the blue rows, 4 rows in output instead of 12. – TuoCuggino Mar 13 '18 at 22:30
  • @tuocuggino, I can not understand the logic to replace the duplicated rows in `df3`. If you want to remain the elements besides maximum and minimum values, you can change `y[col] = [max(x[col]), min(x[col])]` to `y[col] = [max(x[col]), min(x[col])] + list(x[col][~x[col].isin([max(x[col]), min(x[col])])])`. – Lambda Mar 14 '18 at 00:06
  • can you explain your doubts? i'll try to clarified them. I've added an example in main post, maybe it will helps – TuoCuggino Mar 17 '18 at 11:02
  • @tuocuggino, I think I understand your explanation, I update my answer, please check it. It made me confused that some rows of your desired output `df3` is not totally replaced. – Lambda Mar 17 '18 at 12:03
  • i've checked and actually it doesn't work, for example the output third row is `[ 1.0, 69, 63, 39]` instead it should be `[ 1.0, -56, 63, 45]`. This is due to the fact that your algorithm give the first/second.. max/min values for each rows and column, and not the rows which correspond the first/second.. max/min for that column – TuoCuggino Mar 18 '18 at 16:35
  • I read your example carefully again, I think what you really mean is that if N%2==1(N based on 1), the `N`th row of the result is the row of df1 which contains the `(N+1)//2`th `maximum` value of the Nth column of df1.columns; if N%2==0, the `N`th row of the result is the row of df1 which contains the `(N+1)//2`th `minimum` value of the Nth column of df1.columns. For example, for ID 1, the 3rd row of the result is the row of df1 which contains the `(3+1)//2`nd maximum(because 3%2==1) value of column B(because B is the `(3-1)`nd column of df1.columns). Am I right? – Lambda Mar 19 '18 at 03:08
  • No, the `3rd` row of df3 should contains the row with the maximum of column B (`row 1 of df1`) but, since is already used in the first row of df3, it is replaced with the row which contains the second maximum (`row 3 of df1`). The row 1 of df1 is the first row of df3 because it contains the maximum of column A. – TuoCuggino Mar 20 '18 at 22:02