
I have the following two dataframes:

>>> df1
  c1  c2  v1  v2
0  A NaN   9   2
1  B NaN   2   5
2  C NaN   3   5
3  D NaN   4   2

>>> df2
   c1 c2  v1  v2
0   A  P   4   1
1   A  T   3   1
2   A  Y   2   0
3   B  P   0   1
4   B  T   2   2
5   B  Y   0   2
6   C  P   1   2
7   C  T   1   2
8   C  Y   1   1
9   D  P   1   1
10  D  T   2   0
11  D  Y   1   1

I need to concatenate the dataframes and sort them, or sort first and then concatenate. The first dataframe needs to be sorted on the v1 column; the second dataframe then needs to be sorted on two keys: the order of the c1 values obtained by sorting the first dataframe, and then its own v2 column.

A working version is something like this: sort the first dataframe on v1, then iterate over its rows, filter the second dataframe on the matching c1 value, sort the filtered slice on v2, and finally concatenate all the frames.

result = []
for i, row in df1.sort_values('v1').iterrows():
    result.append(row.to_frame().T)                                # the df1 row itself
    result.append(df2[df2['c1'].eq(row['c1'])].sort_values('v2'))  # its df2 group, sorted on v2

The resulting dataframe after sorting:

>>> pd.concat(result, ignore_index=True)
   c1   c2 v1 v2
0   B  NaN  2  5
1   B    P  0  1
2   B    T  2  2
3   B    Y  0  2
4   C  NaN  3  5
5   C    Y  1  1
6   C    P  1  2
7   C    T  1  2
8   D  NaN  4  2
9   D    T  2  0
10  D    P  1  1
11  D    Y  1  1
12  A  NaN  9  2
13  A    Y  2  0
14  A    P  4  1
15  A    T  3  1

The problem with the above approach is that it is iterative, and it becomes inefficient as the number of dataframes and/or the number of rows in these dataframes increases. The real use-case scenario has from 2 to 6 dataframes, with row counts ranging from a few thousand to a hundred thousand.

UPDATE:

Either sorting the dataframes first and then concatenating them, or concatenating the dataframes first and then sorting, will be fine; that is why I included both dataframes instead of just concatenating them and presenting a single dataframe.

EDIT:

Here are 4 dataframes from the actual use-case scenario:

from  math import nan
import pandas as pd

df4 = pd.DataFrame({'c1': ['BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT'], 'c2': ['D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2', 'w2'], 'c3': ['BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 
'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH', 'BAF', 'BAF', 'BAF', 'BAF', 'BAF', 'WH', 'WH', 'WH', 'WH', 'WH'], 'c4': ['001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss', '001', '002', '003', '004', 'mss'], 'v1': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 2, 2, 0, 1, 0, 2, 3, 6, 2, 0, 2, 2, 0, 1, 0, 1, 3, 5, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 2, 2, 0, 1, 0, 2, 3, 6, 2, 0, 2, 2, 0, 1, 0, 1, 3, 5, 1, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 2, 2, 0, 2, 0, 2, 4, 6, 4, 0, 2, 2, 0, 1, 0, 2, 3, 6, 2, 0, 2, 2, 0, 1, 0, 1, 3, 5, 1, 0], 'v2': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 0, 0, 1, 0, 2, 3, 5, 4, 0, 0, 0, 0, 1, 0, 1, 3, 5, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 0, 0, 1, 0, 2, 3, 5, 4, 0, 0, 0, 0, 1, 0, 1, 3, 5, 3, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 1, 0, 1, 0, 2, 4, 6, 5, 0, 0, 0, 0, 1, 0, 2, 3, 5, 4, 0, 0, 0, 0, 1, 0, 1, 3, 5, 3, 0], 'v3': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 
0, 0, 1, 5, 9, 7, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 1, 2, 1, 0, 0, 0, 4, 6, 4, 0, 1, 2, 1, 0, 0, 0, 2, 6, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 1, 2, 1, 0, 0, 0, 4, 6, 4, 0, 1, 2, 1, 0, 0, 0, 2, 6, 3, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 1, 2, 1, 0, 0, 1, 5, 9, 7, 0, 1, 2, 1, 0, 0, 0, 4, 6, 4, 0, 1, 2, 1, 0, 0, 0, 2, 6, 3, 0], 'v4': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 3, 4, 1, 2, 0, 4, 10, 17, 10, 0, 3, 4, 1, 2, 0, 2, 8, 16, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 3, 4, 1, 2, 0, 4, 10, 17, 10, 0, 3, 4, 1, 2, 0, 2, 8, 16, 7, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 3, 5, 1, 3, 0, 5, 13, 21, 16, 0, 3, 4, 1, 2, 0, 4, 10, 17, 10, 0, 3, 4, 1, 2, 0, 2, 8, 16, 7, 0]})
df3 = pd.DataFrame({'c1': ['BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT'], 'c2': ['D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2', 'D1', 'D1', 'D1', 'Sc', 'Sc', 'Sc', 'w1', 'w1', 'w1', 'w2', 'w2', 'w2'], 'c3': ['BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss', 'BAF', 'WH', 'mss'], 'c4': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'v1': [0, 0, 0, 6, 16, 0, 0, 0, 0, 0, 0, 0, 6, 16, 0, 6, 16, 0, 5, 13, 0, 5, 10, 0, 0, 0, 0, 6, 16, 0, 0, 0, 0, 0, 0, 0, 6, 16, 0, 6, 16, 0, 5, 13, 0, 5, 10, 0, 6, 16, 0, 6, 16, 0, 5, 13, 0, 5, 10, 0], 'v2': [0, 0, 0, 2, 17, 0, 0, 0, 0, 0, 0, 0, 2, 17, 0, 2, 17, 0, 1, 14, 0, 1, 12, 0, 0, 0, 0, 2, 17, 0, 0, 0, 0, 0, 0, 0, 2, 17, 0, 2, 17, 0, 1, 14, 0, 1, 12, 0, 2, 17, 0, 2, 17, 0, 1, 14, 0, 1, 12, 0], 'v3': [0, 0, 0, 4, 22, 0, 0, 0, 0, 0, 0, 0, 4, 22, 0, 4, 22, 0, 4, 14, 0, 4, 11, 0, 0, 0, 0, 4, 22, 0, 0, 0, 0, 0, 0, 0, 4, 22, 0, 4, 22, 0, 4, 14, 0, 4, 11, 0, 4, 22, 0, 4, 22, 0, 4, 14, 0, 4, 11, 0], 'v4': [0, 0, 0, 12, 55, 0, 0, 0, 0, 0, 0, 0, 12, 55, 0, 12, 55, 0, 10, 41, 0, 10, 33, 0, 0, 0, 0, 12, 55, 0, 0, 0, 0, 0, 0, 0, 12, 55, 0, 12, 55, 0, 10, 41, 0, 10, 33, 0, 12, 55, 0, 12, 55, 0, 10, 41, 0, 10, 33, 0]})
df2 = pd.DataFrame({'c1': ['BMI', 'BMI', 'BMI', 'BMI', 'BMI', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'DIABP', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'HEIGHT', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'SYSBP', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT', 'WEIGHT'], 'c2': ['D1', 'Sc', 'w1', 'w2', 'mss', 'D1', 'Sc', 'w1', 'w2', 'mss', 'D1', 'Sc', 'w1', 'w2', 'mss', 'D1', 'Sc', 'w1', 'w2', 'mss', 'D1', 'Sc', 'w1', 'w2', 'mss'], 'c3': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'c4': [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], 'v1': [0, 22, 0, 0, 0, 22, 22, 18, 15, 0, 0, 22, 0, 0, 0, 22, 22, 18, 15, 0, 22, 22, 18, 15, 0], 'v2': [0, 19, 0, 0, 0, 19, 19, 15, 13, 0, 0, 19, 0, 0, 0, 19, 19, 15, 13, 0, 19, 19, 15, 13, 0], 'v3': [0, 26, 0, 0, 0, 26, 26, 18, 15, 0, 0, 26, 0, 0, 0, 26, 26, 18, 15, 0, 26, 26, 18, 15, 0], 'v4': [0, 67, 0, 0, 0, 67, 67, 51, 43, 0, 0, 67, 0, 0, 0, 67, 67, 51, 43, 0, 67, 67, 51, 43, 0]})
df1 = pd.DataFrame({'c1': ['BMI', 'DIABP', 'HEIGHT', 'SYSBP', 'WEIGHT', 'mss'], 'c2': [nan, nan, nan, nan, nan, nan], 'c3': [nan, nan, nan, nan, nan, nan], 'c4': [nan, nan, nan, nan, nan, nan], 'v1': [22, 22, 22, 22, 22, 0], 'v2': [19, 19, 19, 19, 19, 0], 'v3': [26, 26, 26, 26, 26, 0], 'v4': [67, 67, 67, 67, 67, 0]})

# Comment for easy code selection

Even for the above four dataframes, the sorting and merging criteria are still the same:

  • Sorting df1 on v1
  • Sorting c2 in df2 on v2, maintaining the order of c1 from df1
  • Sorting c3 in df3 on v3, maintaining the order of c1 from df1, and c2 from df2
  • Sorting c4 in df4 on v4, maintaining the order of c1 from df1, c2 from df2, and c3 from df3

And as the number of dataframes to sort and merge grows, the solution I have used above becomes really inefficient.
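
For concreteness, this is roughly how the iterative approach would have to be extended to the four dataframes above (a sketch only, following the same pattern as the two-dataframe loop; each extra dataframe adds another level of nesting, which is exactly what makes this scale so poorly):

result = []
for _, r1 in df1.sort_values('v1').iterrows():
    result.append(r1.to_frame().T)
    # df2 rows for this c1, sorted on v2
    s2 = df2[df2['c1'].eq(r1['c1'])].sort_values('v2')
    for _, r2 in s2.iterrows():
        result.append(r2.to_frame().T)
        # df3 rows for this (c1, c2), sorted on v3
        s3 = df3[df3['c1'].eq(r2['c1']) & df3['c2'].eq(r2['c2'])].sort_values('v3')
        for _, r3 in s3.iterrows():
            result.append(r3.to_frame().T)
            # df4 rows for this (c1, c2, c3), sorted on v4
            result.append(df4[df4['c1'].eq(r3['c1'])
                              & df4['c2'].eq(r3['c2'])
                              & df4['c3'].eq(r3['c3'])].sort_values('v4'))

out = pd.concat(result, ignore_index=True)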

ThePyGuy
  • The above code doesn't produce the shown result dataframe (in 1.3.2). Are there steps missing? – Henry Ecker Aug 21 '21 at 21:42
  • Never mind. It does if you swap `df1` and `df2`. – Henry Ecker Aug 21 '21 at 21:47
  • @HenryEcker, Thanks! I did not notice it. Initially I was using `df1` as `df2` and `df2` as `df1`, but while asking the question I swapped the data values of `df1` and `df2`: it makes more sense to sort `df1` on `v1` to get `c1` than to sort `df2` on `v1` to get `c1`, which could unnecessarily create confusion. I have updated the code accordingly. – ThePyGuy Aug 21 '21 at 21:52
  • Is df1 guaranteed to have only unique values in c1? – Henry Ecker Aug 21 '21 at 21:56
  • @HenryEcker, yeah that's always the case. – ThePyGuy Aug 21 '21 at 22:05
  • From the example, the rows of `df1` (with `c2` as `NaN`) always show up first in the result. Is it specified that way? I'm asking because @Corralien's answer puts them _last_, at least with my version of pandas and numpy. – Pierre D Aug 30 '21 at 14:29
  • Yeah, that is always the first item within the group, because it represents the maximum value for the `c1` group, which is ordered based on the `v1` column. – ThePyGuy Aug 30 '21 at 14:49
  • I stand corrected: @corralien's output is correct. I was using @HenryEcker's setup, where `df1` and `df2` were swapped. – Pierre D Aug 30 '21 at 15:02
  • @PierreD, that was my mistake when I initially posted the question. Later I swapped df1 and df2 after he pointed it out. – ThePyGuy Aug 30 '21 at 15:05
  • no worries -- glad it all makes sense now. – Pierre D Aug 30 '21 at 15:12
  • got a faster version (fastest, as of this writing)... :-) – Pierre D Aug 30 '21 at 19:32
  • ...and yet another version (fastest through the whole size range). Check out `v3_pd()` in [my answer](https://stackoverflow.com/a/68986386/758174). – Pierre D Aug 31 '21 at 03:01

7 Answers


Another solution using groupby without sorting groups:

import itertools

out = pd.concat([df1.sort_values('v1'),
                 df2.sort_values('v2')],
                 ignore_index=True)
# Original answer
# >>> out.reindex(out.groupby('c1', sort=False)
#         .apply(lambda x: x.index)
#         .explode())

# Faster alternative
>>> out.loc[itertools.chain.from_iterable(out.groupby('c1', sort=False)
                                             .groups.values())]
   c1   c2  v1  v2
0   B  NaN   2   5
8   B    P   0   1
12  B    T   2   2
13  B    Y   0   2
1   C  NaN   3   5
9   C    Y   1   1
14  C    P   1   2
15  C    T   1   2
2   D  NaN   4   2
5   D    T   2   0
10  D    P   1   1
11  D    Y   1   1
3   A  NaN   9   2
4   A    Y   2   0
6   A    P   4   1
7   A    T   3   1
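
A minimal illustration (with made-up data) of why this works: with sort=False, groupby keeps the groups in order of first appearance, and .groups maps each group label to its row indices in concatenation order, so the df1 row (concatenated first) leads each group, followed by the df2 rows already sorted on v2:

import pandas as pd

s = pd.DataFrame({'c1': ['B', 'C', 'B', 'C'], 'x': [1, 2, 3, 4]})
print(s.groupby('c1', sort=False).groups)
# {'B': [0, 2], 'C': [1, 3]}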
Corralien
  • Thanks for the answer, I'll test how well this solution fits in my use-case scenario. – ThePyGuy Aug 21 '21 at 23:12
  • @ThePyGuy, I updated my answer with a faster alternative. Can you check it, please? – Corralien Aug 25 '21 at 06:12
  • Thanks for the updated solution, I'll definitely try both the methods. – ThePyGuy Aug 25 '21 at 06:30
  • wait a minute: this puts the rows of `df1` (with `c2==NaN`) _last_ in each group. The OP wants them first. – Pierre D Aug 30 '21 at 13:41
  • @PierreD. I don't understand why your output is different. The output of my two methods give the expected outcome. – Corralien Aug 30 '21 at 14:52
  • apologies: I copied the testing setup by @HenryHecker, and in that setup `df1`, `df2` are swapped. All good. – Pierre D Aug 30 '21 at 15:00
  • See my [answer](https://stackoverflow.com/a/68986386/758174) for a speed test sweep using `perfplot`: so far your method dominates :-) – Pierre D Aug 30 '21 at 15:16
  • ...no longer, at least for large scale problems (more than between 10k rows and 130k rows in `df2`, depending on the number of groups in `df1`). – Pierre D Aug 30 '21 at 19:28

Here is another way which might help you:

  1. Create a rank dictionary with c1 as keys and rank of v1 as values
  2. Concat both dataframes with keys 1 and 2 (this will help prioritize the left dataframe)
  3. Assign a helper rank column to the concatenated dataframe, then sort on that, then Key, then v2.

Method 1:

def mysort(df1,df2):
    d = dict(zip(df2['c1'],df2['v1'].rank()))
    o = pd.concat((df2,df1),keys=[1,2],names=['Key'])

    return (o.assign(k=o['c1'].map(d)).sort_values(['k','Key','v2'])
            .loc[:,list(df1)])#.reset_index(drop=True)

Method 2:

I think this should work faster, using the same logic but sorting with np.lexsort:

import numpy as np

def mysort_two(df1, df2):
    d = dict(zip(df2['c1'], df2['v1'].rank()))
    o = pd.concat((df2, df1), keys=[1, 2], names=['Key'])
    # np.lexsort sorts on the last key first: c1 rank, then Key, then v2
    a = o.to_numpy()[np.lexsort((o['v2'], o.index.get_level_values('Key'),
                                 o['c1'].map(d)))]
    return pd.DataFrame(a, columns=df2.columns)

#Same can also be written as below:
# def mysort_two(df1,df2):
#     d  = dict(zip(df2['c1'],df2['v1'].rank()))
#     o = pd.concat((df2,df1))
#     a = o.to_numpy()[np.lexsort((o['v2']
#                 ,np.append(np.ones(len(df2)), np.ones(len(df1))*2),
#                 o['c1'].map(d)))]
#     return pd.DataFrame(a,columns=df2.columns)

print(mysort_two(df1,df2)) #method2
#print(mysort(df1,df2)) #method1

   c1   c2  v1  v2
0   B  NaN   2   5
1   B    P   0   1
2   B    T   2   2
3   B    Y   0   2
4   C  NaN   3   5
5   C    Y   1   1
6   C    P   1   2
7   C    T   1   2
8   D  NaN   4   2
9   D    T   2   0
10  D    P   1   1
11  D    Y   1   1
12  A  NaN   9   2
13  A    Y   2   0
14  A    P   4   1
15  A    T   3   1
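
As a side note, np.lexsort treats the last array in its key tuple as the primary sort key, which is why the c1 rank comes last above; a minimal, self-contained demo:

import numpy as np

a = np.array([2, 1, 1])   # primary key (passed last)
b = np.array([9, 8, 7])   # tie-breaker (passed first)
print(np.lexsort((b, a)))
# [2 1 0]  -> sorted by a, ties broken by b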
anky
  • Thanks for the second method as well, I'll try that too – ThePyGuy Aug 27 '21 at 16:33
  • `mysort_two` is much faster in my benchmark. I've updated my timings to include your answers. – Henry Ecker Aug 27 '21 at 16:38
  • @HenryEcker I think even more efficient time can be achieved by replacing `o.index.get_level_values('Key')` with `np.append(np.ones(len(df2)), np.ones(len(df1))*2)` and removing the `keys` param from concat in the second function, but I guess the OP needs to decide whether to trade readability for that or not :-) – anky Aug 27 '21 at 16:53
  • The update gave me almost the exact same timing and variation. `1.69 ms ± 52.9 µs` – Henry Ecker Aug 27 '21 at 16:59
  • @HenryEcker I see ... interesting, I am still rooting for it to be slightly more efficient when the dfs grow, only because I am using a single array instead of keys in concat and then getting their values. Minor things but.... – anky Aug 27 '21 at 17:02
  • I agree. The sample size is kinda small to show scaling. I want to run a perfplot, but I don't know how to build good representative data that will scale (I don't want to make incorrect assumptions about how this problem scales). – Henry Ecker Aug 27 '21 at 17:03
  • @HenryEcker I was trying it out as well but my numpy (`np.typing`) dependency sucks :/ – anky Aug 27 '21 at 17:04
  • @HenryEcker, if you want a larger subset of the sample data, I can provide it. I'll update the question and let you know – ThePyGuy Aug 27 '21 at 17:09
  • @ThePyGuy Also, if you want to test, here is an [example](https://stackoverflow.com/a/54432584/9840637) only if you have numpy 1.20+ I think :-) – anky Aug 27 '21 at 17:11
  • For a perfplot, I really need a function to be able to pass a number `n` and have it programmatically build the representative datafames of that size. That's how we can see how the solutions scale. – Henry Ecker Aug 27 '21 at 17:12
  • @HenryEcker: I built on top of your setup to provide a way to test using `perfplot`. – Pierre D Aug 30 '21 at 15:13

Assuming df2's c1 column contains unique values (as in the OP), we can try establishing a categorical ordering for c1 based on the sorted values of v1 in df2, adding an indicator value to each DataFrame, then concatenating and sorting on the new categorical type (c1), the indicator, and v2.

# Establish ordering based on sorted df2
cat_type = pd.CategoricalDtype(df2.sort_values('v1')['c1'], ordered=True)
new_df = (
    # Add Indicator to each DataFrame
    pd.concat([df.assign(indicator=i) for (i, df) in enumerate([df1, df2])])
        # Set c1 to the categorical ordering from above
        .astype({'c1': cat_type})
        # Sort By Categorical, then df2 first then v2 within dfs
        .sort_values(['c1', 'indicator', 'v2'],
                     ascending=(True, False, True),
                     ignore_index=True)
        # Remove Indicator column
        .drop(columns='indicator')
)

new_df:

   c1   c2  v1  v2
0   B  NaN   2   5
1   B    P   0   1
2   B    T   2   2
3   B    Y   0   2
4   C  NaN   3   5
5   C    Y   1   1
6   C    P   1   2
7   C    T   1   2
8   D  NaN   4   2
9   D    T   2   0
10  D    P   1   1
11  D    Y   1   1
12  A  NaN   9   2
13  A    Y   2   0
14  A    P   4   1
15  A    T   3   1
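
A minimal illustration (with made-up data) of the ordered-categorical trick: once a column is cast to an ordered CategoricalDtype, sort_values follows the category order rather than the lexical order:

import pandas as pd

order = ['B', 'C', 'D', 'A']  # the desired, non-lexical order
s = pd.Series(['A', 'C', 'B', 'D'])
print(s.astype(pd.CategoricalDtype(order, ordered=True)).sort_values())
# 2    B
# 1    C
# 3    D
# 0    A
# dtype: category
# Categories (4, object): ['B' < 'C' < 'D' < 'A']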

Setup: df1 and df2 are defined exactly as in the benchmarking setup below.

Some timing information:

OP

%timeit op_fn(df1, df2)
4.93 ms ± 376 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This answer

%timeit cat_ordered(df1, df2)
3.72 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Andrej Kesely's answer

%timeit groupby_fn(df1, df2)
7.79 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Corralien's answer

%timeit concat_itertools(df1, df2)
1.48 ms ± 82.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@Scott Boston's answer

%timeit helpcol(df1, df2)
3.05 ms ± 44.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@U12-Forward's answer

%timeit U11(df1, df2)
5.41 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@Anky's answers

%timeit mysort_anky(df1, df2)
3.05 ms ± 180 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit mysort_two_anky(df1, df2)
1.69 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Setup:

import itertools

import numpy as np
import pandas as pd
from numpy import nan

df1 = pd.DataFrame({
    'c1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    'c2': ['P', 'T', 'Y', 'P', 'T', 'Y', 'P', 'T', 'Y', 'P', 'T', 'Y'],
    'v1': [4, 3, 2, 0, 2, 0, 1, 1, 1, 1, 2, 1],
    'v2': [1, 1, 0, 1, 2, 2, 2, 2, 1, 1, 0, 1]
})

df2 = pd.DataFrame({
    'c1': ['A', 'B', 'C', 'D'],
    'c2': [nan, nan, nan, nan],
    'v1': [9, 2, 3, 4],
    'v2': [2, 5, 5, 2]
})


def op_fn(df1, df2):
    result = []
    for i, row in df2.sort_values('v1').iterrows():
        result.append(row.to_frame().T)
        result.append(df1[df1['c1'].eq(row['c1'])].sort_values('v2'))
    return pd.concat(result, ignore_index=True)


def cat_ordered(df1, df2):
    # Establish ordering based on df2
    cat_type = pd.CategoricalDtype(df2.sort_values('v1')['c1'], ordered=True)
    return (
        # Add Indicator to each DataFrame
        pd.concat([df.assign(indicator=i) for (i, df) in enumerate([df1, df2])])
            # Set c1 to the categorical ordering from above
            .astype({'c1': cat_type})
            # Sort By Categorical, then df2 first then v2 within dfs
            .sort_values(['c1', 'indicator', 'v2'],
                         ascending=(True, False, True),
                         ignore_index=True)
            # Remove Indicator column
            .drop(columns='indicator')
    )


def groupby_fn(df1, df2):
    y = df1.assign(ind=df1['v1'])
    x = df2.groupby("c1").apply(
        lambda v: pd.concat(
            [y[y["c1"].eq(v["c1"].iat[0])], v.sort_values("v2")]
        )
    )
    x.loc[:, "ind"] = x.loc[:, "ind"].ffill()
    return x.sort_values("ind").drop(columns="ind").reset_index(drop=True)


def concat_itertools(df1, df2):
    out = pd.concat([df1.sort_values('v1'),
                     df2.sort_values('v2')],
                    ignore_index=True)
    return out.loc[itertools.chain.from_iterable(out.groupby('c1', sort=False)
                                                 .groups.values())]


def helpcol(df1, df2):
    cat_type = pd.CategoricalDtype(df2.sort_values('v1')['c1'], ordered=True)
    dfc = pd.concat([df1, df2])
    dfc["c2sort"] = dfc["c2"].notna()
    dfc["c1sort"] = dfc["c1"].astype(cat_type)
    return dfc.sort_values(["c1sort", "c2sort", "v2"], ignore_index=True).drop(
        ["c2sort", "c1sort"], axis=1
    )


def U11(df1, df2):
    df = pd.concat([df1, df2], ignore_index=True)
    return (
        df.reindex(
            df.sort_values('c1')
                .groupby('c1', as_index=False)['v1'].transform('min')
                .squeeze().sort_values().index
        ).reset_index(drop=True)
    )


def mysort_anky(df1, df2):
    d = dict(zip(df2['c1'], df2['v1'].rank()))
    o = pd.concat((df2, df1), keys=[1, 2], names=['Key'])

    return (o.assign(k=o['c1'].map(d)).sort_values(['k', 'Key', 'v2'])
                .loc[:, list(df1)])  # .reset_index(drop=True)


def mysort_two_anky(df1, df2):
    d = dict(zip(df2['c1'], df2['v1'].rank()))
    o = pd.concat((df2, df1), keys=[1, 2], names=['Key'])
    a = o.to_numpy()[np.lexsort((o['v2'], o.index.get_level_values('Key'),
                                 o['c1'].map(d)))]
    return pd.DataFrame(a, columns=df2.columns)
Henry Ecker
  • Thanks for such a detailed answer, it'll take me a while to digest the solution. – ThePyGuy Aug 21 '21 at 22:07
  • That's fine. I'm still thinking. I feel there may be a more clever way to put this together. But I did get a little time off. – Henry Ecker Aug 21 '21 at 22:08
  • Yeah, I also thought so initially. And this sorting I'm looking for is somewhat unusual type of sorting. – ThePyGuy Aug 21 '21 at 22:10
  • One solution that was on my mind was to have the series of `c1` after sorting `df1`, then to assign/map the index and value of the series to a new column to `df2`, and then sort `df2` on that new column and `v2` which sorts `df2` in a single shot, but then concatenating will cause the problem. – ThePyGuy Aug 21 '21 at 22:14
  • @HenryEcker Could you please benchmark also the solution with `.groupby`? +1 :) – Andrej Kesely Aug 21 '21 at 22:26
  • @AndrejKesely done. Seems it's about twice the time as OP. – Henry Ecker Aug 21 '21 at 22:38
  • @HenryEcker Thanks, I've fixed my code with your categorical type, but I still think the groupby solution will be slower (too many dataframes are being created). – Andrej Kesely Aug 21 '21 at 22:48
  • Hi @HenryEcker. If you have time, can you include my solution to your benchmark, please? – Corralien Aug 25 '21 at 06:13
  • Done @Corralien quite speedy on the provided sample. Nice work. – Henry Ecker Aug 25 '21 at 14:25
  • @HenryEcker, Thanks for the benchmarks, it'll help me pick a solution up. – ThePyGuy Aug 26 '21 at 05:20

Here is one way to do this, plus a setup to compare various solutions using perfplot.

As of this writing, v3_pd() below is the fastest solution overall among the ones presented so far, through the full range of sizes (rows of df2) and number of groups (rows of df1).

The idea is to run the sort on groups that are as small as possible (sorting is O(n log n)), even if that means sorting many times, since sum_i(k_i * log k_i) <= (sum_i k_i) * log(sum_i k_i).

# obsolete
# def v_pd(df1, df2):
#     z1 = df1.set_index('c1')
#     ix = z1.sort_values('v1').index
#     z = df2.set_index('c1').groupby(level=0, group_keys=False).apply(
#         pd.DataFrame.sort_values, 'v2')
#     return pd.concat([z1, z]).loc[ix].reset_index()

Edit: faster version, same principle:

def fun(g):
    return pd.Series(g.index.values[np.r_[0, 1 + np.argsort(g.values[1:])]])

def v2_pd(df1, df2):
    z = pd.concat([df1.sort_values('v1'), df2]).reset_index(drop=True)  # make sure we have a clean RangeIndex
    gb = z['v2'].groupby(z['c1'], sort=False)
    return z.iloc[gb.apply(fun).values]

Edit: yet a faster version, still same principle:

def get_group_sort(group_i, k, v):
    ix = np.where(group_i == k)[0]
    ix1 = ix[1:]
    return np.r_[ix[0], ix1[np.argsort(v[ix1])]]

def v3_pd(df1, df2):
    ng = len(df1)
    z = pd.concat([df1.sort_values('v1'), df2]).reset_index(drop=True)  # make sure we have a clean RangeIndex
    gb = z['v2'].groupby(z['c1'], sort=False)
    group_i = gb.ngroup().values
    v2 = z['v2'].values
    ix = np.concatenate([get_group_sort(group_i, k, v2) for k in range(ng)])
    return z.iloc[ix]
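
For reference, the ngroup() helper used in v3_pd() labels each row with its group number, in order of first appearance when sort=False; a minimal demo with made-up data:

import pandas as pd

z = pd.Series([10, 20, 30, 40])
g = pd.Series(['B', 'C', 'B', 'C'])
print(z.groupby(g, sort=False).ngroup().values)
# [0 1 0 1]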

Quick check:

>>> v2_pd(df1, df2).equals(v3_pd(df1, df2))
True

>>> v3_pd(df1, df2)
   c1   c2  v1  v2
0   B  NaN   2   5
1   B    P   0   1
2   B    T   2   2
3   B    Y   0   2
4   C  NaN   3   5
5   C    Y   1   1
6   C    P   1   2
7   C    T   1   2
8   D  NaN   4   2
9   D    T   2   0
10  D    P   1   1
11  D    Y   1   1
12  A  NaN   9   2
13  A    Y   2   0
14  A    P   4   1
15  A    T   3   1

Speed tests

Building on top of @HenryEcker's answer that kindly provided a super-convenient definition of the various functions (and fixing the issue with swapped df1, df2 for helpcol() and cat_ordered()), we further add:

# edit: use floats for v1, v2, so that results can be
# checked for equality

import numpy as np
import pandas as pd
from string import ascii_uppercase

def gen_example(n, ng=10):
    df1 = pd.DataFrame({
        'c1': [f'g{i}' for i in range(ng)],
        'c2': np.nan,
        'v1': np.random.normal(size=ng),
        'v2': np.random.normal(size=ng),
    })
    df2 = pd.DataFrame({
        'c1': df1['c1'].sample(n, replace=True).sort_values(),
        'c2': np.random.choice(list(ascii_uppercase), n),
        'v1': np.random.normal(size=n),
        'v2': np.random.normal(size=n),
    })
    return df1, df2

This generates arbitrarily large df1 and df2.
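
For example, to build a df2 with 10,000 rows spread over 10 groups (sizes chosen arbitrarily):

df1, df2 = gen_example(10_000, ng=10)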

Then, we wrap every function as ker_{funcname}, taking a single argument: the tuple (df1, df2):

import inspect

def is_f_df1_df2(f):
    try:
        return inspect.getfullargspec(f)[0] == ['df1', 'df2']
    except TypeError:
        return False

for k in [k for k, f in locals().items() if is_f_df1_df2(f)]:
    exec(f'def ker_{k}(tup): return {k}(*tup)')

# note: some functions choke on df1, df2, will investigate later
not_working = {
     # ker_cat_ordered,  # Now fixed (was: ValueError: Categorical categories must be unique  -- due to df1, df2 being swapped)
     # ker_helpcol,  # Now fixed (was: ValueError: Categorical categories must be unique  -- due to df1, df2 being swapped)
     ker_op_fn,  # this one works, but it is too slow
     ker_v_pd,  # and this one is too slow too
}

kernels = [
    f for name, f in locals().items()
    if hasattr(f, '__name__') and name.startswith('ker_') and f not in not_working
]

>>> kernels
[<function __main__.ker_cat_ordered(tup)>,
 <function __main__.ker_groupby_fn(tup)>,
 <function __main__.ker_concat_itertools(tup)>,
 <function __main__.ker_helpcol(tup)>,
 <function __main__.ker_U11(tup)>,
 <function __main__.ker_mysort_anky(tup)>,
 <function __main__.ker_mysort_two_anky(tup)>,
 <function __main__.ker_v2_pd(tup)>,
 <function __main__.ker_v3_pd(tup)>]

Now, we can use perfplot:

import perfplot
import matplotlib.pyplot as plt

ng = 100
o = perfplot.bench(
    setup=lambda n: gen_example(n, ng),
    kernels=kernels,
    n_range=np.power(2, np.arange(*np.log2([16, 20e6]))).astype(int),
    equality_check=None,
)

z = pd.DataFrame(dict(zip(o.labels, o.timings_s)), index=o.n_range)
order = z.max().sort_values(ascending=False).index
ax = z[order].plot(logx=True, logy=True)
unit_ticks(z, unit='s', subs=(1, 3), dec=0, log=True, axis=ax.yaxis)
unit_ticks(z.index, unit='', system='si1000', subs=(1, 3), dec=0, log=True, axis=ax.xaxis)

plt.title(f'with ng={ng} groups')
plt.xlabel('n rows in df2')
plt.ylabel('execution time')

(Side note: please forgive me for not listing the code for unit_ticks() here; it is a bit long and beside the point. It just reformats the axis tick labels to make things a bit easier to read.)

This is repeated for other values of ng. Overall, we get (lower is better):

[benchmark plot: execution time vs. number of rows in df2, one curve per solution, repeated for several values of ng]

We can also look at the execution times, relative to that of v3_pd() (lower is better):

[benchmark plot: execution times relative to v3_pd()]

The method presented here (v3_pd()) is the fastest through the whole range of sizes. v2_pd() is second fastest for large problems, but comparatively slow at small sizes.

Pierre D
  • Thank you for the detailed answer and benchmarks – ThePyGuy Aug 30 '21 at 15:14
  • Excellent!! Thanks for sharing :-) *Wish I could upgrade my numpy too for using this* +1 Also looks like I need a mysort1.5 version to balance both :P – anky Aug 30 '21 at 15:16
  • @anky: yes indeed, I have a separate conda env with `numpy>=1.20` just so that I can use `perfplot`. Also, I don't use their `dufte` way of plotting things. I find it beautiful but too restrictive (can't add plot title, etc.) – Pierre D Aug 30 '21 at 15:33
  • ok, just posted a new version `v2_pd` that is so far the fastest for large scale `df2`. – Pierre D Aug 30 '21 at 19:26
  • and now yet a faster version `v3_pd`. That one is fastest through roughly the whole size range. – Pierre D Aug 31 '21 at 02:59

Another solution:

cat_type = pd.CategoricalDtype(df1.sort_values("v1")["c1"], ordered=True)

x = (
    df2.groupby("c1")
    .apply(
        lambda x: pd.concat(
            [df1[df1["c1"].eq(x["c1"].iat[0])], x.sort_values("v2")]
        )
    )
    .reset_index(drop=True)
)
x["c1"] = x["c1"].astype(cat_type)
print(x.sort_values("c1"))

Prints:

   c1   c2  v1  v2
4   B  NaN   2   5
5   B    P   0   1
6   B    T   2   2
7   B    Y   0   2
8   C  NaN   3   5
9   C    Y   1   1
10  C    P   1   2
11  C    T   1   2
12  D  NaN   4   2
13  D    T   2   0
14  D    P   1   1
15  D    Y   1   1
0   A  NaN   9   2
1   A    Y   2   0
2   A    P   4   1
3   A    T   3   1
Andrej Kesely
  • Thanks for the solution, I'll take a look. Just a question, does `df1["ind"] = df1["v1"]` expect `v1` to be unique in `df1`? Because unique value of `v1` in `df1` is just a coincidence in the above dataframe in question. – ThePyGuy Aug 21 '21 at 22:28
  • @ThePyGuy I use the `ind` column just for final sorting - so no, I don't expect it to be unique. – Andrej Kesely Aug 21 '21 at 22:30
  • @ThePyGuy When I'm thinking about it now, it won't work when `v1` is not unique - the groups won't sort correctly. I need to come up with another solution. – Andrej Kesely Aug 21 '21 at 22:33
  • @ThePyGuy Fixed. I've borrowed the categorical type from Henry's solution. But I think the groupby will be slower. – Andrej Kesely Aug 21 '21 at 22:47

Let's create two helper columns to sort the concatenated dataframes, using pd.CategoricalDtype for the c1 ordering. No grouping, no looping, just sorting on pseudo-columns.

def helpcol(df1, df2):
    cat_type = pd.CategoricalDtype(df2.sort_values('v1')['c1'], ordered=True)
    dfc = pd.concat([df1, df2])
    dfc["c2sort"] = dfc["c2"].notna()
    dfc["c1sort"] = dfc["c1"].astype(cat_type)
    return dfc.sort_values(["c1sort", "c2sort", "v2"], ignore_index=True).drop(
        ["c2sort", "c1sort"], axis=1
    )
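
The c2sort helper works because booleans sort False before True, so within each c1 group the row with c2 == NaN (the summary row from df1) comes first; a minimal illustration with made-up data:

import pandas as pd

print(pd.Series([True, False, True]).sort_values())
# 1    False
# 0     True
# 2     True
# dtype: bool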

Using @HenryEcker's setup:

   helpcol(df1, df2)

Output:

   c1   c2  v1  v2
0   B  NaN   2   5
1   B    P   0   1
2   B    T   2   2
3   B    Y   0   2
4   C  NaN   3   5
5   C    Y   1   1
6   C    P   1   2
7   C    T   1   2
8   D  NaN   4   2
9   D    T   2   0
10  D    P   1   1
11  D    Y   1   1
12  A  NaN   9   2
13  A    Y   2   0
14  A    P   4   1
15  A    T   3   1
Scott Boston

Solution:

You could try this two-liner using concat, reindex, sort_values, groupby, transform, squeeze and reset_index:

>>> df = pd.concat([df1, df2], ignore_index=True)
>>> df.reindex(df.sort_values('c1').groupby('c1', as_index=False)['v1'].transform('min').squeeze().sort_values().index).reset_index(drop=True)
   c1   c2  v1  v2
0   B  NaN   2   5
1   B    P   0   1
2   B    T   2   2
3   B    Y   0   2
4   C  NaN   3   5
5   C    P   1   2
6   C    T   1   2
7   C    Y   1   1
8   D  NaN   4   2
9   D    P   1   1
10  D    T   2   0
11  D    Y   1   1
12  A  NaN   9   2
13  A    P   4   1
14  A    T   3   1
15  A    Y   2   0
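
For reference, the transform('min') step broadcasts each group's minimum v1 back onto every row of the group, giving a per-row sort key; a minimal demo with made-up data:

import pandas as pd

d = pd.DataFrame({'c1': ['A', 'A', 'B'], 'v1': [9, 4, 2]})
print(d.groupby('c1')['v1'].transform('min'))
# 0    4
# 1    4
# 2    2
# Name: v1, dtype: int64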

Timings:

Timing with the following code:

import time

def U11():
    for i in range(1000):
        df = pd.concat([df1, df2], ignore_index=True)
        df = df.reindex(df.sort_values('c1').groupby('c1', as_index=False)['v1'].transform('min').squeeze().sort_values().index).reset_index(drop=True)

def ThePyGuy():
    for i in range(1000):
        result = []
        for i,row in df1.sort_values('v1').iterrows():
            result.append(row.to_frame().T)
            result.append(df2[df2['c1'].eq(row['c1'])].sort_values('v2'))
        df = pd.concat(result, ignore_index=True)

a = time.time()
ThePyGuy()
b = time.time()
print('ThePyGuy:', b - a)

a = time.time()
U11()
b=time.time()
print('U11:', b-a)

Output:

ThePyGuy: 5.920747756958008
U11: 5.1511549949646

So my answer is almost a second faster...

U13-Forward