2

I would like to take pandas dataframe, group it by one column, sort it by another column and take first element from third column and populate original dataframe.

Here is my original df. I would to group by col_1, sort by col_2 (ascending) and take first element from col_3 and populate col_4 with results.

df_in = pd.DataFrame({'col_1':['A', 'A', 'A', 'B', 'B', 'B'], 'col_2': [5,9,2, 3,7,1],
                   'col_3': ['c','d','k','n','l','f']})

[original_df[1]

Here is how output df should look like:

df_out = pd.DataFrame({'col_1':['A', 'A', 'A', 'B', 'B', 'B'], 'col_2': [5,9,2, 3,7,1],
                   'col_3': ['c','d','k','n','l','f'], 'col_4': ['k','k','k','f','f','f'], })

final_df

I can accomplish grouping and sorting with group and transform, but how to extract first element is not clear.

Sorry SO does not display images correctly ;-(

user1700890
  • 7,144
  • 18
  • 87
  • 183

4 Answers4

3
 df['col_4']=df.sort_values(['col_1','col_2']).groupby('col_1')['col_3'].transform(lambda x: x.iloc[0])

Output:

  col_1  col_2 col_3 col_4
0     A      5     c     k
1     A      9     d     k
2     A      2     k     k
3     B      3     n     f
4     B      7     l     f
5     B      1     f     f
iamchoosinganame
  • 1,090
  • 6
  • 15
  • 1
    Is it necessary `sort_values(['col_1','col_2'])`, I only need to sort by col_2. Maybe `sort_values(['col_2'])` is enought? – user1700890 May 31 '19 at 17:21
2

Try, given your index is as shown,

df_in.sort_values('col_2',  inplace=True)
df_in['col_4'] = df_in.groupby('col_1').col_3.transform('first')

df_in.sort_index(inplace=True)

Output:

+----+---------+---------+---------+---------+
|    | col_1   |   col_2 | col_3   | col_4   |
|----+---------+---------+---------+---------|
|  0 | A       |       5 | c       | k       |
|  1 | A       |       9 | d       | k       |
|  2 | A       |       2 | k       | k       |
|  3 | B       |       3 | n       | f       |
|  4 | B       |       7 | l       | f       |
|  5 | B       |       1 | f       | f       |
+----+---------+---------+---------+---------+
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
1

You can use

first_values = df_in.sort_values(['col_1','col_2']).groupby('col_1')['col_3'].first().rename('col_4')
df_in = df_in.join(first_values, on='col_1')

Output:

  col_1  col_2 col_3 col_4
0     A      5     c     k
1     A      9     d     k
2     A      2     k     k
3     B      3     n     f
4     B      7     l     f
5     B      1     f     f
WebDev
  • 1,211
  • 11
  • 17
1

Try some thing diff idxmin

s=df_in.groupby(['col_1']).col_2.transform('idxmin')
df_in['New']=df_in.col_3.reindex(s).values
df_in
Out[469]: 
  col_1  col_2 col_3 New
0     A      5     c   k
1     A      9     d   k
2     A      2     k   k
3     B      3     n   f
4     B      7     l   f
5     B      1     f   f
BENY
  • 317,841
  • 20
  • 164
  • 234