Python Pandas Looping

Question

I'm relatively new to Python and pandas (and coding more generally).

I've been trying to figure out the best way to concatenate specific rows in a pandas DataFrame. I have a df that looks something like this:

import pandas as pd

d = {'sheet_index':[1,2,2,3,3,3,4,5],
'Type':['Note','Note','Note','Note','Note','Note','Note','Note'],
'Strings_To_Concat':['string1','string2','string3','string4','string5','string6','string7','string8']}
df = pd.DataFrame(data=d)
df

          sheet_index   Type    Strings_To_Concat
0         1             Note    string1
1         2             Note    string2
2         2             Note    string3
3         3             Note    string4
4         3             Note    string5
5         3             Note    string6
6         4             Note    string7
7         5             Note    string8

I want to concatenate the strings in the 'Strings_To_Concat' column for rows that share the same sheet_index value and have the Type 'Note'. The concatenated string should go on the last row of each group, the other rows in that group should become null, and rows whose sheet_index isn't duplicated should be left unchanged. The result should look something like this:

          sheet_index   Type    Strings_To_Concat
0         1             Note    string1
1         2             Note    null
2         2             Note    string3 string2
3         3             Note    null
4         3             Note    null
5         3             Note    string6 string5 string4
6         4             Note    string7
7         5             Note    string8

The problem I'm running into is really an algorithmic one. I've tried looping through the DataFrame to find the rows of Type 'Note' (there are several types in my real dataset) and, whenever a row's sheet_index matches the next row's, concatenating the current row's string onto the next row's.

    for i in range(0,len(df)-1):
         if(df.iloc[i,1] == 'Note' and df.iloc[i,0] != df.iloc[i+1,0]):
            df.loc[i,2] = str(df.iloc[i,2])
         elif(df.iloc[i,1] == 'Note' and df.iloc[i,0] == df.iloc[i+1,0]):
            df.loc[i+1,2] = str(df.iloc[i+1,2]) + ' ' + str(df.iloc[i,2])

However, this produces a strange result. It creates a 4th column with the header '2' and gives me:

          sheet_index   Type    Strings_To_Concat            2
0         1             Note    string1                      string1
1         2             Note    string2                      null
2         2             Note    string3                      string3
3         3             Note    string4                      null
4         3             Note    string5                      string5 string4
5         3             Note    string6                      string6
6         4             Note    string7                      string7
7         5             Note    string8                      null

Can someone help explain why it's giving me a new column and why the loop isn't behaving as I expect? Is there a more elegant way to do this with pandas methods rather than basic for loops?

Any help is very much appreciated.

score 1 · Answer 1 · answered Apr 27 '22 at 16:32

Solution:

m1 = df['Type'].eq('Note')
m2 = df['sheet_index'].duplicated(keep='last')

df['col'] = df[m1].groupby('sheet_index')['Strings_To_Concat'].transform(', '.join)
df['col'] = df['col'].fillna(df['Strings_To_Concat']).mask(m1 & m2)

How does it work?

Select the rows where Type == 'Note', group them by sheet_index, transform Strings_To_Concat with ', '.join, and assign the result to the new column col.
Fill the null values in col with the values from Strings_To_Concat, which preserves the values of rows where Type != 'Note'.
Finally, mask (set to NaN) the rows that have Type == 'Note' and a sheet_index that is repeated in a later row, so only the last row of each group keeps the joined string.
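The question's sample data contains only 'Note' rows, so the fill step has no visible effect there. Here is a minimal sketch of the same three steps on a smaller frame, where the 'Other' row is a hypothetical addition made purely for illustration:

```python
import pandas as pd

# Small frame; the 'Other' row is hypothetical and not part of the question's data.
df = pd.DataFrame({
    "sheet_index": [1, 2, 2, 3],
    "Type": ["Note", "Note", "Note", "Other"],
    "Strings_To_Concat": ["string1", "string2", "string3", "string4"],
})

m1 = df["Type"].eq("Note")                      # rows of Type 'Note'
m2 = df["sheet_index"].duplicated(keep="last")  # True for every row except the last of each sheet_index

# Step 1: join strings per sheet_index, for 'Note' rows only.
# Assignment aligns on the index, so the 'Other' row receives NaN.
df["col"] = df[m1].groupby("sheet_index")["Strings_To_Concat"].transform(", ".join)

# Step 2: restore the original string wherever step 1 left NaN (the non-'Note' rows).
filled = df["col"].fillna(df["Strings_To_Concat"])

# Step 3: blank out every 'Note' row that is not the last row of its group.
df["col"] = filled.mask(m1 & m2)

print(df["col"].tolist())
# ['string1', nan, 'string2, string3', 'string4']
```

The 'Other' row keeps its original string because of step 2, and only row 1 (the earlier of the two sheet_index 2 rows) is masked.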

Result

   sheet_index  Type Strings_To_Concat                        col
0            1  Note           string1                    string1
1            2  Note           string2                        NaN
2            2  Note           string3           string2, string3
3            3  Note           string4                        NaN
4            3  Note           string5                        NaN
5            3  Note           string6  string4, string5, string6
6            4  Note           string7                    string7
7            5  Note           string8                    string8
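
As for why the original loop added a column named 2: .loc selects by label while .iloc selects by position. The frame has no column labelled 2, so an assignment such as df.loc[i, 2] = ... creates one instead of writing to the third column. A minimal sketch of the difference, on a cut-down two-row frame used only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "sheet_index": [1, 2],
    "Type": ["Note", "Note"],
    "Strings_To_Concat": ["string1", "string2"],
})

# Label-based: no column is labelled 2, so pandas adds a new column named 2.
label_based = df.copy()
label_based.loc[0, 2] = "changed"
print(label_based.columns.tolist())
# ['sheet_index', 'Type', 'Strings_To_Concat', 2]

# Position-based: column position 2 is 'Strings_To_Concat', so it is updated in place.
positional = df.copy()
positional.iloc[0, 2] = "changed"
print(positional.columns.tolist())
# ['sheet_index', 'Type', 'Strings_To_Concat']
```

Writing df.loc[i, 'Strings_To_Concat'] would also target the intended column. Two other things in the loop contribute to the output shown in the question: the first branch re-reads the untouched original column with df.iloc[i, 2], so it overwrites any concatenated value already written to that row, and range(0, len(df)-1) never visits the last row.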

Wow, apparently pandas is a lot more powerful than I thought, to have methods that can do this so easily. What change would I make if I wanted to swap the order of the strings in the new 'col', so that the result is string3, string2 and string6, string5, string4 instead of what's above? - Jonathan Tran, Apr 27 '22 at 17:01
@JonathanTran The change is simple: just reverse the dataframe before `groupby`. Replace `df[m1]` with `df[m1][::-1]` - Shubham Sharma, Apr 27 '22 at 17:06
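
Applied to the question's full sample data, the reversal suggested in the comment can be sketched as follows. The ' '.join separator is a substitution made here to match the space-separated output the question asked for; the answer itself uses ', '.join.

```python
import pandas as pd

df = pd.DataFrame({
    "sheet_index": [1, 2, 2, 3, 3, 3, 4, 5],
    "Type": ["Note"] * 8,
    "Strings_To_Concat": [f"string{i}" for i in range(1, 9)],
})

m1 = df["Type"].eq("Note")
m2 = df["sheet_index"].duplicated(keep="last")

# Reverse the 'Note' rows so each group is joined last-to-first.
# The reversed rows keep their original index labels, so the assignment still aligns.
df["col"] = df[m1][::-1].groupby("sheet_index")["Strings_To_Concat"].transform(" ".join)
df["col"] = df["col"].fillna(df["Strings_To_Concat"]).mask(m1 & m2)

print(df["col"].tolist())
# ['string1', nan, 'string3 string2', nan, nan, 'string6 string5 string4', 'string7', 'string8']
```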
