I'm relatively new to python and pandas (and coding more generally).
I've been trying to figure out the best way to concatenate specific rows in a pd dataframe. I have a df that looks something like
d = {'sheet_index':[1,2,2,3,3,3,4,5],
'Type':['Note','Note','Note','Note','Note','Note','Note','Note'],
'Strings_To_Concat':['string1','string2','string3','string4','string5','string6','string7','string8']}
df = pd.DataFrame(data=d)
df
sheet_index Type Strings_To_Concat
0 1 Note string1
1 2 Note string2
2 2 Note string3
3 3 Note string4
4 3 Note string5
5 3 Note string6
6 4 Note string7
7 5 Note string8
I want to concatenate the strings in the 'Strings_To_Concat' column if they have the same sheet_index value and are of the Type 'Note' while creating null values in duplicate sheet_index values and not doing anything if the sheet_index isn't duplicated. The result should look something like the below:
sheet_index Type Strings_To_Concat
0 1 Note string1
1 2 Note null
2 2 Note string3 string2
3 3 Note null
4 3 Note null
5 3 Note string6 string5 string4
6 4 Note string7
7 5 Note string8
The problem I'm running into is really just an algorithmic one as I've said I'm relatively new to coding. I've tried looping through the pd to identify which rows are designated as Type 'Note' since there are several types in my dataset and from there if the sheet index is the same as the next one concatenate the next one with the current one.
for i in range(0,len(df)-1):
if(df.iloc[i,1] == 'Note' and df.iloc[i,0] != df.iloc[i+1,0]):
df.loc[i,2] = str(df.iloc[i,2])
elif(df.iloc[i,1] == 'Note' and df.iloc[i,0] == df.iloc[i+1,0]):
df.loc[i+1,2] = str(df.iloc[i+1,2]) + ' ' + str(df.iloc[i,2])
However, this is producing a strange result. Its creating a 4th column with the header '2' and giving me:
sheet_index Type Strings_To_Concat 2
0 1 Note string1 string1
1 2 Note string2 null
2 2 Note string3 string3
3 3 Note string4 null
4 3 Note string5 string5 string4
5 3 Note string6 string6
6 4 Note string7 string7
7 5 Note string8 null
Can someone help me explain why its giving me a new column and the loop isn't behaving as I am suspecting it should? Is there a more elegant solution using pandas methods to doing what I'm trying to achieve other than just using basic for loops?
Any help is very much appreciated