I have a long pandas dataframe of emails (90,000 rows) and I want to create a new dataframe in which the emails are grouped by subject. For example, if I have 3 emails with the subject 'hello', one column would hold the subject and the other column would hold a list of the 3 email IDs for those emails. So far I have:
index = 0
for i in range(df.shape[0]):
    count = 0
    for x in range(bindf.shape[0]):
        if df['Subject'][i] == bindf['Subject'][x]:
            bindf['emailID'][x].append(df['Message-ID'][i])
            count = 1
    if count == 0:
        bindf.iloc[index] = [df['Subject'][i], df['Message-ID'][i]]
        bindf['emailID'][index] = bindf['emailID'][index].split(' ', maxsplit=0)
        index = index + 1
This works, but it is incredibly slow to the point where I would need multiple hours to run it.
NOTE: every email has a subject. The email ID is a string in the original dataframe; in the new dataframe I want it to be an element of a list.
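To make the desired result concrete, here is a minimal sketch on toy data. The sample values are made up, and the `groupby` plus `agg(list)` approach is an assumption about what could replace the nested loops; it has not been run against the full 90,000-row dataframe. The column names `Subject`, `Message-ID` and `emailID` are taken from the code above.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are made up for illustration).
df = pd.DataFrame({
    "Subject": ["hello", "meeting", "hello", "hello"],
    "Message-ID": ["id1", "id2", "id3", "id4"],
})

# Group rows by subject and collect each group's Message-IDs into a list.
# sort=False keeps subjects in order of first appearance, as the loop would.
bindf = (
    df.groupby("Subject", sort=False)["Message-ID"]
      .agg(list)
      .reset_index()
      .rename(columns={"Message-ID": "emailID"})
)

print(bindf)
```

This does a single grouped pass over the data instead of comparing every email against every subject found so far, which is likely where most of the running time goes in the loop version.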