I have a long pandas dataframe of emails (90,000 rows), and I want to create a new dataframe in which the emails are grouped by subject. For example, if I have 3 emails with the subject 'hello', one column would hold the subject and the other would hold a list of the 3 email IDs that correspond to those emails. So far I have:

index = 0
for i in range(df.shape[0]):
    count = 0
    for x in range(bindf.shape[0]):
        if (df['Subject'][i] == bindf['Subject'][x]):
            bindf['emailID'][x].append(df['Message-ID'][i])
            count = 1
    if count == 0:
        bindf.iloc[index] = [df['Subject'][i],df['Message-ID'][i]]
        bindf['emailID'][index] = bindf['emailID'][index].split(' ', maxsplit = 0)
        index = index +1

This works, but it is incredibly slow to the point where I would need multiple hours to run it.

NOTE: every email has a subject. The email ID is a string in the original dataframe, and I want it to be an element of a list in the new one.

  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jul 23 '18 at 11:29
  • You could make a new column using something like `df.subject.str.contains('hello')`, and then `groupby` – Josh Friedlander Jul 23 '18 at 11:54
  • You seem to execute the `count == 0` part over and over again. Should it not be on the same level as your outer loop or in a different loop? – guidot Jul 23 '18 at 12:48

1 Answer

If you want to group by exactly identical subjects, you can use:

df.groupby('Subject')['Message-ID'].apply(list)
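
As a minimal sketch of how that one-liner produces the two-column dataframe described in the question: the toy data below is made up, the column names `Subject` and `Message-ID` are taken from the question's code, and `emailID` is the name the question uses for the list column.

```python
import pandas as pd

# Toy data (invented); column names follow the question's code.
df = pd.DataFrame({
    "Subject": ["hello", "meeting", "hello", "hello"],
    "Message-ID": ["id1", "id2", "id3", "id4"],
})

# One row per distinct subject, with all matching IDs collected in a list.
# sort=False keeps subjects in order of first appearance.
bindf = (
    df.groupby("Subject", sort=False)["Message-ID"]
      .apply(list)
      .reset_index(name="emailID")
)
print(bindf)
```

This makes a single pass over the data instead of comparing every email against every subject seen so far, which is where the nested loops spend their time.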

However, subjects will most likely differ in wording even when they mean the same thing. If that is the case, you might want to normalize the subject first (lowercase everything, strip extra whitespace, remove punctuation, etc.).
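
One possible way to apply such transforms before grouping is sketched here. The helper `normalize_subject` and the `subject_key` column are hypothetical names, and stripping "Re:"/"Fwd:" prefixes is an extra assumption about what counts as "the same subject"; adjust the rules to your data.

```python
import re
import pandas as pd

# Toy data (invented); column names follow the question's code.
df = pd.DataFrame({
    "Subject": ["Hello!", "  hello ", "RE: Hello", "Budget 2018"],
    "Message-ID": ["id1", "id2", "id3", "id4"],
})

def normalize_subject(subject: str) -> str:
    """Hypothetical helper: build a comparison key from a raw subject."""
    s = subject.lower().strip()
    s = re.sub(r"^((re|fwd?)\s*:\s*)+", "", s)  # assumption: drop reply/forward prefixes
    s = re.sub(r"[^\w\s]", "", s)               # remove punctuation
    return re.sub(r"\s+", " ", s).strip()       # collapse runs of whitespace

# Group on the normalized key rather than on the raw subject text.
df["subject_key"] = df["Subject"].map(normalize_subject)
grouped = df.groupby("subject_key")["Message-ID"].apply(list)
print(grouped)
```

Keeping the key in its own column leaves the original subjects untouched, so you can still inspect which raw subjects ended up in each group.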

Otherwise, you can build filters on the subject, such as "contains X".
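
A "contains X" filter could look roughly like this. The keyword list is hypothetical, the data is invented, and note that with this approach an email can land in more than one group (or in none).

```python
import pandas as pd

# Toy data (invented); column names follow the question's code.
df = pd.DataFrame({
    "Subject": ["Hello team", "Weekly report", "hello again", "Report Q2"],
    "Message-ID": ["id1", "id2", "id3", "id4"],
})

keywords = ["hello", "report"]  # hypothetical keywords to filter on

# Lowercase once so the substring match is case-insensitive.
subjects = df["Subject"].str.lower()

rows = []
for kw in keywords:
    # regex=False treats the keyword as a plain substring, not a pattern.
    mask = subjects.str.contains(kw, regex=False, na=False)
    rows.append({"keyword": kw, "emailID": df.loc[mask, "Message-ID"].tolist()})

by_keyword = pd.DataFrame(rows)
print(by_keyword)
```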

Another reasonable approach would be to apply bag-of-words or word2vec features and then cluster the subjects to form the groups.
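
To give a rough idea of the bag-of-words route without extra dependencies, here is a deliberately simple stand-in: each subject becomes a set of words, and a greedy pass assigns it to the first cluster whose representative is similar enough (Jaccard overlap). This is not word2vec or a standard clustering algorithm such as k-means; the helper names and the 0.5 threshold are assumptions for illustration only.

```python
import re
import pandas as pd

# Toy data (invented); column names follow the question's code.
df = pd.DataFrame({
    "Subject": ["Quarterly budget review", "Budget review quarterly update",
                "Lunch on Friday", "Friday lunch?"],
    "Message-ID": ["id1", "id2", "id3", "id4"],
})

def bag_of_words(text):
    """Set of lowercase word tokens (word counts are ignored in this sketch)."""
    return frozenset(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Overlap between two token sets: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

THRESHOLD = 0.5        # assumed cut-off; tune on real data
representatives = []   # token set of the first subject seen in each cluster
labels = []
for tokens in df["Subject"].map(bag_of_words):
    for cluster_id, rep in enumerate(representatives):
        if jaccard(tokens, rep) >= THRESHOLD:
            labels.append(cluster_id)
            break
    else:
        # No existing cluster is close enough, so start a new one.
        representatives.append(tokens)
        labels.append(len(representatives) - 1)

df["cluster"] = labels
clusters = df.groupby("cluster")["Message-ID"].apply(list)
print(clusters)
```

Each subject is compared against every existing cluster representative, so the cost grows with the number of clusters; for 90,000 emails with many distinct subjects, a vectorized library implementation would likely be a better fit.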

Hope that helps.

epattaro