I have a large (~20 million row) dataframe containing the posts made by users in various communities. The columns include 'community' and 'text'. Each row corresponds to a post, and each community typically has between one thousand and several hundred thousand associated posts.
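For concreteness, a toy frame with the same layout (the community names and posts here are made up purely for illustration) would look like this:

import pandas as pd

# Toy stand-in for the real ~20 million row frame: one row per post
data = pd.DataFrame({
    'community': ['cats', 'cats', 'dogs'],
    'text': ['I love my cat!', 'Cats are great.', 'Dogs are better than cats.'],
})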
I would like to output a dictionary with community names as keys and, as values, Counters giving the number of occurrences of each token in that community. I thought the easiest way to do this would be to group the dataframe by community, concatenate each group's 'text' into a single string, clean and tokenize it, and apply a Counter, as follows:
import string
import nltk
from collections import Counter
from nltk.corpus import stopwords

def run(data):
    counts = {}
    data_grouped = data.groupby('community')
    for community, group in data_grouped:
        # Concatenate every post in the community into one document
        community_doc = group['text'].str.cat(sep='')
        community_doc = community_doc.lower()
        # Strip punctuation (Python 2 str.translate signature)
        community_doc = community_doc.translate(None, string.punctuation)
        tokens = nltk.word_tokenize(community_doc)
        # Drop English stopwords
        tokens = [token for token in tokens if token not in stopwords.words('english')]
        count = Counter(tokens).most_common()
        counts[community] = count
    return counts
However, the code runs incredibly slowly, often taking 20-30 minutes per community. Is there a more efficient way to do this?
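Would something along these lines be a step in the right direction? This is a rough, untested sketch (written against Python 3): the names run_fast, STOPWORDS and PUNCT_TABLE are just illustrative, and I am assuming that tokenizing post by post gives essentially the same counts as concatenating each community first.

import string
import nltk
from collections import Counter
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))               # build the stopword set once, not per token
PUNCT_TABLE = str.maketrans('', '', string.punctuation)   # Python 3 table for stripping punctuation

def run_fast(data):
    counts = {}
    for community, group in data.groupby('community'):
        counter = Counter()
        # Tokenize post by post instead of building one giant string per community
        for text in group['text']:
            text = text.lower().translate(PUNCT_TABLE)
            counter.update(token for token in nltk.word_tokenize(text)
                           if token not in STOPWORDS)
        counts[community] = counter.most_common()
    return counts

My hope is that looking up stopwords in a set and avoiding the huge intermediate string helps, but I don't know whether the tokenization itself is the real bottleneck.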