
I have a large (~20 million rows) dataframe containing the posts made by users on various communities. The columns include 'community' and 'text'. Each row corresponds to one post; each community typically has between one thousand and several hundred thousand associated posts.
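
For concreteness, a toy frame with the same shape looks like this (the column names match the description above; the rows are made up for illustration):

import pandas as pd

# miniature stand-in for the real ~20-million-row frame
data = pd.DataFrame({
    'community': ['python', 'python', 'cooking'],
    'text': ['I love pandas!', 'Group by community first.', 'Best pasta recipe?'],
})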

I would like to output a dictionary that maps each community name to a Counter recording how many times each token occurs in that community's posts. I thought the easiest way to do this would be to group the Pandas dataframe by community, concatenate each group's 'text' into a single string, clean and tokenize it, and apply a Counter, as follows:

import string
from collections import Counter

import nltk
from nltk.corpus import stopwords

def run(data):
    counts = {}
    data_grouped = data.groupby('community')
    for community, group in data_grouped:
        # join the community's posts, separating with a space so words don't fuse
        community_doc = group['text'].str.cat(sep=' ')
        community_doc = community_doc.lower()
        # Python 3 spelling of community_doc.translate(None, string.punctuation)
        community_doc = community_doc.translate(
            str.maketrans('', '', string.punctuation))
        tokens = nltk.word_tokenize(community_doc)
        tokens = [token for token in tokens if token not in stopwords.words('english')]
        count = Counter(tokens).most_common()
        counts[community] = count
    return counts

However, the code runs incredibly slowly -- often 20-30 minutes per community. Is there a more efficient way to do this?

user9161710
    Well, one low-hanging fruit would be to use `sw = set(stopwords.words('english'))` at the top, then use `tokens = [token for token in tokens if not token in sw]` – juanpa.arrivillaga Jan 01 '18 at 23:02
  • Another thing you might want to consider is whether calling `most_common()` is really necessary, because that will slow you down as well. – juanpa.arrivillaga Jan 01 '18 at 23:04 (both of these suggestions are folded into the sketch below)
  • You're making the same mistake as https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe/47788736#47788736 by looping through the text multiple times for no reason. – alvas Jan 02 '18 at 02:02
  • I'm curious about the source of this similar code that gives such bad examples of preprocessing text in this manner; you're the 5th or 6th person asking the same question this month. @user9161710, could you tell us where you got the example code to preprocess the text from? – alvas Jan 02 '18 at 02:03
  • Please see https://stackoverflow.com/questions/47769818/why-is-my-nltk-function-slow-when-processing-the-dataframe and https://stackoverflow.com/questions/48049087/applying-nltk-based-text-pre-proccessing-on-a-pandas-dataframe and https://www.kaggle.com/alvations/basic-nlp-with-nltk – alvas Jan 02 '18 at 02:18
  • @alvas thanks for these great resources. I had based the code on https://www2.cs.duke.edu/courses/spring14/compsci290/assignments/lab02.html. I'll give your solution a try. – user9161710 Jan 02 '18 at 02:30
  • Thanks for the link! Please do read the solutions, understand the steps, and apply them to your data. Different data has different levels of noise and requires different preprocessing steps. Have fun! – alvas Jan 02 '18 at 02:31
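
Update: pulling juanpa.arrivillaga's two suggestions together, here is a rough sketch of the revised function (untested on the full data; it assumes the NLTK stopwords and punkt corpora are downloaded, and `run_suggested` is just an illustrative name):

import string
from collections import Counter

import nltk
from nltk.corpus import stopwords

def run_suggested(data):
    counts = {}
    sw = set(stopwords.words('english'))  # built once; set lookups are O(1)
    table = str.maketrans('', '', string.punctuation)
    for community, group in data.groupby('community'):
        community_doc = group['text'].str.cat(sep=' ').lower()
        community_doc = community_doc.translate(table)
        tokens = nltk.word_tokenize(community_doc)
        # plain Counter, no most_common(): no need to sort each community's vocabulary
        counts[community] = Counter(t for t in tokens if t not in sw)
    return counts

The main win is hoisting `stopwords.words('english')` out of the membership test: in the original list comprehension it is re-evaluated for every token, and each membership check against a list is a linear scan, whereas lookups in the set `sw` are constant time.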
