
I'm looking for some help with NLTK, or any other library that could help me solve the problem I'm facing.

I'm no Python expert (I actually only started learning Python 4 months ago), but I've done quite a bit of research before asking for help:

Tokenizing words into a new column in a pandas dataframe

Passing a pandas dataframe column to an NLTK tokenizer etc...


Here's what I have: a dataframe that records the queries students type into the search box of our campus website.

It looks a bit like this:

session             | student_query
2020-05-15 09:34:21 | exams session june 2020
2020-05-15 09:41:12 | when are the exams?
2020-05-15 09:59:51 | exams.
2020-05-15 10:02:18 | what's my teacher's email address
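
For reference, here's a minimal snippet that rebuilds that sample (just the two columns shown), in case anyone wants to reproduce it:

import pandas as pd

# minimal reproduction of the sample dataframe above
df = pd.DataFrame({
    'session': ['2020-05-15 09:34:21', '2020-05-15 09:41:12',
                '2020-05-15 09:59:51', '2020-05-15 10:02:18'],
    'student_query': ['exams session june 2020', 'when are the exams?',
                      'exams.', "what's my teacher's email address"],
})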

What I would like to have is one big list that looks like: ['query', 'exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address'] ===> one flat list, all the words (no sentences), no punctuation.

I have tried:

import nltk  # word_tokenize needs the 'punkt' data: nltk.download('punkt')
from nltk.tokenize import word_tokenize

tokens = df['student_query'].apply(word_tokenize)  # one token list per row
text = nltk.Text(tokens)

===> that gives me a separate list of tokens for each row, not the one flat list I'm after
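
(I realise those per-row lists could be flattened, something like the rough sketch below, but the punctuation tokens like '?' and '.' would still be in there:)

from itertools import chain

flat_tokens = list(chain.from_iterable(tokens))  # one flat list, but '?' and '.' survive as tokens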

sentences = pd.Series(df['student_query'])
# note: [^A-z] also matches [, \, ], ^, _ and the backtick; [^A-Za-z ] is the intended class
sentences = sentences.str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()
splitwords = [nltk.word_tokenize(str(sentence)) for sentence in sentences]
print(splitwords)
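
That prints a nested list, something like [['exams', 'session', 'june'], ['when', 'are', 'the', 'exams'], ['exams'], ['whats', 'my', 'teachers', 'email', 'address']] ===> still one list per row, and the digits ('2020') get stripped out too.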

===> a bit better, but not what I want either

1 Answer

You can just do this:

# swap ?, . and ' for spaces, then split on whitespace
# (regex=True is needed on newer pandas; the default flipped to regex=False in pandas 2.0)
df['student_query'] = df['student_query'].str.replace(r'\?|\.|\'', ' ', regex=True)
list_of_words = ' '.join(df['student_query']).split()
print(list_of_words)

['exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address']
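
If you'd rather do the tokenising with NLTK itself, a minimal sketch along the same lines (run against the original, unmodified column) is a RegexpTokenizer that keeps runs of word characters, so the punctuation never makes it into the list:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # keep runs of letters/digits; drops ?, . and '
list_of_words = [word for query in df['student_query'] for word in tokenizer.tokenize(query)]

Either way you should get the same list for the sample data above.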
  • This is exactly it! Thanks very much dude! :) I had gone for a for loop thing, but what you wrote makes much more sense. Thanks, really! – Louloumonkey May 31 '20 at 15:16