I'm looking for some help with NLTK, or any other library that could help me solve the problem I'm facing.
I'm no Python expert (I only started learning Python 4 months ago), but I've done a fair amount of research before asking for help:
Tokenizing words into a new column in a pandas dataframe
Passing a pandas dataframe column to an NLTK tokenizer, etc.
Here's what I have: a dataframe containing the search queries our students run on our campus website.
It looks a bit like this:
session | student_query
2020-05-15 09:34:21 | exams session june 2020
2020-05-15 09:41:12 | when are the exams?
2020-05-15 09:59:51 | exams.
2020-05-15 10:02:18 | what's my teacher's email address
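For reference, here's a minimal snippet that should rebuild the dataframe above (the column names session and student_query are just my reading of the table):

import pandas as pd

# Rebuild the sample dataframe shown above
df = pd.DataFrame({
    'session': ['2020-05-15 09:34:21', '2020-05-15 09:41:12',
                '2020-05-15 09:59:51', '2020-05-15 10:02:18'],
    'student_query': ['exams session june 2020', 'when are the exams?',
                      'exams.', "what's my teacher's email address"],
})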
What I would like is one big flat list that looks like: ['exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address'] ===> one flat list with all the words (no sentences) and no punctuation.
I have tried:
import nltk
from nltk.tokenize import word_tokenize

tokens = df['student_query'].apply(word_tokenize)
text = nltk.Text(tokens)
===> that gives me a separate list of tokens for each row
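I suppose I could flatten those per-row lists into one (a rough sketch building on tokens from above), but punctuation like '?' and '.' still ends up as tokens:

import itertools

# Flatten the Series of per-row token lists into one list;
# punctuation tokens such as '?' and '.' are still in there
flat = list(itertools.chain.from_iterable(tokens))
print(flat)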
sentences = pd.Series(df['student_query'])
sentences = sentences.str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()
splitwords = [nltk.word_tokenize(str(sentence)) for sentence in sentences]
print(splitwords)
===> a bit better, but not what I want either
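The closest I've gotten is stripping the punctuation first (keeping digits so '2020' survives) and then splitting everything in one go, but I'm not sure this is the idiomatic NLTK way, hence the question:

# Replace everything except letters, digits and spaces,
# then join all rows and split once into one flat word list
cleaned = df['student_query'].astype(str).str.replace('[^A-Za-z0-9 ]', ' ', regex=True)
flat_words = ' '.join(cleaned).split()
print(flat_words)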