
I'm looking for some help with NLTK, or any other library that could help me solve the problem I'm facing.

I'm no Python expert (I actually only started learning Python 4 months ago), but I've done quite a bit of research before asking for help:

Tokenizing words into a new column in a pandas dataframe

Passing a pandas dataframe column to an NLTK tokenizer etc...


Here's what I have: a dataframe that records the queries students type into the search box of our campus website.

It looks a bit like this:

session             | student_query
2020-05-15 09:34:21 | exams session june 2020
2020-05-15 09:41:12 | when are the exams?
2020-05-15 09:59:51 | exams.
2020-05-15 10:02:18 | what's my teacher's email address
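
For reference, here's a minimal snippet that rebuilds that sample (just the two columns shown), in case anyone wants to reproduce it:

import pandas as pd

# minimal reproduction of the sample dataframe above
df = pd.DataFrame({
    'session': ['2020-05-15 09:34:21', '2020-05-15 09:41:12',
                '2020-05-15 09:59:51', '2020-05-15 10:02:18'],
    'student_query': ['exams session june 2020', 'when are the exams?',
                      'exams.', "what's my teacher's email address"],
})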

What I would like to have is one big list that looks like: ['query', 'exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address'] ===> one flat list, all the words (no sentences), no punctuation.

I have tried:

import nltk  # word_tokenize needs the 'punkt' data: nltk.download('punkt')
from nltk.tokenize import word_tokenize

tokens = df['student_query'].apply(word_tokenize)  # one token list per row
text = nltk.Text(tokens)

===> that gives me a separate list of tokens for each row, not the one flat list I'm after
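
(I realise those per-row lists could be flattened, something like the rough sketch below, but the punctuation tokens like '?' and '.' would still be in there:)

from itertools import chain

flat_tokens = list(chain.from_iterable(tokens))  # one flat list, but '?' and '.' survive as tokens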

sentences = pd.Series(df['student_query'])
# note: [^A-z] also matches [, \, ], ^, _ and the backtick; [^A-Za-z ] is the intended class
sentences = sentences.str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()
splitwords = [nltk.word_tokenize(str(sentence)) for sentence in sentences]
print(splitwords)
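
That prints a nested list, something like [['exams', 'session', 'june'], ['when', 'are', 'the', 'exams'], ['exams'], ['whats', 'my', 'teachers', 'email', 'address']] ===> still one list per row, and the digits ('2020') get stripped out too.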

===> a bit better, but not what I want either

1 Answer

You can just do this:

# swap ?, . and ' for spaces, then split on whitespace
# (regex=True is needed on newer pandas; the default flipped to regex=False in pandas 2.0)
df['student_query'] = df['student_query'].str.replace(r'\?|\.|\'', ' ', regex=True)
list_of_words = ' '.join(df['student_query']).split()
print(list_of_words)

['exams', 'session', 'june', '2020', 'when', 'are', 'the', 'exams', 'exams', 'what', 's', 'my', 'teacher', 's', 'email', 'address']
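
If you'd rather do the tokenising with NLTK itself, a minimal sketch along the same lines (run against the original, unmodified column) is a RegexpTokenizer that keeps runs of word characters, so the punctuation never makes it into the list:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # keep runs of letters/digits; drops ?, . and '
list_of_words = [word for query in df['student_query'] for word in tokenizer.tokenize(query)]

Either way you should get the same list for the sample data above.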
  • This is exactly it! Thanks very much dude! :) I had gone for a for loop thing, but what you wrote makes much more sense. Thanks, really! – Louloumonkey May 31 '20 at 15:16