
I am working with NLTK and prototyping a project I have in mind. I come from PHP, so Python is still a little unfamiliar to me.

I have a list of stopwords and an n-word string, n being between 1 and 4.

I want to clean that string by trimming any stopwords from both ends. After I remove a stopword, I need to retest the string, because there might be another one right after it.

What is an efficient way to do that in Python?

gincard
Lazhar
  • what about: http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python – jmunsch Dec 04 '16 at 10:04

1 Answer


Tokenize the string into words.

Use set membership operators, which are quick, to eliminate leading/trailing tokens while they match the list of stopwords.

If the next step really needs a string, then join the list of words back into one with the idiomatic `' '.join(your_list)`.
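The three steps above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `STOPWORDS` set is a made-up sample, the function name `strip_stopwords` is hypothetical, and tokenizing with `str.split()` plus lowercasing for comparison are assumptions (an NLTK tokenizer and stopword list could be swapped in).

```python
# Hypothetical sample list; in practice this would come from your own stopword source.
STOPWORDS = frozenset({"the", "of", "a", "an", "in", "and"})


def strip_stopwords(phrase, stopwords=STOPWORDS):
    """Trim stopwords from both ends of a phrase; inner stopwords are kept."""
    tokens = phrase.split()  # assumption: plain whitespace tokenization
    start, end = 0, len(tokens)

    # Skip leading stopwords by advancing an index instead of deleting items.
    while start < end and tokens[start].lower() in stopwords:
        start += 1

    # Skip trailing stopwords the same way, from the other end.
    while end > start and tokens[end - 1].lower() in stopwords:
        end -= 1

    # Slice once, then join back into a string.
    return ' '.join(tokens[start:end])
```

For example, `strip_stopwords("the king of the")` returns `'king'`, while `strip_stopwords("lord of the rings")` is returned unchanged because its stopwords are not at either end. The loops keep testing until a non-stopword is found, which covers the "retest after removing" requirement.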

Peteris
    Set membership is the clue here. `set.__contains__()` is a constant time operation vs. `list.__contains__()` which is linear time. Also, if your tokens are in a `list`, deleting elements from the front of the list is a linear time operation, so you could get better performance by optimizing how you strip leading stopwords. – Håken Lid Dec 04 '16 at 11:21
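One way to act on the point about front deletions, sketched under assumptions: `collections.deque` supports constant-time removal from both ends, so the tokens can be trimmed without shifting the remaining items. The function name `strip_stopwords_deque` is hypothetical, and the stopwords are assumed to already be in a `set`. For phrases of 1 to 4 words the difference is unlikely to be measurable; it only matters for long token lists.

```python
from collections import deque


def strip_stopwords_deque(tokens, stopwords):
    """Return the tokens with leading and trailing stopwords removed."""
    words = deque(tokens)

    # popleft() is O(1) on a deque, unlike list.pop(0), which is O(n).
    while words and words[0] in stopwords:
        words.popleft()

    # pop() from the right end is O(1) as well.
    while words and words[-1] in stopwords:
        words.pop()

    return list(words)
```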