
I am processing data from a JSON file for machine learning. The data are sentences, which are read into an array and tokenized with NLTK. Each sentence array then looks something like ['set', 'a', 'timer', 'for', '*int', '*unit_of_time'], which is correct. I would like to remove all elements that contain a '*'. This works about 90% of the time, but if two elements containing a '*' appear in succession, the second one is left behind. So if I run:

# In the real code: words = nltk.word_tokenize(pattern)
words = ['set', 'a', 'timer', 'for', '*int', '*unit_of_time']
for word in words:
    if '*' in word:
        words.remove(word)  # removes from the list being iterated

I am left with words = ['set', 'a', 'timer', 'for', '*unit_of_time'], but should be left with words = ['set', 'a', 'timer', 'for']. The loop successfully removes '*int', but not '*unit_of_time'.

Am I doing this incorrectly? I am using Python 3.7 on Ubuntu 19.10.

If I can provide any additional information, please let me know.

marc.soda
  • 1
    Don't change the length of a list while iterating over it... – jonrsharpe Apr 06 '20 at 18:45
  • 3
    Use a list comprehension. `words = [word for word in words if '*' not in word]` – Axe319 Apr 06 '20 at 18:50
  • 1
    To expand on this: when you iterate over a list, the loop visits the 1st element, then the 2nd, and so on. If you remove the 2nd element, the list is one element shorter and the old 3rd element gets "bumped" down into the 2nd position, which the loop has already visited, so it is never checked. – Axe319 Apr 06 '20 at 18:57
  • 1
    You can read better answers here. https://stackoverflow.com/questions/1207406/how-to-remove-items-from-a-list-while-iterating – Axe319 Apr 06 '20 at 19:00
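
To make the comments above concrete, here is a minimal sketch that reproduces the skipped element and then applies the list comprehension suggested by Axe319. The function names `remove_starred_buggy` and `remove_starred` are hypothetical, chosen only for this illustration, and the token list is the one from the question rather than real `nltk.word_tokenize` output.

```python
def remove_starred_buggy(words):
    """Reproduce the loop from the question on a copy of the input."""
    words = list(words)  # copy, so the caller's list is left alone
    for word in words:
        if '*' in word:
            # Removing shifts every later item one slot to the left,
            # so the loop's internal index steps over the next item.
            words.remove(word)
    return words


def remove_starred(words):
    """Build a new list instead of mutating the one being iterated."""
    return [word for word in words if '*' not in word]


tokens = ['set', 'a', 'timer', 'for', '*int', '*unit_of_time']

print(remove_starred_buggy(tokens))  # ['set', 'a', 'timer', 'for', '*unit_of_time']
print(remove_starred(tokens))        # ['set', 'a', 'timer', 'for']
```

If other code holds a reference to the same list and the change has to happen in place, assigning to a slice (`words[:] = [w for w in words if '*' not in w]`) should work as well, since the new list is fully built before the old contents are replaced.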

0 Answers