
I am trying to write a list of NLTK stop words to a file.

So, I wrote this script:

import nltk
from nltk.corpus import stopwords
from string import punctuation

file_name = 'OUTPUT.CSV'
file = open(file_name, 'w+')  
_stopwords = set(stopwords.words('english')+list(punctuation)) 
i = 0
file.write(f'\n\nSTOP WORDS:+++\n\n')
for w in _stopwords:
    i=i+1
    out1 = f'{i:3}. {w}\n'
    out2 = f'{w}\n'
    out3 = f'{i:3}. {w}'
    file.write(out2)
    print(out3)

file.close()

The original program used file.write(w), but since I encountered problems, I started trying things.

So, I tried using file.write(out1). That works, but the order of the stop words appears to be random.

What's interesting is that if I use file.write(out2), only some of the stop words get written: a different number on each run, in seemingly random order, and always fewer than 211. I see the same problem in both Visual Studio 2017 and Jupyter Notebook.

For example, the last run wrote 175 words ending with:

its
wouldn
shan 

Using file.write(out1) I get all 211 words and the column ends like this:

209. more
210. have
211. ,

Has anyone run into a similar problem? Any idea what may be going on?

I'm new to Python/NLTK so I decided to ask.

kHarshit
Luke

1 Answer


The reason you are getting the stop words in a random order is the use of set:

_stopwords = set(stopwords.words('english')+list(punctuation)) 

A set is an unordered collection with no duplicate elements.

Unlike a list, where the elements are stored in order, the order of elements in a set is undefined. The elements are usually not stored in the order in which they were added; this is what lets a set check whether it contains an element faster than scanning through every element would.

You can use this simple example to check this:

test = set('abcd')
for i in test: 
    print(i) 

The output order can differ. For example, I tried it on two different systems. On the first system I got:

a
d
b
c

and on the second system:

d
c
a
b
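As a small sketch of what "unordered" means in practice: two sets with the same elements compare equal no matter how they happen to iterate, and passing a set through sorted() gives a list whose order is the same on every run. The variable names here are only illustrative.

```python
# Two sets built from the same characters are equal,
# whatever order they happen to iterate in.
letters = set('abcd')
same_letters = set('dcba')
print(letters == same_letters)  # True

# sorted() returns a list, so the order is fixed and repeatable.
ordered = sorted(letters)
print(ordered)  # ['a', 'b', 'c', 'd']
```

If alphabetical order is acceptable for the stop words, looping over sorted(_stopwords) instead of _stopwords would be enough to make the output repeatable.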

If you need a collection that removes duplicates but still keeps a defined order, there are ordered-set alternatives.
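One such alternative, sketched here with only the standard library, is to deduplicate with dict.fromkeys, since dictionaries preserve insertion order in Python 3.7 and later. The short words list is a hypothetical stand-in for stopwords.words('english') so that the example runs without the NLTK data.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'),
# with duplicates added on purpose.
words = ['i', 'me', 'my', 'me', 'i']

# dict.fromkeys keeps the first occurrence of each key,
# in insertion order (guaranteed from Python 3.7 on).
unique_in_order = list(dict.fromkeys(words + list(punctuation)))

print(unique_in_order[:3])   # ['i', 'me', 'my']
print(len(unique_in_order))  # 35: 3 words + 32 punctuation characters
```

Unlike sorted(), this keeps the words in the order NLTK returns them, followed by the punctuation characters.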


Besides, I've checked that all three of out1, out2, and out3 give 211 stop words.
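A way to run that kind of check yourself is to write the tokens and then count the lines by reading the file back in Python, rather than relying on how an editor or viewer displays it. This is only a sketch: write_tokens and count_lines are hypothetical helper names, and the tokens set is a small stand-in for the real stop word list.

```python
import os
import tempfile
from string import punctuation


def write_tokens(path, tokens):
    """Write one token per line, in sorted order; return the number written."""
    count = 0
    # The with-block closes (and flushes) the file even if an error occurs.
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        for count, token in enumerate(sorted(tokens), start=1):
            f.write(f'{token}\n')
    return count


def count_lines(path):
    """Count the lines actually present in the file."""
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)


# Stand-in data: three sample words plus the 32 punctuation characters.
tokens = {'a', 'the', "mustn't"} | set(punctuation)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'OUTPUT.CSV')
    written = write_tokens(path, tokens)
    on_disk = count_lines(path)

print(written, on_disk)  # 35 35
```

If the two numbers match for the real list, every line is in the file, and any shortfall is coming from whatever is being used to look at it.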

kHarshit
  • Thank you, harshit_k! That makes perfect sense. The 3 outputs are different for me, though. I wish I could attach the files. out1, when written to a file, starts with 1. yours, 2. a, 3. mustn't ... and ends with 209. above, 210. re, 211. (211 words, indeed). out2, though, starts with yours, a, mustn't ... and ends with ourselves, now, {, for a total of 56 words! If I use file.write(w+'\n') I get 38 words, starting with that, an, mustn't and ending with himself, that'll, under. I will try re-installing my environment. – Luke Jan 27 '19 at 21:08
  • Digging deeper, the problem is with one of the 32 punctuation characters. This works fine: set(stopwords.words('english')). This does not: set(list(punctuation)). I tried encoding the file as UTF-8, but that doesn't help either. – Luke Jan 27 '19 at 21:39
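The thread does not establish which punctuation character is responsible, or why. One possibility, and it is only an assumption, is that because the file is named OUTPUT.CSV it is being opened by something that parses it as CSV, in which case characters such as " and , have special meaning at the start of a field. Under that assumption, a sketch using the standard csv module shows how each character can be written so that it reads back unchanged. An in-memory buffer is used here in place of the real file.

```python
import csv
import io
from string import punctuation

tokens = sorted(punctuation)

# In-memory stand-in for the output file; newline='' is what the csv
# module documentation recommends so line endings are not translated.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
for token in tokens:
    # csv.writer quotes fields containing '"' or ',' so that they
    # cannot be mistaken for CSV syntax.
    writer.writerow([token])

# Read the rows back and confirm that nothing was lost or merged.
buffer.seek(0)
read_back = [row[0] for row in csv.reader(buffer)]

print(len(read_back))       # 32
print(read_back == tokens)  # True
```

If the output is not meant to be CSV at all, giving the file a .txt name and viewing it in a plain text editor is another way to test the same assumption.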