Hello Community Members,

I would like to output the 1000 most frequently used words, with their frequencies, from a Gensim Word2Vec model. However, I am not interested in certain words, so I filter them out using numpy (np.setdiff1d). After that I create a new string using '\n'.join, but now every time I look up an entry, '\n' appears in front of the word (e.g. \nhouse instead of house), so I get a KeyError.

I tried to work around it by saving the list (corpus_words) as a .txt file and reading it back using with open(...), but even then there is a \n in front of each entry when I try to get the frequency of the word.

Using a print statement before "\n".join(new_list) did not help either.

Is there any way to fix this?

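from gensim.models import Word2Vec

Model_Pfad = r'D:\OneDrive\Phyton\modelC.model'
# Raw string, so the backslashes in the Windows path are taken literally
ausgabe = open(r'D:\OneDrive\Phyton\wigbelsZahlen.txt', 'w')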

model = Word2Vec.load(Model_Pfad)


x = list(model.wv.index_to_key[:1000])

stop_words = set([
    "an",
    "as",
    "art",
    "ab",
    "al",
    "aber",
    "abk.",
    "alle",
    "allem",
    "allen",
    "aller",
    "alles",
    "allg.",
])

new_list = [item for item in x if item not in stop_words]

for i in new_list:
    result = model.wv.get_vecattr(i, "count")
    ausgabe.write(i + '\t' + str(result))
    ausgabe.write('\n')
ausgabe.close()  # the parentheses are needed to actually call close()

Reijarmo

1 Answer

First, np.setdiff1d() is a somewhat odd way to remove items from a list. More typical would be to use a list comprehension:

stop_words = set(['an', 'as', 'art', 'ab', 'al'])
new_list = [item for item in x if item not in stop_words]

Second, your code as currently shown then uses .join to re-compose all the words into one big string, with '\n' between them, and appends that one big string to a file.

So of course that's all that'll be in the file.

Also, that one big corpus_words string is not going to be a good argument for .get_vecattr(), which wants a single word key. (I'd expect your line model.wv.get_vecattr(corpus_words, "count") to KeyError before any printing-to-file is even attempted.)
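
To illustrate the point about the joined string, here is a minimal sketch. It assumes a plain dict called counts as a hypothetical stand-in for the model's vocabulary (with a real Gensim 4 model the lookup would be model.wv.get_vecattr(word, "count")), and the example words and numbers are made up:

```python
# Hypothetical stand-in for the model's per-word counts (made-up numbers).
# With a real Gensim 4 model: model.wv.get_vecattr(word, "count")
counts = {"haus": 120, "baum": 75, "weg": 40}

new_list = ["haus", "baum", "weg"]

# join() collapses the whole list into ONE string with newlines inside it.
corpus_words = "\n".join(new_list)
print(repr(corpus_words))

# That single string is not a vocabulary key, so the lookup fails.
try:
    counts[corpus_words]
    joined_lookup_ok = True
except KeyError:
    joined_lookup_ok = False
print("lookup with joined string worked:", joined_lookup_ok)

# Looking up each word of the list individually works.
per_word = [counts[word] for word in new_list]
print(per_word)
```

The list itself, not the joined string, is the thing to iterate over when looking up individual words.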

There's nothing in your code as shown which would remove the '\n' characters, nor anything that would add the frequency numbers, nor re-read the file in any way or look up frequencies in any way. Is some of the code still missing?

Is your ultimate goal simply to have a text-file report of the 1,000 most common words, or to be able to look up individual frequencies in later code?

gojomo
  • I had tried the typical way first, but it did not remove the words, hence the rather untypical variant (I am a psychologist, not a programmer, so I ask for pardon if it often seems unorthodox). It should be \n, to make sure that every word is on a new line. The per-line iteration was missing from the code, I am sorry; I have added it now. The goal is to output the individual frequencies of the words in the corpus, because I need them for a statistical calculation. Reference was this post: https://stackoverflow.com/questions/37190989/how-to-get-vocabulary-word-count-from-gensim-word2vec – Reijarmo Jul 19 '21 at 11:19
  • Do you need the list of words, *without* the frequencies, in a file? If not, there's no need to write them to a file from a big `corpus_words` string you've added `'\n'` characters to - you already have the list you need to iterate over for your next step, in the `new_list` variable. (And even if you did want them in a file, doing it in `'a'` append-mode risks mistakenly adding the words to a leftover file from a previous run.) – gojomo Jul 19 '21 at 14:06
  • Thanks for the explanation, I will replace the 'a' with a 'w'. I want to use the words together with their frequencies. – Reijarmo Jul 19 '21 at 15:03
  • Then I suggest you don't create `corpus_words` or write it to a separate file. Instead loop over your `new_list` to write the final file. – gojomo Jul 19 '21 at 16:54
  • Your suggestion to filter the list differently helped tremendously, but in my .txt file I see only the numbers, not the words. Is there a straightforward way, or do I have to load the word list and the number list as lists again and have them written side by side in a new file? I added the new code. Thank you again so much for your help – Reijarmo Jul 20 '21 at 12:08
  • I only see a line of code in your loop to write the `result` value. If you want the word there too, you'll need to write it as well, either right before or after the number, perhaps with some other delimiter (like space or comma) between them. – gojomo Jul 20 '21 at 22:51
  • Oh right, sorry, that was stupid of me. Now everything works fine. Thank you so much for your help – Reijarmo Jul 21 '21 at 15:59
  • Sometimes it just takes another set of eyes! – gojomo Jul 21 '21 at 16:06
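
The approach settled on in the comments (loop over new_list and write the word and its count on the same line) can be sketched roughly as follows. This is only an illustration under stated assumptions: counts is a made-up dict standing in for model.wv.get_vecattr(word, "count"), top_words stands in for model.wv.index_to_key[:1000], and write_frequencies and read_frequencies are hypothetical helper names, not part of Gensim.

```python
import os
import tempfile

# Hypothetical stand-ins (made-up data) for the Gensim model:
#   top_words  ~ model.wv.index_to_key[:1000]
#   counts[w]  ~ model.wv.get_vecattr(w, "count")
counts = {"an": 300, "aber": 210, "haus": 120, "baum": 75, "weg": 40}
top_words = ["an", "aber", "haus", "baum", "weg"]  # most frequent first
stop_words = {"an", "aber"}

# Filter with a list comprehension; the frequency order is preserved.
new_list = [w for w in top_words if w not in stop_words]


def write_frequencies(path, words, lookup):
    """Write one 'word<TAB>count' line per word."""
    # 'w' mode starts a fresh file; the with-block closes it automatically.
    with open(path, "w", encoding="utf-8") as out:
        for word in words:
            out.write(f"{word}\t{lookup(word)}\n")


def read_frequencies(path):
    """Read the file back into a dict, stripping the trailing newline."""
    freqs = {}
    with open(path, encoding="utf-8") as src:
        for line in src:
            word, count = line.rstrip("\n").split("\t")
            freqs[word] = int(count)
    return freqs


path = os.path.join(tempfile.mkdtemp(), "frequencies.txt")
write_frequencies(path, new_list, counts.__getitem__)
freqs = read_frequencies(path)
print(freqs)
```

Using a with-block avoids the forgotten close() call, and stripping the newline when reading back keeps stray '\n' characters out of the word keys.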