
I am trying to use a text corpus file (one sentence per line) to extract word co-occurrences so that I can use them in later processing. How can I extract (statistical) word co-occurrences from a large corpus file using gensim, and how can I use them afterwards?

Bob Tabor
  • Can you please elaborate a little more, e.g. what kind of output do you expect? – GraphicalDot Sep 18 '18 at 17:22
  • I want to get something like a co-occurrence graph so that I can later use the co-occurrence property of words (as is the case in word2vec). – Bob Tabor Sep 18 '18 at 18:14
  • Actually, if using `gensim` to create word-vectors, its `Word2Vec` class just iterates over the entire training corpus in multiple passes, never creating a co-occurrence matrix. (Most other word2vec implementations I've seen, including Google's original `word2vec.c`, do the same. In contrast, I believe `GloVe` starts with a large co-occurrence matrix.) So if your true aim is creating word2vec word-vectors, you don't need a co-occurrence matrix, and if you want one for other purposes, I'm not sure if `gensim` exposes any convenient method for creating one. – gojomo Sep 18 '18 at 18:52
  • If you do need a co-occurrence matrix for other purposes, the exact helpful approaches vary based on the specifics of what you need: co-occurrence per text? within some sliding window? only when exactly adjacent? Classes in scikit-learn may be of interest, as in this answer – https://stackoverflow.com/questions/35562789/word-word-co-occurrence-matrix – or the plain-Python counting techniques of this answer – https://stackoverflow.com/questions/42814452/co-occurrence-matrix-from-list-of-words-in-python/42814963 – gojomo Sep 18 '18 at 18:59
  • Actually, the aim is not to create word2vec vectors. I want to use the co-occurrence property of words. Is there a way to do that without a co-occurrence matrix (isn't this the case in word2vec)? – Bob Tabor Sep 18 '18 at 22:25
  • It's not yet clear what kind of co-occurrence you mean. Co-occurrence per text-of-any-length? Within some sliding window? Only when exactly adjacent? None of these are very hard in Python – see the linked answers above – but how to do it depends on a precise definition of what you want to do. – gojomo Sep 19 '18 at 01:09
  • Co-occurrence per sentence, i.e. when words appear in the same sentence within some sliding window. The main problem is that this needs a lot of RAM, so I am looking for a way to get such co-occurrence information without a huge amount of RAM. – Bob Tabor Sep 19 '18 at 09:13
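
For the case described in the last comment (co-occurrence within a sliding window, per sentence, with limited RAM), the plain-Python counting approach mentioned above could look roughly like the sketch below. It does not use `gensim`. The names `iter_sentences`, `build_vocab` and `count_cooccurrences`, the whitespace tokenization, and the `window` and `min_count` parameters are all assumptions made for illustration. The file is streamed one line at a time, and an optional vocabulary of frequent words keeps the number of stored pairs bounded.

```python
from collections import Counter


def iter_sentences(path):
    """Yield one tokenized sentence per line without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.lower().split()  # assumption: whitespace tokenization
            if tokens:
                yield tokens


def build_vocab(sentences, min_count=5):
    """First pass: keep only words seen at least `min_count` times."""
    freq = Counter()
    for tokens in sentences:
        freq.update(tokens)
    return {word for word, n in freq.items() if n >= min_count}


def count_cooccurrences(sentences, window=2, vocab=None):
    """Count unordered word pairs at most `window` tokens apart in a sentence."""
    counts = Counter()
    for tokens in sentences:
        for i, left in enumerate(tokens):
            if vocab is not None and left not in vocab:
                continue
            # Only look forward; each unordered pair is then counted once.
            for right in tokens[i + 1 : i + 1 + window]:
                if left == right:
                    continue
                if vocab is not None and right not in vocab:
                    continue
                pair = (left, right) if left < right else (right, left)
                counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for the corpus file.
    demo = [["the", "cat", "sat"], ["the", "cat", "ran"]]
    pairs = count_cooccurrences(demo, window=2)
    print(pairs.most_common(1))  # [(('cat', 'the'), 2)]
```

For a real file, the two passes would be `vocab = build_vocab(iter_sentences(path))` followed by `count_cooccurrences(iter_sentences(path), window=5, vocab=vocab)`, where `path` is the corpus file. The resulting `Counter` is effectively a sparse co-occurrence graph: keys are edges and values are weights. Raising `min_count` or shrinking `window` is the main lever on memory use. If the pair counts still do not fit in RAM, they would have to be written out in chunks and merged, which this sketch does not attempt.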

0 Answers