
I'm sorry if there is already a similar question; I don't quite know the right word for what I'm looking for. I'm looking for a solution in Python.

I have a text database with around 200,000 words. It's tokenized, so it's a list of words. What I want to do is find out which words often occur together within a specific range (let's say 10 words). Bigrams don't do it for me as far as I've seen, since they define "occurring together" as directly next to each other.

Thanks in advance

sorh
  • The term you're looking for is n-grams. – Soviut Sep 27 '16 at 16:23
  • N-grams will give you different groups of words together, but you might also be interested in [support, confidence, and lift](https://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf). Support measures how frequently an itemset (e.g. a pair of words) occurs in a greater body of data, while lift and confidence provide other "association rules." – blacksite Sep 27 '16 at 16:27
  • I don't think this should be marked as duplicate. OP obviously knows what n-grams are, and is asking about how frequently certain words tend to associate. – blacksite Sep 27 '16 at 16:31
  • Thank you, I will read into the document you provided and see if it answers my question – sorh Sep 27 '16 at 16:42
  • if the words are only in proximity but not necessarily next to each other, the term you are looking for is skip-grams (used for example by word2vec) – Suzana Aug 28 '20 at 14:13
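The windowed co-occurrence counting described above can be sketched with just the standard library. This is a minimal illustration, not a canonical solution: the function name `cooccurrence_counts` and the exact window semantics (each word paired with the words that follow it within the window) are my own choices.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=10):
    """Count unordered word pairs appearing within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each co-occurrence is counted exactly once.
        for other in tokens[i + 1 : i + window]:
            if word != other:
                # Sort the pair so ("cat", "mat") and ("mat", "cat") are the same key.
                pairs[tuple(sorted((word, other)))] += 1
    return pairs

tokens = "the cat sat on the mat near the cat".split()
print(cooccurrence_counts(tokens, window=5).most_common(3))
```

For 200,000 tokens this nested loop is still fast (roughly `len(tokens) * window` pair updates); for much larger corpora or for weighting by distance, a library such as gensim's word2vec (which uses skip-grams, as the last comment notes) may be a better fit.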
