Questions tagged [collocation]

Anything related to collocations, i.e. sequences of words that often appear together in text. The term is widely used in linguistics; use this tag for related questions.

See Wikipedia on collocations.
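
As a rough illustration of "words that often appear together", the sketch below counts adjacent word pairs (bigrams) using only the Python standard library. The function name `top_bigrams` and the sample sentence are made up for this example. Raw frequency is only the simplest possible signal; it is not how NLTK or quanteda score collocations.

```python
from collections import Counter


def top_bigrams(tokens, n=3):
    """Return the n most frequent adjacent word pairs with their counts."""
    # Pair each token with its successor: (t0, t1), (t1, t2), ...
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)


# Hypothetical sample text, split on whitespace for simplicity.
text = "new york is big and new york is busy but york is old"
tokens = text.split()

result = top_bigrams(tokens, n=2)
print(result)
```

Pairs that recur, such as ("york", "is") and ("new", "york") here, are collocation candidates. Association measures such as PMI, mentioned in several questions below, refine the ranking by comparing a pair's frequency with the frequencies of its individual words.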

43 questions
33
votes
10 answers

Forming Bigrams of words in list of sentences with Python

I have a list of sentences: text = ['cant railway station','citadel hotel',' police stn']. I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get a pair of sentences instead of words. Here is what I…
Hypothetical Ninja
  • 3,920
  • 13
  • 49
  • 75
15
votes
3 answers

NLTK collocations for specific words

I know how to get bigram and trigram collocations using NLTK and I apply them to my own corpora. The code is below. I'm not sure however about (1) how to get the collocations for a particular word? (2) does NLTK have a collocation metric based on…
Sabba
  • 561
  • 2
  • 6
  • 15
7
votes
2 answers

How to get n-gram collocations and association in python nltk?

In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example method to find nbest based on pmi for bigram and…
Fahmi Rizal
  • 137
  • 2
  • 9
4
votes
1 answer

nltk quadgram collocation finder

I am seeing multiple questions and answers saying that NLTK collocation cannot be done beyond bigrams and trigrams, for example this one: How to get n-gram collocations and association in python nltk? I am seeing that there is something called…
Kumar
  • 1,017
  • 1
  • 11
  • 16
4
votes
2 answers

NLTK: Find contexts of size 2k for a word

I have a corpus and I have a word. For each occurrence of the word in the corpus I want to get a list containing the k words before and the k words after the word. I am doing this algorithmically OK (see below) but I wondered whether NLTK is…
Zakum
  • 2,157
  • 2
  • 22
  • 30
3
votes
1 answer

2 word phrase collocations using quanteda in R

This is regarding the textstat_collocations functionality in quanteda package in R. I am getting more than 2 word phrases in the output even though I am requesting only for the 2 word phrases. The necessary processing steps are as follows (corpus1…
ds_newbie
  • 79
  • 8
3
votes
3 answers

How to get PMI scores for trigrams with NLTK Collocations? python

I know how to get bigram and trigram collocations using NLTK and I apply them to my own corpora. The code is below. My only problem is how to print out the bigram with the PMI value? I searched the NLTK documentation multiple times. It's either I'm…
Sabba
  • 561
  • 2
  • 6
  • 15
2
votes
1 answer

How to deep merge two collections by duplicate key in JavaScript/Lodash?

I would like to merge two collections by duplicate key in javascript, here is example collections: let collection1 = [ { title: 'Overview', key: 'Test-overview', isLeaf: true }, { title: 'Folder 1', …
Fred
  • 35
  • 4
2
votes
1 answer

How to convert pandas data frame in list of words for nltk-collocation-finder?

As a linguist and a python-beginner I want to find word-collocations in my own (german) tweet-corpus. How can I convert the tweets from a pandas dataframe (just one column = tweet) into a list of words to then be able to use the…
2
votes
1 answer

How to use "collocation_list" function on my corpus in Python?

I'm new to Python and am trying to import my own corpus to find collocations in its texts. I'm using Python 3.7.5 and followed the instructions of the textbook by Bird, Klein and Loper. However, when I try to use "collocation_list" on the whole corpus the…
Gavrk
  • 295
  • 1
  • 4
  • 16
2
votes
1 answer

Count ngram word frequency using text collocations

I would like to count the frequency of three words preceding and following a specific word from a text file which has been converted into tokens. from nltk.tokenize import sent_tokenize from nltk.tokenize import word_tokenize from nltk.util import…
Mike Ninov
  • 23
  • 3
2
votes
0 answers

Python NLTK collocation for roman numerals

As there is a collocation for numbers in nltk such as ('RS', '##number##') I'm wondering if there is such a specifier for Roman numerals which I want to use for something like this: ('volume', '##roman number##') If there is no way to do such a…
eightnoteight
  • 234
  • 2
  • 11
2
votes
0 answers

collocation data from phone calls

I have thousands of phone calls on a daily basis converted from speech to text. I tried generating collocation data using the two options below OPTION # 1 corpus.collocations(200,2) OPTION # 2 bigram = nltk.collocations.BigramAssocMeasures() finder…
Naresh MG
  • 633
  • 2
  • 11
  • 19
1
vote
1 answer

quanteda collocations and lemmatization

I am using the Quanteda suite of packages to preprocess some text data. I want to incorporate collocations as features and decided to use the textstat_collocations function. According to the documentation and I quote: "The tokens object . . . .…
Cola4ever
  • 189
  • 1
  • 1
  • 16
1
vote
1 answer

How to reapply collocation_list() to my data?

I have spent hours trying to identify collocations in my data. When I run the NLTK example text4.collocation_list() ...it works. But when I directly thereafter try to apply it to my own data, I get the following error message: Traceback (most…
Lindsay
  • 25
  • 2