
The code below breaks the sentence into individual tokens, and the output is as follows:

 "cloud"  "computing"  "is" "benefiting"  " major"  "manufacturing"  "companies"


import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one term.

Basically, I am looking for a bigram. Is there any feature in spaCy that allows bigrams or trigrams?

venkatttaknev
  • @chirag. I have seen that solution. I think you are referring to this: https://stackoverflow.com/questions/39241709/how-to-generate-bi-tri-grams-using-spacy-nltk?rq=1. But it is a hack; it does not solve the problem head-on. Not to mention the many additional lines of code in that noun-chunk approach. – venkatttaknev Dec 04 '18 at 09:28

4 Answers

14

spaCy allows the detection of noun chunks. So, to parse your noun phrases as single entities, do this:

  1. Detect the noun chunks https://spacy.io/usage/linguistic-features#noun-chunks

  2. Merge the noun chunks

  3. Do the dependency parsing again; it will now parse "cloud computing" as a single entity.

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
...     noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
... 
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
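
As noted in the comments below, newer spaCy versions (3.x) ship a built-in merge_noun_chunks pipeline component that does the merging for you. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
# Built-in component that merges every noun chunk into a single token
nlp.add_pipe("merge_noun_chunks")

doc = nlp("Cloud computing is benefiting major manufacturing companies")
print([(token.text, token.pos_) for token in doc])
# e.g. [('Cloud computing', 'NOUN'), ('is', 'AUX'), ('benefiting', 'VERB'),
#       ('major manufacturing companies', 'NOUN')]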
tuomastik
DhruvPathak
  • Thanks for your answer, but the solution you provided is a workaround rather than a universal solution. Take this text as an example: `doc = nlp("Big data cloud computing cyber security machine learning")`. It is not a coherent sentence but rather a collection of words. In this case I don't get 'cloud computing'; I get `['Big data cloud', 'cyber security machine learning']` – venkatttaknev Dec 04 '18 at 20:05
  • Because that's the way it is: it is trained on coherent sentences with good grammatical structure. What you are looking for is more like NER, for which you would have to train your own models for your use case. – DhruvPathak Dec 05 '18 at 06:32
  • Update 2022: you should now use this instead: `nlp.add_pipe("merge_noun_chunks")` – Yost777 Apr 02 '22 at 07:21
13

If you have a spaCy doc, you can pass it to textacy:

import textacy.extract

ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
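
A short, self-contained usage sketch, assuming spaCy 3.x, the en_core_web_sm model, and a made-up example sentence:

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is cheap because cloud computing shares resources")

# Span objects for every bigram that occurs at least twice in the doc
bigrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
print(bigrams)  # e.g. [cloud computing, cloud computing]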
Suzana
5

Warning: this is just an extension of the correct answer by Zuzana.

My reputation does not allow me to comment, so I am writing this answer just to address Adit Sanghvi's question above: "How do you do it when you have a list of documents?"

  1. First, create a list with the text of the documents.

  2. Then join the texts into a single document.

  3. Now use the spaCy parser to transform the text into a spaCy Doc.

  4. Use Zuzana's answer to create the bigrams.

This is the example code:

Step 1

doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1, doc2, doc3]

# flatten the list of single-element lists into a flat list of strings
textList = [text for document in listOfDocuments for text in document]
print(textList)

This will print this text:

['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']

Then steps 2 and 3:

import spacy

parser = spacy.load('en_core_web_sm')  # any English spaCy model works here

doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)

and will print this:

all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams

Finally step 4 (Zuzana's answer)

import textacy.extract

ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)

will print this:

[make bigrams, make bigrams, make bigrams]

RHC
iair linker
1

I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams, word_3gram, word_2gram, etc., with the gram as the basic unit (cloud_computing).

Assume I have the sentence "I like cloud computing because it's cheap". The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because", ... Comparing that against your bigram list, only "cloud_computing" is recognized as a valid bigram; all the other bigrams in the sentence are artificial. To recover all the other words, you just take the first part of each one:

"I_like".split("_")[0] -> I; 
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in bigram list, keep it. 
  skip next bi-gram "computing_because" ("computing" is already used)
"because_it's".split("_")[0]" -> "because" etc.

To also capture the last word in the sentence ("cheap"), I added the token "EOL". I implemented this in Python, and the speed was OK (500k words in 3 min on an i5 processor with 8 GB). Anyway, you only have to do it once. I find this more intuitive than the official (spaCy-style) chunk approach, and it also works for non-spaCy frameworks.
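
A minimal Python sketch of that greedy merge, assuming you already have a set of valid bigrams (the helper name merge_bigrams is made up):

def merge_bigrams(sentence, valid_bigrams):
    # Append the EOL marker so the last real word is still emitted.
    words = sentence.split() + ["EOL"]
    tokens = []
    i = 0
    while i < len(words) - 1:
        candidate = words[i] + "_" + words[i + 1]
        if candidate in valid_bigrams:
            tokens.append(candidate)  # keep the valid bigram as one unit
            i += 2                    # skip the word that is already used
        else:
            tokens.append(words[i])   # otherwise keep only the first part
            i += 1
    return tokens

print(merge_bigrams("I like cloud computing because it's cheap", {"cloud_computing"}))
# ['I', 'like', 'cloud_computing', 'because', "it's", 'cheap']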

I do this before the official tokenization/lemmatization, as otherwise you would get "cloud compute" as a possible bigram. But I'm not certain whether this is the best/right approach.

halfer
user9165100