
The code below breaks the sentence into individual tokens, and the output is as follows:

 "cloud"  "computing"  "is" "benefiting"  " major"  "manufacturing"  "companies"


import en_core_web_sm
nlp = en_core_web_sm.load()

doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)

What I would ideally want is to read 'cloud computing' together, as it is technically one term.

Basically, I am looking for a bigram. Is there any feature in spaCy that allows bigrams or trigrams?

venkatttaknev
  • @chirag. I have seen that solution. I think you are referring to this: https://stackoverflow.com/questions/39241709/how-to-generate-bi-tri-grams-using-spacy-nltk?rq=1. But it is a hack; it does not solve the problem head-on. Not to mention the many additional lines of code in that noun-chunk approach. – venkatttaknev Dec 04 '18 at 09:28

4 Answers

14

spaCy allows the detection of noun chunks. So, to parse your noun phrases as single entities, do this:

  1. Detect the noun chunks https://spacy.io/usage/linguistic-features#noun-chunks

  2. Merge the noun chunks

  3. Do the dependency parsing again; it will now parse "cloud computing" as a single entity.

>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
...     noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
... 
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
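
As noted in the comments below, newer spaCy versions (3.x) ship a built-in merge_noun_chunks pipeline component that does the merging for you. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
# Built-in component that merges every noun chunk into a single token
nlp.add_pipe("merge_noun_chunks")

doc = nlp("Cloud computing is benefiting major manufacturing companies")
print([(token.text, token.pos_) for token in doc])
# e.g. [('Cloud computing', 'NOUN'), ('is', 'AUX'), ('benefiting', 'VERB'),
#       ('major manufacturing companies', 'NOUN')]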
tuomastik
DhruvPathak
  • Thanks for your answer, but the solution you provided is a workaround rather than a universal solution. Take this text as an example: `doc = nlp("Big data cloud computing cyber security machine learning")`. It is not a coherent sentence but rather a collection of words. In this case I don't get 'cloud computing'; I get `['Big data cloud', 'cyber security machine learning']` – venkatttaknev Dec 04 '18 at 20:05
  • Because that's the way it is: it is trained on coherent sentences with good grammatical structure. What you are looking for is more like NER, for which you would have to train your own models for your use case. – DhruvPathak Dec 05 '18 at 06:32
  • Update 2022: you should now use this instead: `nlp.add_pipe("merge_noun_chunks")` – Yost777 Apr 02 '22 at 07:21
13

If you have a spaCy doc, you can pass it to textacy:

import textacy.extract

ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
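
A short, self-contained usage sketch, assuming spaCy 3.x, the en_core_web_sm model, and a made-up example sentence:

import spacy
import textacy.extract

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is cheap because cloud computing shares resources")

# Span objects for every bigram that occurs at least twice in the doc
bigrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
print(bigrams)  # e.g. [cloud computing, cloud computing]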
Suzana
5

Warning: this is just an extension of the correct answer by Zuzana.

My reputation does not allow me to comment, so I am writing this answer just to address Adit Sanghvi's question above: "How do you do it when you have a list of documents?"

  1. First, create a list with the text of the documents.

  2. Then join the texts into a single document.

  3. Now use the spaCy parser to transform the text into a spaCy Doc.

  4. Use Zuzana's answer to create the bigrams.

This is the example code:

Step 1

doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1, doc2, doc3]

# flatten the list of single-element lists into a flat list of strings
textList = [text for document in listOfDocuments for text in document]
print(textList)

This will print this text:

['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']

Then steps 2 and 3:

import spacy

parser = spacy.load('en_core_web_sm')  # any English spaCy model works here

doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)

and will print this:

all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams

Finally step 4 (Zuzana's answer)

import textacy.extract

ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)

will print this:

[make bigrams, make bigrams, make bigrams]

RHC
iair linker
1

I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams, word_3gram, word_2gram, etc., with the gram as the basic unit (cloud_computing).

Assume I have the sentence "I like cloud computing because it's cheap". The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because", ... Comparing that against your bigram list, only "cloud_computing" is recognized as a valid bigram; all the other bigrams in the sentence are artificial. To recover all the other words, you just take the first part of each one:

"I_like".split("_")[0] -> I; 
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in bigram list, keep it. 
  skip next bi-gram "computing_because" ("computing" is already used)
"because_it's".split("_")[0]" -> "because" etc.

To also capture the last word in the sentence ("cheap"), I added the token "EOL". I implemented this in Python, and the speed was OK (500k words in 3 min on an i5 processor with 8 GB). Anyway, you only have to do it once. I find this more intuitive than the official (spaCy-style) chunk approach, and it also works for non-spaCy frameworks.
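
A minimal Python sketch of that greedy merge, assuming you already have a set of valid bigrams (the helper name merge_bigrams is made up):

def merge_bigrams(sentence, valid_bigrams):
    # Append the EOL marker so the last real word is still emitted.
    words = sentence.split() + ["EOL"]
    tokens = []
    i = 0
    while i < len(words) - 1:
        candidate = words[i] + "_" + words[i + 1]
        if candidate in valid_bigrams:
            tokens.append(candidate)  # keep the valid bigram as one unit
            i += 2                    # skip the word that is already used
        else:
            tokens.append(words[i])   # otherwise keep only the first part
            i += 1
    return tokens

print(merge_bigrams("I like cloud computing because it's cheap", {"cloud_computing"}))
# ['I', 'like', 'cloud_computing', 'because', "it's", 'cheap']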

I do this before the official tokenization/lemmatization, as otherwise you would get "cloud compute" as a possible bigram. But I'm not certain whether this is the best/right approach.

halfer
user9165100