
Since I was told that spaCy is such a powerful Python module for natural language processing, I am now desperately looking for a way to group words into more than noun phrases, most importantly prepositional phrases. I doubt there is a spaCy function for this, but that would be the easiest way, I guess (the spaCy import is already implemented in my project). Nevertheless, I'm open to any approach to phrase recognition/chunking.

Serenity
Malte Ge
  • Can you give an example of what you want specifically? Maybe like an example input with the desired output corresponding to it. – Harrison Aug 23 '16 at 12:07
  • Of course. As a translation of a German input, take a sentence like "How long does it take me to drive to the university?" (in German "Wie lange brauche ich bis zur Uni?"). I want "to [PREP] the [DET] university [NOUN]" to be chunked as a prepositional phrase, either by knowing roughly what a prepositional phrase consists of or by stating exact rules (PP -> PREP + NP) as used in other Python modules. As spaCy is used for tagging in my program and seems to support only noun chunking, I would like a supporting module, or just a function inside spaCy, that recognizes additional chunks. – Malte Ge Aug 23 '16 at 13:25

1 Answer


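Here's a solution to get PPs (prepositional phrases). In general, you can get phrases by taking the `subtree` of a token, i.e. the token together with all of its descendants in the dependency parse.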

def get_pps(doc):
    """Return the prepositional phrases of a parsed document as strings."""
    pps = []
    for token in doc:
        # An adposition (ADP) heads a PP; its subtree is the whole phrase.
        # Try this with other parts of speech for different subtrees.
        if token.pos_ == 'ADP':
            pp = ' '.join([tok.orth_ for tok in token.subtree])
            pps.append(pp)
    return pps

Usage:

import spacy

nlp = spacy.load('en_core_web_sm')
ex = 'A short man in blue jeans is working in the kitchen.'
doc = nlp(ex)

print(get_pps(doc))

This prints:

['in blue jeans', 'in the kitchen']
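
If you need token offsets rather than plain strings, a variation on the same idea is to return `Span` objects built from each adposition's `left_edge` and `right_edge`. This is only a minimal sketch: `get_pp_spans` is a hypothetical helper name, and the parse is written out by hand (assuming spaCy v3, where `heads` are absolute token indices) so that it runs without a trained model.

```python
import spacy
from spacy.tokens import Doc


def get_pp_spans(doc):
    """Return each prepositional phrase as a Span (hypothetical helper)."""
    spans = []
    for token in doc:
        if token.pos_ == "ADP":
            # left_edge and right_edge are the first and last tokens
            # of the adposition's subtree.
            spans.append(doc[token.left_edge.i : token.right_edge.i + 1])
    return spans


# Hand-annotated example parse (an assumption for illustration),
# so no trained pipeline has to be downloaded.
nlp = spacy.blank("en")
words = ["He", "works", "in", "the", "kitchen", "."]
pos = ["PRON", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
heads = [1, 1, 1, 4, 2, 1]  # absolute index of each token's head
deps = ["nsubj", "ROOT", "prep", "det", "pobj", "punct"]
doc = Doc(nlp.vocab, words=words, pos=pos, heads=heads, deps=deps)

pp_spans = get_pp_spans(doc)
print([span.text for span in pp_spans])  # ['in the kitchen']
```

With a real pipeline you would pass `nlp(text)` from a loaded model instead of the hand-built `Doc`. Each `Span` keeps its `start` and `end` token indices, which plain joined strings do not.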
Emiel
  • Where's the `nlp()` function from? – Hamman Samuel Aug 08 '18 at 03:25
  • I've updated the answer. `nlp` refers to a loaded spaCy instance (following the convention from the spaCy docs: https://spacy.io/usage/). – Emiel Aug 09 '18 at 07:23
  • Thanks. I ran into another issue with `spacy.load('en')`, which was fixed by replacing it with `spacy.load('en_core_web_sm')`. The solution is from a discussion on spaCy's GitHub issue tracker: https://github.com/explosion/spaCy/issues/1721#issuecomment-373241198 – Hamman Samuel Aug 09 '18 at 15:43
  • Hey, I was wondering if anyone knows how to apply this to a df? – JassiL Feb 20 '20 at 16:01
  • Generally speaking, you can create a new column based on values from another column using the `apply` method. For example: `df['b'] = df['a'].apply(len)` will create a new column (with label `'b'`) based on the values in the column with label `'a'`, using the built-in `len` function. In other words: the second column will hold the lengths of the items in column `'a'`. You can use any function you like, including the one in the answer. But if the column holds strings, then you do need to process the strings first. – Emiel Feb 22 '20 at 10:52