
I have a use case where I want to extract the main meaningful part of a sentence using spaCy, NLTK, or any other NLP library.

Example sentence1: "How Can I raise my voice against harassment" Intent would be: "raise voice against harassment"

Example sentence2: "Donald Duck is created by which cartoonist/which man/whom ?" Intent would be: "Donald duck is created by"

Example sentence3: "How to retrieve the main intent of a sentence using spacy or nltk ?" Intent would be: "retrieve main intent of sentence using spacy nltk"

I am new to dependency parsing and don't exactly know how to do this. Please help me.

  • It is not clear what you mean by intent. In the industry, an intent is associated with multiple phrases that describe the same intention. Here you seem to want to extract a noun phrase. The question is under-specified. – amirouche Feb 29 '20 at 12:49

1 Answer


TL;DR

You have to define the ultimate task you want to perform and define what exactly is "intent" / "main information" or "meaning of text".

In Long

At first look, it seems like you're asking to solve a natural language problem magically. But let's look at the question and what you're really asking. Let's set aside the notion of intent/labels or language (for a while) and just look at the in-/outputs:

[in]:  "How Can I raise my voice against harassment"
[out]: "raise voice against harassment"

[in]:  "Donald Duck is created by which cartoonist/which man/whom ?" 
[out]: "Donald duck is created by"

[in]:  "How to retrieve the main intent of a sentence using spacy or nltk ?" 
[out]: "retrieve main intent of sentence using spacy nltk"

It seems like all your output tokens/words are quoted directly from your input. In that case, what if you simply treat your problem as a "span/sequence annotation" task, i.e.

[in]:  "How Can I raise my voice against harassment"
[out]: [0, 0, 0, 1, 0, 1, 1, 1] 

[in]:  "Donald Duck is created by which cartoonist/which man/whom ?" 
[out]: [1, 1, 1, 1, 1, 0, 0, 0, 0]

[in]:  "How to retrieve the main intent of a sentence using spacy or nltk ?" 
[out]: [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0]

Assuming each word gets a binary label, the output labels 1 for the words you want to extract from the input and 0 for the ones you don't.
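Under a whitespace-tokenization assumption, such label sequences can be derived mechanically from the in-/output pairs above, e.g. with a greedy left-to-right match (a sketch, not a robust aligner):

```python
# Turn an (input, output) string pair into a binary label sequence.
# Assumes whitespace tokenization and greedy left-to-right matching.
def make_labels(text, extract):
    wanted = extract.lower().split()  # tokens we want to keep, in order
    labels = []
    for token in text.lower().split():
        if wanted and token == wanted[0]:
            labels.append(1)
            wanted.pop(0)
        else:
            labels.append(0)
    return labels

print(make_labels("How Can I raise my voice against harassment",
                  "raise voice against harassment"))
# [0, 0, 0, 1, 0, 1, 1, 1]
```

Pairs like these are exactly the supervision a sequence labeling model would train on.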

Now, given that it's a simple binary sequence labeling task, one could simply train a per-token binary classifier (e.g. a CRF or any sequence tagging model) on such pairs.

But step back a little,

  • Is it really true that the intent is always part of the input?
  • What exactly is an intent? How is it defined?
  • What happens if the intent is not in the input?

Okay, even if we don't talk about "intent" and just want to extract the main meaning,

  • What exactly is the meaning of a sentence? Is it just the "important words"? If so, what makes a word "important"? How is "important" defined?
  • Are stopwords the only unimportant words? If so, then you can simply remove stopwords, e.g. Stopword removal with NLTK. But then, what exactly counts as a stopword?

But I heard people doing it with dependency parsing

What is dependency parsing?

In short, it provides a structured representation of the text. But none of the structures in traditional dependency formalisms has a notion of "intent".

Proof: Ctrl+F for "intent" in https://web.stanford.edu/~jurafsky/slp3/15.pdf

So I don't think simply parsing the text into dependency trees would help unless the notion of "intent" is better defined for your scenario.

How about this SpaCy tool that trains a model for intent?

From https://github.com/explosion/spaCy/blob/master/examples/training/train_intent_parser.py

Yes, that's an example of combining parsing labels with sequence labeling and defining that as "intent". More specifically, we see examples from https://github.com/explosion/spaCy/blob/master/examples/training/train_intent_parser.py#L31

TRAIN_DATA = [
    (
        "find a cafe with great wifi",
        {
            "heads": [0, 2, 0, 5, 5, 2],  # index of token head
            "deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"],
        },
    ),
    (
        "find a hotel near the beach",
        {
            "heads": [0, 2, 0, 5, 5, 2],
            "deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"],
        },
    ),
]
Each training example is made up of

  1. the text
  2. the index of each token's dependency head
  3. the "intent" labels attached to the dependency relations

And an example in/outputs from https://github.com/explosion/spaCy/blob/master/examples/training/train_intent_parser.py#L173

[in]:  find a hotel with good wifi
[out]:
    [
      ('find', 'ROOT', 'find'),
      ('hotel', 'PLACE', 'find'),
      ('good', 'QUALITY', 'wifi'),
      ('wifi', 'ATTRIBUTE', 'hotel')
    ]

The example above shows that the whole list of triplets is defined as the intent, rather than just a raw string. Each triplet is a (dependent, relation, head) tuple, e.g. the hotel is the PLACE to find, from the triplet ('hotel', 'PLACE', 'find').
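To make the format concrete, here is how those triplets can be read off a heads/deps annotation in the train_intent_parser style (a sketch; the real script trains a parser to predict heads and deps first):

```python
# Reconstruct (dependent, relation, head) triplets from a heads/deps
# annotation in the train_intent_parser format. Tokens whose dep label
# is "-" carry no intent relation and are skipped.
def triplets(words, heads, deps):
    return [(w, dep, words[h])
            for w, h, dep in zip(words, heads, deps)
            if dep != "-"]

words = "find a hotel with good wifi".split()
heads = [0, 2, 0, 5, 5, 2]   # index of each token's head
deps  = ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"]
print(triplets(words, heads, deps))
# [('find', 'ROOT', 'find'), ('hotel', 'PLACE', 'find'),
#  ('good', 'QUALITY', 'wifi'), ('wifi', 'ATTRIBUTE', 'hotel')]
```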

Note: This is solely SpaCy's notion of "semantics" or "intent", which is not wrong, but it is well-defined, and hence a model to perform this task can be trained in a supervised machine learning paradigm. For details, see https://spacy.io/usage/examples

Depending on how and what you define as intent/semantics, the in/outputs will change and the model to train may be different.

But why do you have to make it so complicated, I just want the intent string?!

Because what does "main meaning" or "intent" mean if it's just a string?

We go back to the lack of definition that makes the task a magical one rather than one that computers can perform.

alvas
    Very good (and short) description of a fairly complex topic. For the minimal 'intent' of a sentence you could explore dependency parsing using a Subject-Verb-Object model. It's quick & dirty but works fairly well on a corpus of similar documents (a lot depends on what kind of language your documents represent). – Alex16237 Mar 03 '20 at 13:44