15

Is there a way to find all the sub-sentences of a sentence that are still meaningful and contain at least a subject, a verb, and a predicate/object?

For example, take a sentence like "I am going to do a seminar on NLP at SXSW in Austin next month". We can extract the following meaningful sub-sentences from it: "I am going to do a seminar", "I am going to do a seminar on NLP", "I am going to do a seminar on NLP at SXSW", "I am going to do a seminar at SXSW", "I am going to do a seminar in Austin", "I am going to do a seminar on NLP next month", etc.

Please note that there are no deduced sentences here (e.g. "There will be an NLP seminar at SXSW next month"; although this is true, we don't need it as part of this problem). All generated sentences are strictly part of the given sentence.

How can we approach solving this problem? I was thinking of creating annotated training data that lists the legal sub-sentences for each sentence in the training data set, and then writing some supervised learning algorithm(s) to generate a model.

I am quite new to NLP and Machine Learning, so it would be great if you guys could suggest some ways to solve this problem.

Nicolas Kaiser
Golam Kawsar
  • In your example, do you also want trivial subsentences like "I am going" and "I am"? How about "I am going to Austin next month"? – Adrian McCarthy Jan 23 '12 at 17:15
  • @Adrian McCarthy: "I am going to Austin next month" would fall in the "deduced sentences" as described in the question. These are not desired here, as they imply a semantic treatment of the input sentence whereby, as I understand it, the idea is just to include/exclude various combinations of qualifying prepositional phrases from the original text. – mjv Jan 23 '12 at 18:17
  • @Adrian McCarthy: you raised a nice point. The sub-sentence "I am going to Austin" falls somewhat on the borderline between a deduced sentence and a "strict" sub-sentence. But since the requirement is to list only the sub-sentences that are strictly found in the sentence, we would skip this one. – Golam Kawsar Jan 23 '12 at 18:32
  • @mjv: Actually, no, my example does not require semantics to deduce a sentence. One approach to your problem would be to enumerate all possible substrings and test each one to see if it's grammatical. That would discover "I am going to Austin next month." If you want to omit such a sentence from the desired subsentences, then we need a more precise definition of what you're after. – Adrian McCarthy Jan 23 '12 at 18:33
  • I think one way to formulate the problem is that the subsentence has to contain the main verb of the main sentence ("do" in this case) as its own main verb. I admit, the definition of the problem is not very precise as stated. – Golam Kawsar Jan 23 '12 at 18:40
  • So now you have two problems. From what I understand of English grammar, _am_ is the main verb of the sample sentence and _going to do a seminar_ is a gerund phrase functioning as a noun. We're getting farther and farther from an understanding of the problem. – Adrian McCarthy Jan 23 '12 at 21:24
  • @AdrianMcCarthy: OK, ignoring what I said about the main verb of the initial sentence: I just realized that if we relax the requirement and include even the deducible sub-sentences, then our originally requested set of sub-sentences becomes a subset of this new set. In that case we can aim to find *all* sub-sentences involving the main subject of the sentence (we can simplify by requiring that there is only one main subject). Does that sound more quantifiable as an algorithm? – Golam Kawsar Jan 24 '12 at 04:48

4 Answers

11

You can use the dependency parser provided by Stanford CoreNLP. The collapsed dependency output for your sentence will look like the following.

nsubj(going-3, I-1)
xsubj(do-5, I-1)
aux(going-3, am-2)
root(ROOT-0, going-3)
aux(do-5, to-4)
xcomp(going-3, do-5)
det(seminar-7, a-6)
dobj(do-5, seminar-7)
prep_on(seminar-7, NLP-9)
prep_at(do-5, SXSW-11)
prep_in(do-5, Austin-13)
amod(month-15, next-14)
tmod(do-5, month-15)

The last five relations in this output are optional. You can remove one or more of the parts that are not essential to your sentence.
Most of these optional parts are prepositional phrases and modifiers, e.g. prep_in, prep_at, advmod, tmod, etc. See the Stanford Dependency Manual.

For example, if you remove all modifiers from the output, you will get

I am going to do a seminar on NLP at SXSW in Austin.
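
To make this concrete, here is a minimal sketch in Python. It uses spaCy instead of Stanford CoreNLP purely to keep the example short and self-contained, so the dependency labels differ from the collapsed relations shown above (spaCy uses "prep" with a preposition child rather than "prep_on", and "npadvmod" rather than "tmod"). Which labels count as removable modifiers is an assumption on my part, not something the parser decides.

import spacy

nlp = spacy.load("en_core_web_sm")

# Labels treated as removable modifiers (an assumption for this sketch).
MODIFIER_DEPS = {"advmod", "npadvmod"}

doc = nlp("I am going to do a seminar on NLP at SXSW in Austin next month")

# Drop every token that sits inside a modifier subtree (e.g. "next month").
dropped = {t.i for tok in doc if tok.dep_ in MODIFIER_DEPS for t in tok.subtree}
print(" ".join(tok.text for tok in doc if tok.i not in dropped))
# Expected (model-dependent): I am going to do a seminar on NLP at SXSW in Austin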

Khairul
  • But it does not give me the list of *all* possible sentences. I mean it might be hidden in this dependency output, but I need a systematic way to extract those sentences. – Golam Kawsar Jan 24 '12 at 04:54
  • Of course it doesn't. But you can extract all possible sentences: start by listing **all optional parts**, then try every combination of removing those optional parts (see the sketch after these comments). – Khairul Jan 24 '12 at 05:23
  • Is there a guarantee that it will *always* generate syntactically and semantically valid sentences? – Golam Kawsar Jan 24 '12 at 19:19
  • As long as you remove **the optional** parts, the sentences should be valid both syntactically and semantically. The problem now is how to define what is optional and what isn't. – Khairul Jan 25 '12 at 01:11
  • How do you define which parts are optional? For example, for the sentence "I am going to meet the software engineer who did a brilliant presentation on NLP in New York last month", which parts are optional and which parts are not? Are you saying only the prepositional and modifier dependency relations are optional? If yes, why so? – Golam Kawsar Jan 25 '12 at 01:37
  • I'm not really a linguist. You can check books about grammar and sentence structure. Maybe you can ask this question on http://english.stackexchange.com – Khairul Jan 25 '12 at 01:56
  • From your example above, all parts except **I meet engineer** are optional. Parts of a sentence are **optional** if, when you remove them, the sentence still gives you the intended information. – Khairul Jan 29 '12 at 07:24
  • How can I remove the modifiers? Is there any API for it? – S Gaber Mar 12 '12 at 06:49
  • You can remove modifiers based on the dependency name; they're **amod, tmod, etc**. Please see http://nlp.stanford.edu/software/dependencies_manual.pdf – Khairul Mar 13 '12 at 05:16
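
Following up on the comment thread above, here is a rough, self-contained sketch of the "list all optional parts, then try every combination of removals" idea. It again uses spaCy as a stand-in for the Stanford parser, and the set of dependency labels treated as optional is an assumption; as discussed in the comments, nothing guarantees that every generated variant is grammatical.

from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")

# Labels whose whole subtree is treated as an optional chunk
# (prepositional phrases, adverbial/temporal modifiers); an assumption.
OPTIONAL_DEPS = {"prep", "advmod", "npadvmod", "advcl"}

def sub_sentences(text):
    doc = nlp(text)
    chunks = [frozenset(t.i for t in tok.subtree)
              for tok in doc if tok.dep_ in OPTIONAL_DEPS]
    results = set()
    # Try dropping every subset of the optional chunks (2^n variants).
    for k in range(len(chunks) + 1):
        for combo in combinations(chunks, k):
            dropped = set().union(*combo)
            results.add(" ".join(t.text for t in doc if t.i not in dropped))
    return sorted(results, key=len)

for s in sub_sentences("I am going to do a seminar on NLP at SXSW in Austin next month"):
    print(s)
# The shortest variant should be close to "I am going to do a seminar";
# the longest is the original sentence.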
6

There's a paper titled "Using Discourse Commitments to Recognize Textual Entailment" by Hickl et al. that discusses the extraction of discourse commitments (sub-sentences). The paper includes a description of their algorithm, which at some level operates on rules. They used it for RTE, and there may be some minimal level of deduction in the output. Text simplification may be a related area to look at.

Victoria Stuart
Kenston Choi
5

The following paper, http://www.mpi-inf.mpg.de/~rgemulla/publications/delcorro13clausie.pdf, processes the dependencies from the Stanford parser and constructs simple clauses (text simplification).

See the online demo - https://d5gate.ag5.mpi-sb.mpg.de/ClausIEGate/ClausIEGate

Bolaka
2

One approach would be to use a parser, for example a PCFG-based one. Trying to just train a model to detect 'subsentences' is likely to suffer from data sparsity. Also, I am doubtful that you could write down a really clean and unambiguous definition of a subsentence, and if you can't define it, you can't get annotators to annotate for it.
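
To make the parser suggestion concrete, here is a toy sketch: generate candidate word sequences somehow (for example with the dependency-based removal strategy in the answer above, or by brute-force substring enumeration as suggested in the question comments) and keep only those that a grammar accepts as a sentence. The tiny PCFG below is a hand-written fragment invented just for this illustration; a real system would use a broad-coverage statistical parser instead.

from nltk import PCFG
from nltk.parse import ViterbiParser

# A deliberately tiny, hypothetical grammar fragment for illustration only.
grammar = PCFG.fromstring("""
    S   -> NP VP       [1.0]
    NP  -> 'I'         [0.5]
    NP  -> Det N       [0.5]
    Det -> 'a'         [1.0]
    N   -> 'seminar'   [1.0]
    VP  -> V NP        [1.0]
    V   -> 'attended'  [1.0]
""")
parser = ViterbiParser(grammar)

def looks_like_sentence(tokens):
    try:
        return any(True for _ in parser.parse(tokens))
    except ValueError:  # a word the toy grammar does not cover
        return False

print(looks_like_sentence("I attended a seminar".split()))  # True
print(looks_like_sentence("attended a seminar".split()))    # False: no subject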

bmargulies