33

I'm trying to find out if there is a known algorithm that can detect the "key concept" of a sentence.

The use case is as follows:

  1. User enters a sentence as a query (Does chicken taste like turkey?)
  2. Our system identifies the concepts of the sentence (chicken, turkey)
  3. And it runs a search of our corpus content

Where we fall short is in identifying the core "topic" of the sentence. The sentence "Does chicken taste like turkey?" has a primary topic of "chicken", because the user is asking about the taste of chicken, while "turkey" is a helper topic of less importance.

So I'm trying to find out if there is an algorithm that will help me identify the primary topic of a sentence. Let me know if you are aware of any!

Jim Bolla
rockit

12 Answers

21

I actually did a research project on this and won two competitions and am competing in nationals.

There are two steps to the method:

  1. Parse the sentence with a Context-Free Grammar
  2. In the resulting parse trees, find all nouns which are only subordinate to Noun-Phrase-like constituents

For example, "I ate pie" has 2 nouns: "I" and "pie". Looking at the parse tree, "pie" is inside a Verb Phrase, so it cannot be a subject. "I", however, is only inside NP-like constituents. Being the only subject candidate, it is the subject. You can find an early copy of this program at http://www.candlemind.com. Note that the vocabulary is limited to basic singular words, and there are no verb conjugations, so it has "man" but not "men", and "eat" but not "ate". Also, the CFG I used was hand-made and limited. I will be updating this program shortly.
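As a rough illustration of step 2 (not the actual program), the sketch below walks a hand-built parse tree for "I ate pie" and keeps the nouns whose ancestors are all NP-like. The tuple-based tree format, the `NP_LIKE` and `NOUN_TAGS` sets, and the `subject_candidates` name are assumptions made for this example; a real system would take the tree from a CFG parser.

```python
# A parse tree as nested tuples: (label, child, child, ...); leaves are strings.
# Hand-built here; in practice it would come from a CFG parser.
TREE = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBD", "ate"), ("NP", ("NN", "pie"))))

# Assumption: constituents a subject is allowed to sit under (root S included).
NP_LIKE = {"S", "NP"}
# Assumption: pronouns are counted as nouns, as in the "I ate pie" example.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS", "PRP"}


def subject_candidates(tree, ancestors=()):
    """Return nouns whose every ancestor constituent is NP-like."""
    label, *children = tree
    # A preterminal has exactly one child, and that child is the word itself.
    if len(children) == 1 and isinstance(children[0], str):
        if label in NOUN_TAGS and all(a in NP_LIKE for a in ancestors):
            return [children[0]]
        return []
    found = []
    for child in children:
        found.extend(subject_candidates(child, ancestors + (label,)))
    return found


print(subject_candidates(TREE))
```

Here "pie" is rejected because `VP` appears among its ancestors, which leaves "I" as the only candidate.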

Anyway, there are limitations to this program. My mentor pointed out that, in its current state, it cannot recognize sentences with subjects that are "real" NPs (what grammar actually calls NPs). For example, take "That the moon is flat is not a debate any longer." The subject is actually "that the moon is flat." However, the program would recognize "moon" as the subject. I will be fixing this shortly.

Anyway, this is good enough for most sentences...

My research paper can be found there too. Go to page 11 of it to read the methods.

Hope this helps.

Michael
  • 11
    The grammatical subject of a sentence is not the same as its topic. For example, in the middle of your answer you said: _I will be updating this program shortly._ Given its context, the topic of this sentence is _this program_, because this is what the sentence makes a statement _about_. However, the grammatical subject is _I_. – jogojapan Nov 05 '12 at 03:18
10

Most basic NLP parsing techniques will be able to extract the basic aspects of the sentence, i.e., that "chicken" and "turkey" are NPs, that they are linked by the preposition "like", etc. Getting from these to a "topic" or "concept" is more difficult.

Techniques such as Latent Semantic Analysis and its many derivatives transform this information into a vector (some have methods of partly retaining the hierarchy/relations between parts of speech) and then compare it to existing vectors, which are usually pre-classified by concept. See http://en.wikipedia.org/wiki/Latent_semantic_analysis to get started.
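To make the "transform to a vector, then compare against pre-classified vectors" idea concrete, here is a minimal LSA sketch in plain Python. The toy corpus, the concept labels, and all function names are invented for illustration. The truncated SVD is done by power iteration on the document Gram matrix only to keep the example dependency-free; a real system would more likely use a library such as gensim.

```python
import math

# Toy, pre-classified corpus (labels and texts are made up for this sketch).
DOCS = [
    ("poultry", "chicken turkey taste"),
    ("poultry", "chicken taste roast"),
    ("poultry", "turkey roast taste"),
    ("weather", "rain cloud storm"),
    ("weather", "rain storm wind"),
]
VOCAB = sorted({w for _, text in DOCS for w in text.split()})


def bag(text):
    """Term-count vector over VOCAB; out-of-vocabulary words are ignored."""
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_eigenpairs(matrix, k, iters=300):
    """Largest k eigenpairs of a symmetric matrix (power iteration + deflation)."""
    m = [row[:] for row in matrix]
    n = len(m)
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # fixed start keeps runs repeatable
        for _ in range(iters):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in m])
        pairs.append((lam, v))
        # Deflate: remove the component just found before looking for the next.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


DOC_VECS = [bag(text) for _, text in DOCS]                # columns of A
GRAM = [[dot(a, b) for b in DOC_VECS] for a in DOC_VECS]  # A^T A
PAIRS = top_eigenpairs(GRAM, k=2)                         # keep 2 latent dimensions


def project(term_vec):
    """Fold a term vector into the latent space: q^T U_k S_k^{-1}."""
    # With (lam, v) an eigenpair of A^T A: sigma^2 = lam and u = A v / sigma,
    # so q.u / sigma equals sum_j v_j * (q . doc_j) / lam.
    return [sum(vj * dot(term_vec, d) for vj, d in zip(v, DOC_VECS)) / lam
            for lam, v in PAIRS]


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def closest_concept(query):
    """Label of the document whose latent vector is most similar to the query."""
    q = project(bag(query))
    scored = [(cosine(q, project(d)), label)
              for (label, _), d in zip(DOCS, DOC_VECS)]
    return max(scored)[1]


print(closest_concept("Does chicken taste like turkey?"))
```

Note that this assigns the whole query to a concept; it does not by itself say which of "chicken" or "turkey" is the primary topic.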

Edit: Here's an example LSA app you can play around with to see if you might want to pursue it further: http://lsi.research.telcordia.com/lsi/demos.html

dfb
  • +1 for LSA, even though it doesn't solve the OP's problem directly. – Fred Foo Apr 04 '11 at 21:46
  • LSA - only really helps to find the more unique words in the query. So if "chicken" appears in more documents than "turkey", "turkey" would be more likely to be visible in the top results.... – rockit Apr 04 '11 at 21:59
  • 1
    @rockit - LSA really doesn't have much to do with the unique words in a query. I think you're confusing the creation of the vector with LSA. In fact, some LSA variants don't even retain the numerosity of the word, just its presence. – dfb Apr 04 '11 at 22:03
  • 1
    The link to demo is broken. [Here](https://github.com/TheDataLeek/Python-LSA) is a working LSA. – dashesy Oct 05 '17 at 00:02
  • [gensim](https://github.com/RaRe-Technologies/gensim) has even more tools for LSA – dashesy Oct 05 '17 at 00:10
3

For many longer sentences it's difficult to say exactly what the topic is, and there may be more than one.

One way to get an approximate answer:

  1. Tag the sentence using OpenNLP, the Stanford Parser, or any similar tool.
  2. Remove all the stop words from the sentence.
  3. Pick out the nouns (proper, singular, and plural).
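A minimal sketch of this first approach, assuming the sentence has already been tagged with Penn Treebank-style tags. The tagged tokens below are written by hand rather than produced by a tagger, and the stop-word list and the `candidate_topics` name are made up for the example.

```python
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"does", "do", "like", "the", "a", "an", "is", "it"}
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

# Hand-written tagger output for "Does chicken taste like turkey?"
TAGGED = [("Does", "VBZ"), ("chicken", "NN"), ("taste", "VB"),
          ("like", "IN"), ("turkey", "NN"), ("?", ".")]


def candidate_topics(tagged):
    """Drop stop words, then keep only the nouns."""
    return [word for word, tag in tagged
            if word.lower() not in STOP_WORDS and tag in NOUN_TAGS]


print(candidate_topics(TAGGED))
```

For the example query this yields both "chicken" and "turkey", so it finds the concepts but does not rank them.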

Another way:

  1. Chunk the sentence into phrases with any parser.
  2. Pick out all the noun phrases.
  3. Remove the noun phrases that don't have a noun as a child.
  4. From the remaining noun phrases, keep only the adjectives and nouns and remove all other words.
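The chunk-based variant could look roughly like this. The chunk list is hypothetical chunker output typed in by hand, and `key_phrases` is a name invented for the sketch.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}

# Hypothetical chunker output for "Does the roast chicken taste like it?"
CHUNKS = [
    ("VP", [("Does", "VBZ")]),
    ("NP", [("the", "DT"), ("roast", "JJ"), ("chicken", "NN")]),
    ("VP", [("taste", "VB")]),
    ("PP", [("like", "IN")]),
    ("NP", [("it", "PRP")]),
]


def key_phrases(chunks):
    """Keep noun phrases that contain a noun, reduced to adjectives and nouns."""
    phrases = []
    for label, tokens in chunks:
        if label != "NP":
            continue  # step 2: only noun phrases
        if not any(tag in NOUN_TAGS for _, tag in tokens):
            continue  # step 3: drop NPs without a noun child (e.g. "it")
        # step 4: strip determiners and everything else that is not ADJ/NOUN
        phrases.append([word for word, tag in tokens
                        if tag in NOUN_TAGS or tag in ADJ_TAGS])
    return phrases


print(key_phrases(CHUNKS))
```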

This might give an approximate guess.

Naveen
1

Compound or complex sentences may have more than one key concept.

You can use Stanford NLP or MaltParser, which can give you the dependency structure of a sentence. The output also includes part-of-speech tags and grammatical relations such as subject, verb, object, etc.

I think most of the time the object will be the key concept of the sentence.
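A small sketch of the "pick the object" heuristic, assuming the dependency parse is available as (head, relation, dependent) triples with Stanford-style labels such as `nsubj` and `dobj`. The triples are typed in by hand and `object_concept` is a hypothetical helper, not part of either parser.

```python
# Hand-written dependency triples for "I ate pie": (head, relation, dependent).
DEPS = [
    ("ate", "nsubj", "I"),
    ("ate", "dobj", "pie"),
]


def object_concept(deps, object_relations=("dobj",)):
    """Return the first dependent attached by an object relation, else None."""
    for _head, relation, dependent in deps:
        if relation in object_relations:
            return dependent
    return None


print(object_concept(DEPS))
```

Sentences without a direct object return `None` here, so some fallback (for example, the subject) would still be needed.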

Naveen
1

You should look at Google's Cloud Natural Language API. It's their NLP service.

https://cloud.google.com/natural-language/

Tom
1

"Key concept" is not a well-defined term in linguistics, but this may be a starting point: parse the sentence, find the subject in the parse tree or dependency structure that you get. (This doesn't always work; for example, the subject of "Is it raining?" is "it", while the key concept is likely "rain". Also, what's the key concept in "Are spaghetti and lasagna the same thing?")

This kind of problem (NLP + search) is more properly dealt with by methods such as LSA, but that's quite an advanced topic.

Fred Foo
  • @rockit: I wasn't the one who linked to the other question. It seems that what you want is doable up to a point. – Fred Foo Apr 04 '11 at 21:32
  • Very interesting, but I'm not working with a set of documents - only the query! – rockit Apr 04 '11 at 21:33
  • 1
    Bloody hell, "Is it raining?" was the first example I wanted to write. (As opposed to "I've just seen 2012."/"Is it interesting?"/"Not really.") But I'll throw in my second one, which is quite appropriately: "How dare you?" – biziclop Apr 04 '11 at 21:33
  • @biziclop, @rockit: "Empty" subjects may be avoided by filtering for words like "it" and choosing the main verb's object, or maybe even the verb itself, as the "key concept". – Fred Foo Apr 04 '11 at 21:35
  • @rockit - FWIW - you're not going to find an easy solution for this one. Most of these techniques rely on having a corpus of training data. – dfb Apr 04 '11 at 21:36
  • @spinning_plate: Parsers for English with pretrained models are widely available (NLTK, Stanford NLP). – Fred Foo Apr 04 '11 at 21:36
  • I have a number of long queries in my log files, will that work as a training set of data? – rockit Apr 04 '11 at 21:37
  • @larsmans - I'm talking about the topic classification, not the parser – dfb Apr 04 '11 at 21:37
  • @spinning_plate: but if search is the ultimate objective, then classification is not needed. The OP might want to lend extra weight to subject words and see if the retrieval performance goes up. – Fred Foo Apr 04 '11 at 21:39
  • I am also able to categorize the topics in the sentence by type. An example would be "food items", "drink items", "condiments", "silverware", etc. – rockit Apr 04 '11 at 21:39
  • @larsman - This is probably true, and probably the best solution to try off the bat. I was just mentioning that getting a ready-made solution for getting the 'topic' isn't something that is readily available in a toolkit to my knowledge. – dfb Apr 04 '11 at 21:42
  • @rockit: If I understand correctly that you want this for a search app, you might want to consider the following: parse your data, store and index fragments of parse trees as well as words, parse the query, search for fragments of parse trees as well as words. It's expensive in terms of storage, but a pretty common technique to get question answering working. – Fred Foo Apr 04 '11 at 21:42
  • @larsmans: My examples show that this isn't a very precise approach. "Is it raining?" and "Is it interesting?" are fundamentally different. Although, given certain conditions, it could be good enough. But in some cases, even choosing the 3 longest words might be good enough too. – biziclop Apr 04 '11 at 21:43
  • Isn't LSA similar to TF/IDF? How will that help in this instance - especially if "chicken" occurs in more documents than "turkey". In that case "turkey" would be ranked higher.... – rockit Apr 04 '11 at 21:54
  • The grammatical subject of a sentence is a syntactic (i.e. structural) concept. It's got nothing to do with what the _topic_ of a sentence is. Also, what are "methods such as LSA"? – jogojapan Nov 05 '12 at 03:24
1

On the most basic level, a question in English is usually in the form of <verb> <subject> ... ? or <pronoun> <verb> <subject> ... ?. This is by no means a good algorithm, especially considering that the subject could span several words, but depending on how sophisticated a solution you need, it might be a useful starting point.
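As a sketch of how crude this is in practice, the function below just picks the second or third token depending on whether the question starts with a question word. The `WH_WORDS` set (which also includes adverbs such as "how") and the `guess_subject` name are assumptions for the example, and a multi-word subject is reduced to its first word.

```python
# Assumption: a small set of question words that push the subject to slot 3.
WH_WORDS = {"what", "who", "whom", "which", "whose",
            "where", "when", "why", "how"}


def guess_subject(question):
    """Guess the subject from word position alone: <verb> <subject> ... or
    <question word> <verb> <subject> ..."""
    tokens = question.rstrip("?!. ").split()
    index = 2 if tokens and tokens[0].lower() in WH_WORDS else 1
    return tokens[index] if len(tokens) > index else None


print(guess_subject("Does chicken taste like turkey?"))
print(guess_subject("What does chicken taste like?"))
```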

If you need precision, ignore this answer.

biziclop
  • "Is it true that whales are mammals?" :p – Fred Foo Apr 04 '11 at 21:44
  • @larsmans On the most basic level... Trouble is, we have no idea what kind of precision the OP had in mind. Although the subject of this question is really "it", which refers to the second part of the compound sentence. – biziclop Apr 04 '11 at 21:47
  • Actually in English "the most basic level" is really extremely basic contrary to other languages, so that probably won't help much for RL problems – Voo Apr 04 '11 at 22:29
  • @Voo I don't have much hope either, it was more for the sake of showing a complete spectrum of options of different complexity and efficiency. This approach ranks pretty low on that spectrum. – biziclop Apr 04 '11 at 22:39
1

If you're willing to shell out money, http://www.connexor.com/ is supposed to be able to do this type of semantic analysis for a wide variety of languages, including English. I have never directly used their product, and so can't comment on how well it works.

btilly
1

There's an article about Parsing Noun Phrases in the MIT Computational Linguistics journal of this month: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00076

ZeoS
0

A simple solution is to tag your sentence with a part-of-speech tagger (e.g. from the NLTK library for Python), then match it against predefined part-of-speech patterns in which the position of the main subject is clear.
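A minimal sketch of the pattern-matching step, assuming Penn Treebank tags. The three patterns, the hand-tagged input, and the `main_subject` name are invented for illustration; a usable pattern set would have to be much larger.

```python
# Each entry: (tag sequence the sentence must start with, index of the subject).
PATTERNS = [
    (("WP", "VBZ", "NN"), 2),   # "What does chicken ..."
    (("VBZ", "DT", "NN"), 2),   # "Is the soup ..."
    (("VBZ", "NN"), 1),         # "Does chicken ..."
]

# Hand-written tagger output for "Does chicken taste like turkey?"
TAGGED = [("Does", "VBZ"), ("chicken", "NN"), ("taste", "VB"),
          ("like", "IN"), ("turkey", "NN"), ("?", ".")]


def main_subject(tagged):
    """Return the word in the subject slot of the first matching pattern."""
    tags = tuple(tag for _, tag in tagged)
    for pattern, slot in PATTERNS:
        if tags[:len(pattern)] == pattern:
            return tagged[slot][0]
    return None  # no pattern matched


print(main_subject(TAGGED))
```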

Andrey Sboev
  • I doubt that a set of chunking patterns on POS tags is able to reliably find the subject of a sentence. Besides, parsers that can do this are available. – Fred Foo Apr 04 '11 at 21:33
  • My sentence is grammatically tagged with OpenNLP - but not for the subject of the sentence – rockit Apr 04 '11 at 21:52
0

One option is to look into something like this as a first step:

http://www.abisource.com/projects/link-grammar/

How you derive the topic from these links is another problem in itself, but since AbiWord uses it to detect grammatical problems, you might be able to use it to determine the topic.

Glenn
-3

By "primary topic" you're referring to what is termed the subject of the sentence.

The subject can be identified by understanding a sentence through natural language processing.

The answer to this question is the same as that for How to determine subject, object and other words? - this is a currently unsolved problem.

Community
Jon Cram
  • It's pretty close to being solved, in the sense that for well-studied languages, parsers are on a par with professional linguists. – Fred Foo Apr 04 '11 at 21:30
  • Probably. But the question you linked to is more generalized and has only answers directing the asker to do more research. – rockit Apr 04 '11 at 21:32
  • 1
    The grammatical subject of a sentence is definitely not the same as its topic. – jogojapan Nov 05 '12 at 03:24