
Say I have the sentence: "John likes to take his pet lamb in his Lamborghini Huracan more than in his Lamborghini Gallardo" and I have a dictionary containing "Lamborghini", "Lamborghini Gallardo" and "Lamborghini Huracan". What's a good way of extracting the matching terms, getting "Lamborghini Gallardo" and "Lamborghini Huracan" as phrase matches and "Lamborghini" and "lamb" as other partial matches, while giving preference to the phrase matches over individual keywords?

Elasticsearch provides exact term match, match phrase, and partial matching. Exact term would obviously not work here, and neither would match phrase, since the whole sentence is treated as the phrase in this case. I believe partial match would be appropriate if the sentence contained only the keywords of interest. Going through previous SO threads, I found "proximity for relevance", which seems relevant, although I'm not sure it is the best option since it requires setting a threshold. Or are there simpler or better alternatives to Elasticsearch (which seems aimed more at full-text search than at simple keyword matching against a database)?

dter

1 Answer


It sounds like you'd like to perform keyphrase extraction from your documents using a controlled vocabulary (your dictionary of industry terms and phrases).

[Italicized terms above to help you find related answers on SO and Google]


This level of analysis takes you a bit out of the search stack and into the natural-language processing stack. Since NLP tends to be resource-intensive, it usually takes place offline or, in the case of search applications, at index time.

To implement this, you'd:

  1. Integrate a keyphrase extraction tool into your search-indexing code to generate a list of recognized key phrases for each document.
  2. Index those key phrases as shingles into a new Elasticsearch field.
  3. Include this shingled keyphrase field in the list of fields searched at query-time — most likely with a score boost.
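As a rough illustration of steps 2 and 3, here is a minimal sketch of the request bodies, written as Python dicts. The field names (`body`, `keyphrases`), the filter and analyzer names, the shingle sizes, and the boost value are all assumptions for illustration; the `text` field type and the typeless `mappings` layout assume a recent Elasticsearch version, so older versions would need adjusting.

```python
import json

# Index settings and mapping: a custom analyzer that emits word shingles
# (2- and 3-word n-grams, plus single words) for the key-phrase field.
# All names here are hypothetical; rename them to suit your index.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "filter": {
                "keyphrase_shingles": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3,
                    "output_unigrams": True,
                }
            },
            "analyzer": {
                "keyphrase_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "keyphrase_shingles"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text"},
            "keyphrases": {"type": "text", "analyzer": "keyphrase_analyzer"},
        }
    },
}


def build_query(user_text, boost=3):
    """Search the main text and the key-phrase field, boosting the latter."""
    return {
        "query": {
            "multi_match": {
                "query": user_text,
                "fields": ["body", f"keyphrases^{boost}"],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(INDEX_BODY, indent=2))
    print(json.dumps(build_query("lamborghini huracan"), indent=2))
```

The first dict would be sent when creating the index and the second as the search body; the boost of 3 is an arbitrary starting point that you would tune against your own relevance judgments.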

For a quick-win tool to help you with controlled keyphrase extraction, check out KEA (written in Java).

(You could probably also write your own, but if you're hoping to extract uncontrolled key phrases (those not in your dictionary) as well, a trained extractor will serve you better. More tools here.)
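If you do write your own, a minimal sketch of a controlled-vocabulary extractor might look like the following. It is not how KEA works; it is a plain greedy longest-match over word tokens, and the function name `extract_keyphrases` is made up for illustration. It matches whole tokens only, so it prefers "Lamborghini Huracan" over "Lamborghini" but will not pick up partial or misspelled matches.

```python
import re


def extract_keyphrases(text, vocabulary):
    """Greedy longest-match lookup of vocabulary terms in text.

    Both the text and the vocabulary are lowercased and split into word
    tokens; at each position the longest vocabulary phrase wins.
    """
    tokens = re.findall(r"\w+", text.lower())
    vocab = {tuple(re.findall(r"\w+", term.lower())) for term in vocabulary}
    vocab.discard(())  # ignore entries that contain no word characters
    max_len = max((len(term) for term in vocab), default=0)

    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest possible phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in vocab:
                matches.append(" ".join(candidate))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no vocabulary term starts at this token
    return matches


if __name__ == "__main__":
    sentence = ("John likes to take his pet lamb in his Lamborghini Huracan "
                "more than in his Lamborghini Gallardo")
    terms = ["Lamborghini", "Lamborghini Gallardo", "Lamborghini Huracan"]
    print(extract_keyphrases(sentence, terms))
    # ['lamborghini huracan', 'lamborghini gallardo']
```

The extracted phrases could then be fed into the key-phrase field at index time. Handling typos or sub-word matches such as "lamb" would need an extra fuzzy-matching step that this sketch leaves out.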

Peter Dixon-Moses
  • Thank you for your informative reply, Peter. Since my vocabulary contains the keywords and phrases I want to match against (rather than documents), shingles would not be required for the database, no? I was thinking of using shingles the other way round though: create shingles from the user's search terms, then either run 'normal' exact-match queries with these shingles to identify bigrams or trigrams, or use partial matching and boost the score for the longest matched shingles (done by default). That way I can also check for spelling mistakes as well as identify keyphrases. Does this make sense? – dter Sep 14 '16 at 21:23
  • Correct. The shingles would help you on the query side to avoid matching single terms within your keyphrases. You could probably do something similar with phrase queries, but since you can't control what terms are entered by searchers, maybe shingles would get you closer to where you want to be. – Peter Dixon-Moses Sep 14 '16 at 21:52
  • This way I'm hoping shingles would enable bigram/trigram matching, a fuzzy filter would allow for typos, boosting would give the closest match, and synonyms would, well, match synonyms. I wonder if this approach is simpler than, and at least as effective as, training a named entity recognition model... – dter Sep 14 '16 at 22:08