
I have been struggling for a while with a search function that I'm building for a cooking blog.

In Dutch (similar to German), one can chain arbitrarily many words together to create a new compound word. This has been giving me a headache when I want search results to include compound words that contain a relevant word. It's kind of like a reverse Scunthorpe problem: I actually want to match certain words inside other words, but only sometimes.

For example, the word for rice in Dutch is rijst. Brown rice is zilvervliesrijst and pandan rice is pandanrijst. If I want these two to show up in search results, I have to check whether the search term appears inside a word, rather than whether it equals the word.

However, this immediately causes issues for smaller words that can appear inside other words by accident. For example, the word for egg is ei, while leek is prei. Onion is ui, while Brussels sprouts are spruitjes. You can see that accepting substring matches against the search string could cause major problems.
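To make both sides of the problem concrete, here is a minimal sketch (in Python, with the Dutch words from the examples above) of what naive substring matching does:

```python
# Naive substring search: it finds the genuine rice compounds, but it
# also reports "accidental" containments such as ei inside prei.
vocabulary = ["zilvervliesrijst", "pandanrijst", "prei", "spruitjes", "ui", "ei"]

def substring_search(query, words):
    return [w for w in words if query in w]

print(substring_search("rijst", vocabulary))  # the two rice compounds: good
print(substring_search("ei", vocabulary))     # prei matches as well: bad
```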

I initially tried to grade what percentage of a word the search string covers, but this also causes issues, as prei is 50% ei, while zilvervliesrijst is only about 30% rijst. This also makes using a Levenshtein distance to solve this very impractical.
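A sketch of that coverage scoring, showing why it ranks things backwards (the exact figure for rijst in zilvervliesrijst is 5/16 = 0.3125):

```python
def coverage(query, word):
    # Fraction of the word taken up by the query; 0 if not contained.
    return len(query) / len(word) if query in word else 0.0

print(coverage("ei", "prei"))                 # 0.5    — accidental match scores high
print(coverage("rijst", "zilvervliesrijst"))  # 0.3125 — genuine match scores low
```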

My current solution is as follows: I have an SQL table of ingredients that is used to automatically calculate the price and calorie totals for each recipe from its ingredient list, and I have used it to add all relevant synonyms to the name column. Basically, zilvervliesrijst is listed as zilvervliesrijst|rijst. I also use this column to store both the plural and singular forms of a term, so I don't have to test for those.
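With the pipe-separated column, matching can be exact rather than substring-based; a sketch of the lookup (the column layout is as described above, the function name is mine):

```python
def name_matches(query, name_column):
    # Exact match against any pipe-separated synonym, so "ei" no longer
    # matches "prei", but "rijst" still finds brown rice.
    return query in name_column.split("|")

print(name_matches("rijst", "zilvervliesrijst|rijst"))  # True
print(name_matches("ei", "prei"))                       # False
```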

However, this excludes compound words anywhere other than the ingredient list. Fields such as title, cuisine, cooking equipment, and dietary preferences still have this problem.

My question is this: is there a non-library-esque method within the field of computer science that addresses this? Or am I doomed to include every single possible searchable compound word and its singular components every time I want to add a new recipe? I just hope that's not the case, as that would massively increase the processing time required for each additional entry.

  • I think what you are looking for is generally referred to as "stemming" - reducing a word down to its "stem", or "lemma". This is a hard problem - natural languages are complex, full of exceptions and contradictions, so simple rules quickly fail. You are definitely better off looking for an existing full-text search package that has this functionality built in for the languages you're interested in; often these are built into data stores as a special kind of index. – IMSoP Feb 07 '23 at 09:57
  • Hmm, that's unfortunate. The thing is that some recipes are relatively unique, so most title words will likely not be in there. Things like chocolate caramel cake would become chocoladekarameltaart, which is too specific to be in such libraries. Still, it's probably the best approach. But I'm definitely gonna look into stemming, maybe there are a few tips in that theory that could help me at least a little. Thanks! :) – Pepijn Ekelmans Feb 07 '23 at 10:01
  • You don't need the tool to know the word "chocoladekarameltaart", only to consider "chocolade", "karamel", and "taart" as likely components when it analyses it. I don't know exactly how such tools work, but I'd be very surprised if a full-text search optimised for Dutch failed to recognise those as components. – IMSoP Feb 07 '23 at 10:19
  • That's true, but there are some terms that don't exist in the Dutch dictionary. For example, lemon meringue pie is still often titled as lemon-meringuetaart. Thing is that this might end up having to become a balance between an algorithmic approach and an approach based on exceptions. – Pepijn Ekelmans Feb 07 '23 at 10:44
  • The tools *already will be* that compromise. Yes, they might be missing some terms, but they might have a way to add those to the lexicon and have them interact with the algorithm. I really don't think this is something you want to build yourself, other than purely as a learning exercise; you want to "stand on the shoulders of giants" who have already spent weeks on the problem. – IMSoP Feb 07 '23 at 10:49

1 Answer


I think it will be hard to do this well without using a library, and probably also a dictionary (which may be bundled as part of the library).

There are really two somewhat orthogonal problems:

  • Splitting compound words into their constituent parts.
  • Identifying the stem of a simple (non-compound) word. (For example, removing plural markers and inflections.) This is often called "stemming" but that's not really the best strategy; you'll also find the rather awkward term "lemmatization".

Both of these tasks are plagued with ambiguities in all the languages I know about. (A German example, taken from an arXiv paper describing the German-language morphological analyser DEMorphy, is "Rohrohrzucker", which means "raw cane sugar" -- Roh Rohr Zucker -- but could equally be split into Rohr Ohr Zucker, pipe-ear sugar, if there were such a thing.)

The basic outline of how these tasks can be done in reasonable time (with lots of CPU power) is:

  1. Use n-gram analysis to figure out plausible word division points.
  2. Lemmatize each candidate component word to get plausible POS (part-of-speech) markers.
  3. Use a trained machine-learning model (or something of that form) to reject non-sensical (or at least highly improbable) divisions.
  4. At each step, check possible corner cases in a dictionary (of corner cases).

That's just a rough outline, of course.
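As a toy illustration of the dictionary-lookup side of that outline (nothing like a production splitter: no n-gram scoring, no ML re-ranking, and the lexicon here is invented for the example), a recursive greedy decomposition might look like this:

```python
def split_compound(word, lexicon, min_len=3):
    """Find decompositions of `word` into known lexicon entries.
    Toy version: returns all decompositions instead of scoring them,
    and never splits a word that is itself in the lexicon."""
    if word in lexicon:
        return [[word]]
    results = []
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            for rest in split_compound(tail, lexicon, min_len):
                results.append([head] + rest)
    return results

lexicon = {"chocolade", "karamel", "taart", "rijst", "zilvervlies"}
print(split_compound("chocoladekarameltaart", lexicon))
print(split_compound("zilvervliesrijst", lexicon))
```

The DEMorphy example above shows why a real system needs the scoring steps: a rich lexicon would return several decompositions here, and something has to reject the pipe-ear-sugar readings.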

I was able to find, without too much trouble, a couple of fairly recent discussions of how to do this with Dutch words. I'm not even vaguely competent to discuss the validity of these papers, so I'll leave you to do the search yourself. (I used the search query "split compound words in Dutch".) But I can tell you two things:

  • The problem is being worked on, but not necessarily to produce freely-available products.

  • If you choose to tackle it yourself, you'll end up devoting quite a lot of time to the project, although you might find it interesting. If you do succeed, you'll end up with a useful product and the beginning of a thesis (perhaps useful if you have academic ambitions).

However you choose to do it, you're best off only doing it once for each new recipe. Analyse the contents of each recipe as it is entered, to build a list of search terms which you can store in your database along with the recipe. You will probably also want to split and lemmatize search queries, but those are generally short enough that the CPU time is reasonable. Even so, consider caching the analyses in order to save time on common queries.
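A sketch of that insert-time indexing, with a stand-in splitter (the real one would be whatever compound splitter/lemmatizer you settle on; `index_recipe` and `naive_split` are hypothetical names for this example):

```python
def naive_split(word):
    # Stand-in for a real compound splitter: recognises two toy suffixes.
    return [s for s in ("rijst", "taart") if word.endswith(s) and word != s]

def index_recipe(recipe, split_word):
    # Run once when the recipe is saved; store the resulting terms
    # alongside the recipe (e.g. in a search-terms table).
    terms = set()
    for field in [recipe["title"], recipe["cuisine"], *recipe["ingredients"]]:
        for word in field.lower().split():
            terms.add(word)
            terms.update(split_word(word))
    return terms

recipe = {"title": "Pandanrijst met prei",
          "cuisine": "aziatisch",
          "ingredients": ["pandanrijst", "prei", "ui"]}
print(sorted(index_recipe(recipe, naive_split)))
```

A query for "rijst" then becomes an exact lookup against the stored terms, even though only "pandanrijst" appears in the recipe text.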

rici