How can I find the singular in the plural when some letters change?

Consider the following situation:

  • The German word Schließfach is a lockbox.
  • The plural is Schließfächer.

As you can see, the letter a has changed to ä. As a result, the first word is no longer a substring of the second one; from a regex point of view they are different strings.

Maybe my chosen tags below are off the mark, and maybe regex is not the right tool for this. I've seen that naturaljs (natural.NounInflector()) provides this functionality out of the box for English words. Are there similar solutions for the German language?

What is the best approach to finding the singular within the plural in German?

Wai Ha Lee
  • Did you try a regex with the 'u' flag? (https://javascript.info/regexp-unicode) – Robert Nov 12 '20 at 14:10
  • Of course, I tried it: https://regex101.com/r/6fSyqw/1 –  Nov 12 '20 at 14:16
  • Why don't you search for schließfächer and then replace all the German special characters? – Robert Nov 12 '20 at 14:21
  • Why should I do that if there is a better solution I do not know yet? –  Nov 12 '20 at 14:22
  • OK. I don't think there is one, but maybe someone will surprise me. You can remove or replace these characters before searching. – Robert Nov 12 '20 at 14:26
  • Did you see this? [String.prototype.normalize()](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) – Robert Nov 12 '20 at 14:28
  • Technically, schließfach and schließfächer are not the same; schließfaech would be more correct, and the actual English equivalent is Schliessfaech. So what you're asking for is quite subjective and not, as you say, a problem others would have encountered. Perhaps what you want is https://github.com/JakeBayer/FuzzySharp. Mind you, I haven't tried it with the test cases you have supplied. – Mick Nov 12 '20 at 23:58
  • I've edited the question; obviously, it was misleading. –  Nov 13 '20 at 07:05
  • Does this need to work with fantasy words? I.e. Blumtächer -> Blumtach ? If not, have a look at a dictionary approach. – chefhose Jun 08 '21 at 11:50
  • @chefhose an excellent question; maybe there is a hybrid solution? –  Jun 08 '21 at 11:53

2 Answers

I once had to build a text processor that parsed many languages, in registers ranging from very casual to very formal. One of the things it had to identify was whether certain words were related (for example, a noun in a title that was related to a list of things, sometimes labeled with a plural form).

IIRC, 70-90% of singular & plural word forms across all languages we supported had a "Levenshtein distance" of less than 3 or 4. (Eventually several dictionaries were added to improve accuracy because "distance" alone produced many false positives.) Another interesting find was that the longer the words, the more likely a distance of 3 or fewer meant a relationship in meaning.

Here's an example of the libraries we used:

const fastLevenshtein = require('fast-levenshtein');

console.log('Distances:')
console.log('Score 1:', fastLevenshtein.get('Schließfächer', 'Schließfach'));
// -> 3
console.log('Score 2:', fastLevenshtein.get('Blumtach', 'Blumtächer'));
// -> 3
console.log('Score 3:', fastLevenshtein.get('schließfächer', 'Schliessfaech'));
// -> 7
console.log('Score 4:', fastLevenshtein.get('not-it', 'Schliessfaech'));
// -> 12
console.log('Score 5:', fastLevenshtein.get('not-it', 'Schiesse'));
// -> 8


/**
 * Additional strategy for dealing with other various languages:
 *   "Deburr" the strings to omit diacritics before checking the distance:
 */

const deburr = require('lodash.deburr');
console.log('Deburred Distances:')
// Deburr each string first, then measure the distance between the results.
console.log('Score 1:', fastLevenshtein.get(deburr('Schließfächer'), deburr('Schließfach')));
// -> 2
console.log('Score 2:', fastLevenshtein.get(deburr('Blumtach'), deburr('Blumtächer')));
// -> 2
console.log('Score 3:', fastLevenshtein.get(deburr('schließfächer'), deburr('Schliessfaech')));
// -> 4


// Deburring folds "ä" to "a" and "ß" to "ss", so those characters no longer count as edits.
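
The distance check can also be wrapped in a small dependency-free helper. This is only a rough sketch: the names foldGerman, levenshtein and isLikelySameNoun are made up for illustration, the folding uses String.prototype.normalize() (mentioned in the comments on the question) instead of lodash, and the default threshold of 3 is just the rule of thumb from above, not a tuned value.

```javascript
// Hypothetical helper: strip diacritics, spell out "ß", and lower-case.
function foldGerman(word) {
  return word
    .normalize('NFD')        // "ä" becomes "a" + combining diaeresis
    .replace(/\p{M}/gu, '')  // drop the combining marks
    .replace(/ß/g, 'ss')
    .toLowerCase();
}

// Plain dynamic-programming Levenshtein distance using two rolling rows.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical check: treat two words as forms of the same noun when the
// folded distance is at most maxDistance (assumed default: 3).
function isLikelySameNoun(a, b, maxDistance = 3) {
  return levenshtein(foldGerman(a), foldGerman(b)) <= maxDistance;
}

console.log(isLikelySameNoun('Schließfach', 'Schließfächer')); // true
console.log(isLikelySameNoun('Schließfach', 'not-it'));        // false
```

Short, unrelated words such as 'Fach' and 'Dach' also pass this check (distance 1), which is exactly the kind of false positive that made us add dictionaries.
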
Dan Levy

You can use a stemmer (which is in fact a lemmatizer) from the nlp.js library, which has models for 40 languages.

const { StemmerDe } = require('@nlpjs/lang-de');

const stemmer = new StemmerDe();
console.log(stemmer.stemWord('Schließfach'));
console.log(stemmer.stemWord('Schließfächer'));
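
To make the idea concrete: two word forms are treated as the same noun when they reduce to the same stem. The following toy sketch only illustrates that comparison and is not the algorithm nlp.js uses; toyStem and the suffix list are assumptions made up for this example and will mis-stem many real German words.

```javascript
// Assumed, deliberately tiny list of plural-like endings, longest first.
const SUFFIXES = ['ern', 'er', 'en', 'e', 'n', 's'];

// Hypothetical toy stemmer: fold umlauts and "ß", then strip one suffix
// as long as at least four characters remain.
function toyStem(word) {
  const folded = word
    .toLowerCase()
    .replace(/ä/g, 'a')
    .replace(/ö/g, 'o')
    .replace(/ü/g, 'u')
    .replace(/ß/g, 'ss');
  for (const suffix of SUFFIXES) {
    if (folded.endsWith(suffix) && folded.length - suffix.length >= 4) {
      return folded.slice(0, -suffix.length);
    }
  }
  return folded;
}

// Both forms reduce to the same stem, so they count as the same noun.
console.log(toyStem('Schließfach') === toyStem('Schließfächer')); // true
```

With a real stemmer, the same equality check on the two stemWord() results is what tells you whether the singular is "in" the plural.
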
Jindřich