1

Does anyone know a regex that handles German words with umlauts?

const string = 'Aktivitäten und Ausflugsziele in der Nähe von keyword.'

function getTags ( string ) {
   let tags = []
   string = string.toLocaleLowerCase()
   tags = string.match(/\b(\w+)\b/g)
   return tags
}

The regex /\b(\w+)\b/g works perfectly for ASCII words. However, words containing umlauts get split, like this:

[ 'aktivit', 'ten', 'und', 'ausflugsziele', 'in', 'der', 'he', 'von', 'keyword' ]

I then tried the regex /\b(\w+[0-9a-zäöüÄÖÜ])\b/g, which gets closer to the expected result, but the match still stops right after the umlaut instead of at the end of the word.

[ 'aktivitä', 'ten', 'und', 'ausflugsziele', 'in', 'der', 'nä', 'he', 'von', 'keyword' ]

Does anyone know the correct regex to handle German umlauts? Expected output:

[ 'aktivitäten', 'und', 'ausflugsziele', 'in', 'der', 'nähe', 'von', 'keyword' ]
VebDav

5 Answers

4

As a first iteration, you could try /\b([0-9a-zA-ZäöüÄÖÜß]+)\b/g. Note that I added A-Z and ß to your character set and applied the + (one or more repetitions) quantifier to it. This still fails for edge cases, because word boundaries are defined in terms of \w, which doesn't include umlauts: a word such as Äpfel won't match in full because it starts with an umlaut, and the same happens to words that end with one. Additionally, what about other languages, for example a French "è"? I propose the following simple regex:

\p{L}+ - one or more Unicode letters; you might want to include the Unicode digit property as well. Note that you need the u (unicode) flag here.

You must get rid of the word boundaries. This is not an issue, however, because greedy matching ensures that words are never cut in the middle.

If you want to limit yourself to the German alphabet, you can use [a-zäöüß]+ instead (you have already lowercased the string).
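As a minimal sketch of the \p{L}+ approach, the question's function could look like the following. The function names and the second sample sentence are my own, and using \p{Nd} (decimal digits) is only one reading of "the digit Unicode property"; \p{N} would also accept characters such as ½.

```javascript
// Sketch: tokenize on runs of Unicode letters (the `u` flag is required for \p{...}).
function getTags(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.toLocaleLowerCase().match(/\p{L}+/gu) ?? [];
}

// Variant that also keeps decimal digits (assumption: \p{Nd} is what is wanted).
function getTagsWithDigits(text) {
  return text.toLocaleLowerCase().match(/[\p{L}\p{Nd}]+/gu) ?? [];
}

console.log(getTags('Aktivitäten und Ausflugsziele in der Nähe von keyword.'));
console.log(getTagsWithDigits('Zimmer 42 in der Straße'));
```

The first call should give the expected output from the question; the second shows that digits survive as their own tokens.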

Luatic
  • Thanks for pointing out the other non-ASCII characters. My simple example has no letters like à or é, but I decided to go with the Unicode version `/(\p{L}+)/gu`, and it works perfectly. – VebDav Oct 08 '22 at 14:29
2

Change /\b(\w+[0-9a-zäöüÄÖÜ])\b/g to /[0-9a-zäöüÄÖÜ]+/g. The + means "match one or more characters from [0-9a-zäöüÄÖÜ]". A space isn't in that set, so the match stops at the first space and the search continues with the next word.

const string = 'Aktivitäten und Ausflugsziele in der Nähe von keyword.'

function getTags(string) {
    let tags = []
    string = string.toLocaleLowerCase()
    tags = string.match(/[0-9a-zäöüÄÖÜ]+/g)
    return tags
}

console.log(getTags(string))
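One caveat with an explicit character class: letters that are not listed still split words. The sketch below uses a made-up sample sentence containing ß to illustrate this; the second pattern, with ß added, is my own variation and not part of the answer.

```javascript
// Character class from the answer above, which does not list ß.
const withoutEszett = /[0-9a-zäöüÄÖÜ]+/g;
// Same idea with ß added (an addition for illustration only).
const withEszett = /[0-9a-zäöüß]+/g;

// Hypothetical, already lowercased sample sentence containing ß.
const sample = 'die straße in der nähe';

const splitWords = sample.match(withoutEszett);
const wholeWords = sample.match(withEszett);

console.log(splitWords); // 'straße' is cut at the ß
console.log(wholeWords); // 'straße' stays in one piece
```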
Lukas
2

I suggest you switch to a Unicode-aware regex, which is widely supported by browsers today. That way all Unicode letters are supported, not just umlauts.

Use this regex:

/(?<=^|[^\p{L}\p{N}])[\p{L}\p{N}]+(?=[^\p{L}\p{N}]|$)/gu

Note the u (unicode) flag. Neither \w nor \b handles non-ASCII letters, so we use lookarounds built from Unicode property escapes instead.

Explanation:

(?<=^|[^\p{L}\p{N}]) - look behind for start of string OR any character not being in unicode category {Letter} or {Number}

[\p{L}\p{N}]+ - match a character belonging to unicode category {Letter} OR {Number}, one or more

(?=[^\p{L}\p{N}]|$) - look ahead for any character not being in unicode category {Letter} or {Number} OR end of string
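To try the pattern out, here is a small sketch that runs it on the lowercased sentence from the question and, for comparison, runs the bare character class without lookarounds. The variable names are mine. On this sample both produce the same list, since a greedy run of letters and digits already ends where the next character is neither.

```javascript
// Lowercased sample sentence from the question.
const text = 'aktivitäten und ausflugsziele in der nähe von keyword.';

// Pattern from the answer: letter/number runs guarded by lookarounds.
const withLookarounds = /(?<=^|[^\p{L}\p{N}])[\p{L}\p{N}]+(?=[^\p{L}\p{N}]|$)/gu;
// Comparison pattern (not from the answer): the same class without lookarounds.
const plain = /[\p{L}\p{N}]+/gu;

const lookaroundWords = text.match(withLookarounds);
const plainWords = text.match(plain);

console.log(lookaroundWords);
console.log(plainWords);
```

Lookbehind support arrived later than the u flag in some engines, so the shorter pattern may be the safer choice if older browsers matter.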

Poul Bak
1

A better solution than my first one, if you use a recent browser (or Node.js): the \p{L} property escape.

So

function getTags ( string ) {
    let tags = []
    string = string.toLocaleLowerCase()
    tags = string.match(/(\p{L}+)/gu)
    return tags
}

Old answer for record

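\w is strictly for identifier characters, that is, [a-zA-Z0-9_].

That being said, you can separate words by spaces instead, using [^ ]+ (any run of non-space characters) as a more general version of \w+, and using (?<=\s|^) and (?=\s|$) as replacements for \b. They mean, respectively, "there is whitespace or the beginning of the string before" and "there is whitespace or the end of the string after". Note that punctuation attached to a word, such as a trailing period, stays part of the match. A stricter class such as [^\s.,] would exclude it, but then the lookarounds would have to accept those punctuation characters as well.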

So, all together,

const string = 'Ich denke daß es gut ist. Aktivitäten und Ausflugsziele in der Nähe von keyword. Erdoğan had a ğ in his name.' 

function getTags ( string ) {
    let tags = []
    string = string.toLocaleLowerCase()
    tags = string.match(/(?<=\s|^)([^ ]+)(?=\s|$)/g)
    return tags
}

console.log(getTags(string));

Note that this works even with letters other than the ones we might think of at first (unlike solutions based on a fixed set such as [äöüß...]), and also with words that start with one of those non-ASCII letters, unlike solutions that still use \b. Replacing \w is not enough if the first or last letter of the word is a non-\w letter.
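The point about \b can be checked with a short sketch. The sample string is made up for illustration; the character class is extended with umlauts, but the \b anchors still only recognize ASCII word characters.

```javascript
// Hypothetical sample: one word starts with an umlaut, another with ö.
const sample = 'äpfel und öl';

// Extended character class, but still anchored with \b.
const withBoundaries = sample.match(/\b([0-9a-zA-ZäöüÄÖÜß]+)\b/g);
// Same class without the \b anchors.
const withoutBoundaries = sample.match(/[0-9a-zA-ZäöüÄÖÜß]+/g);

console.log(withBoundaries);    // leading umlauts are dropped
console.log(withoutBoundaries); // whole words
```

There is no word boundary between the start of the string and ä, because neither side is a \w character, so the first possible match begins at the p.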

chrslg
0

You can be specific about what you don't want and remove it with .replace(). The example filters with the following regex:

const rgx = /[.,!?;:@#$%*&_=+(){}|'"`~/[\\\^\-\]]/g

Note: \, ^, -, and ] are prefixed with \, which is necessary to escape them within a character class (i.e. between [...]). The g flag makes .replace() remove every match rather than only the first one. Also, you might want to remove - from the class, since some words may be hyphenated.

const string = 'Aktivitäten und Ausflugsziele in der Nähe von keyword.';

const getTags = str => str
  .toLowerCase()
  .replace(/[.,!?;:@#$%*&_=+(){}|'"`~/[\\\^\-\]]/g, '')
  .split(' ');

console.log(getTags(string));
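One detail worth checking with the .replace() approach is the g flag: when the regex is not global, String.prototype.replace only replaces the first match. The sketch below uses a shortened punctuation class and a made-up sentence, both chosen purely for illustration, and splits on runs of whitespace so that repeated spaces do not produce empty tags.

```javascript
// Shortened, hypothetical punctuation class (no g flag on purpose).
const punctuation = /[.,!?;:]/;
// Hypothetical sample with several punctuation marks.
const input = 'hallo, welt. wie geht es?';

// Without the g flag only the first match (the comma) is removed.
const firstOnly = input.replace(punctuation, '');

// With the g flag every match is removed; split on whitespace runs afterwards.
const tags = input
  .replace(new RegExp(punctuation.source, 'g'), '')
  .split(/\s+/);

console.log(firstOnly);
console.log(tags);
```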
zer00ne