JavaScript regex treats Swedish characters as special characters and matches incorrectly

Question

I am working on a JavaScript feature that highlights search results. When searching for a word such as 'sea' in a sentence such as 'the sea causes me nausea in this season', I want to highlight the word 'sea' itself and any occurrence where it acts as a prefix, as in 'season'. I do not want to highlight 'sea' when it appears as a suffix, as in 'nausea', or in the middle of a word, as in 'disease'.

To achieve this, I am using the regular expression /\bsea/gmi, which works perfectly with English characters. However, it fails to produce the desired results with Swedish characters such as 'ä', 'å', and 'ö'. For example, if the search word is 'gen', the suffix 'gen' in the word 'vägen' is incorrectly highlighted. It seems that the regular expression treats these characters as special characters or something similar. I even tried adding the Unicode flag u, but that didn't help either.

Since my expertise lies mainly in C#, I'm not familiar with how JavaScript behaves in this context. I would greatly appreciate any insights or guidance on how JavaScript handles these situations or how to work around this problem.

markalex · Accepted Answer · 2023-05-24T06:28:11.260

3

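JavaScript's regex engine doesn't change the behavior of \b when the u flag is present: \b is defined in terms of \w, which stays ASCII-based ([A-Za-z0-9_]), so 'ä' counts as a non-word character. Luckily, you can imitate a Unicode-aware \b using Unicode property escapes.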
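As a small illustration of the point above (a sketch, with variable names chosen only for this example), the pattern from the question matches inside 'vägen' both with and without the u flag:

```javascript
// In JavaScript, 'ä' is not matched by \w, so \b sees a word boundary
// between 'ä' and 'g' in 'vägen'.
const word = 'vägen';

// Without the u flag: matches the trailing 'gen'.
const withoutU = /\bgen/gi.test(word);

// With the u flag: \b keeps its ASCII-based definition, so it still matches.
const withU = /\bgen/giu.test(word);

console.log(withoutU, withU); // true true
```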

In this exact case your regex would look like this: /(?<![\p{L}\p{N}_])gen/gmiu.

Here we check (using negative lookbehind) that gen is not immediately preceded by any of:

\p{L}: a letter (in any language),
\p{N}: a number character (in any script),
_: the underscore.

Basically, [\p{L}\p{N}_] is a Unicode-aware alternative to \w; the \p{...} escapes require the u flag. Note that this is the default behavior in some other regex engines, for example PCRE.
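A minimal sketch of how the lookbehind could be wired into the highlighting feature from the question. The helper names escapeRegExp and highlightPrefix are hypothetical and not part of any library; escaping the search term is an added assumption so that user input containing regex metacharacters does not break the pattern.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wrap every occurrence of `term` that starts a word
// (or is a whole word) in <mark> tags, using the Unicode-aware lookbehind.
function highlightPrefix(sentence, term) {
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}_])${escapeRegExp(term)}`,
    'giu' // g: all matches, i: case-insensitive, u: enables \p{...}
  );
  return sentence.replace(pattern, (match) => `<mark>${match}</mark>`);
}

console.log(highlightPrefix('the sea causes me nausea in this season', 'sea'));
// the <mark>sea</mark> causes me nausea in this <mark>sea</mark>son

console.log(highlightPrefix('vägen till genen', 'gen'));
// vägen till <mark>gen</mark>en
```

Lookbehind assertions and \p{...} escapes both need a reasonably modern JavaScript engine, so it may be worth checking the target browsers before relying on this.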

Demo here.

And in the general case, \b can be replaced with /(?<![\p{L}\p{N}_])(?=[\p{L}\p{N}_])|(?<=[\p{L}\p{N}_])(?![\p{L}\p{N}_])/gmu.

Demo here.
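The general replacement can be assembled from a string so the character class is written only once. This is a sketch; the names wordChar, boundary, and wholeWord are made up for the example.

```javascript
// Unicode-aware stand-in for \w, as used in the answer.
const wordChar = '[\\p{L}\\p{N}_]';

// Unicode-aware stand-in for \b: a position where exactly one side is a
// word character. Wrapped in a non-capturing group so it can be embedded.
const boundary =
  `(?:(?<!${wordChar})(?=${wordChar})|(?<=${wordChar})(?!${wordChar}))`;

// Match 'gen' only as a whole word.
const wholeWord = new RegExp(`${boundary}gen${boundary}`, 'giu');

const text = 'vägen gen genen';
console.log(text.match(wholeWord)); // [ 'gen' ]
console.log(text.match(/\bgen\b/gi)); // [ 'gen', 'gen' ] (also hits 'vägen')
```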

edited May 24 '23 at 06:28

answered May 24 '23 at 06:14


I had fixed it by changing the \b into (\s|\n|^). I tried playing a little with Unicode properties, but it looks like I missed some things. I'll implement this instead. Thank you! – Mohamad Hammash May 24 '23 at 11:38
@MohamadHammash, be advised that `\s|\n|^` doesn't include quotes, for example. And many more punctuation marks. Could your troubles with this solution be caused by missing `u` flag in your attempts? – markalex May 24 '23 at 12:00

@MohamadHammash, one more thing: `\s` includes `\n`. So your regex is effectively `\s|^` – markalex May 24 '23 at 12:21
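Both points from the comments above can be checked quickly. This is only a sketch; the sample sentence is made up for the example.

```javascript
// \s already matches a newline, so (\s|\n|^) is equivalent to (\s|^).
const newlineIsWhitespace = /\s/.test('\n');
console.log(newlineIsWhitespace); // true

// A search term that directly follows a quote is missed by (\s|^),
// because '"' is neither whitespace nor the start of the string.
const sample = 'the "sea" is calm';
const whitespaceVersion = /(\s|^)sea/i.test(sample);
const lookbehindVersion = /(?<![\p{L}\p{N}_])sea/iu.test(sample);

console.log(whitespaceVersion); // false
console.log(lookbehindVersion); // true
```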

score -2 · Answer 2 · answered May 24 '23 at 05:35

You can change your regular expression to handle Swedish characters like the following:

const searchTerm = 'sea';
const sentence = 'the sea causes me nausea in this season vägen';

const pattern = new RegExp(`\\b${searchTerm}|\\b${searchTerm}[äåöÄÅÖ]\\w*`, 'gmi');
const highlightedSentence = sentence.replace(pattern, (match) => `<mark>${match}</mark>`);

console.log(highlightedSentence);

\b${searchTerm}[äåöÄÅÖ]\w* matches the search term (here 'sea') followed by a Swedish character and any further word characters
The gmi flags make the search global, multiline, and case-insensitive
The mark tag is used to highlight the text

Suggested regex doesn't handle word `vägen` with search term `gen` correctly. Consider this [demo](https://regex101.com/r/NnwDUc/1) — markalex, May 24 '23 at 06:22
