Remove all special characters but not accented letters

Question

I need to remove every symbol from a string except accented letters in Go. My code instead removes all symbols, including the accented letters:

str := "cafè!?"
reg, err := regexp.Compile(`[^\w]`)
if err != nil {
    log.Fatal(err)
}
str = reg.ReplaceAllString(str, " ")

I expect the following output:

cafè

But the output with my code is:

caf

I want to include è, é, à, ò, ì (and of course all letters from a to z and numbers from 0 to 9)

How can I do this? Thanks for your help.

Could you provide example input and output? What do you want to include as an "accented letter"? Is a Chinese character, like 的, a letter, or do you want to exclude it? — Dietrich Epp, Dec 13 '21 at 16:52
I think `\w` includes accented letters. Could you clarify what you mean by accented letters? — Moein Kameli, Dec 13 '21 at 16:55
@MoeinKameli, in Go [it does not](https://pkg.go.dev/regexp/syntax). — kostix, Dec 13 '21 at 16:55
I think the question needs further clarification. First, Unicode has lots of accents; second, there are two ways to represent an accented character (precomposed, or as a base letter followed by a combining mark); third, accents are part of a broader class of so-called diacritical marks, which, for instance, includes the cedilla. I'm not sure we can robustly cover the second point with REs alone; we'd need to do so-called "normalization" first, such as [NFC](https://unicode.org/reports/tr15/#Norm_Forms). IOW, I would say there's a need to learn some things about Unicode first, and then to formulate a more precise question. — kostix, Dec 13 '21 at 17:02
…But might it be that we're dealing with [an XY Problem](https://meta.stackexchange.com/a/66378) here, and the real case was something like "the need to detect weird characters in a string one should not use in a password"? — kostix, Dec 13 '21 at 17:03
Sorry for lack of info, I just added something more specific, I need to do a grep on a long string, but the substring 'cafè!', and 'cafè' are different for example, so I've tried to delete symbols before — Ella, Dec 13 '21 at 17:10
does [this](https://stackoverflow.com/questions/20690499/concrete-javascript-regex-for-accented-characters-diacritics) answer your question? — Moein Kameli, Dec 13 '21 at 17:15
Well, thinking about the question a bit more, I'm inclined to think that Elisa got tripped up by the fact that in Go, `\w` matches only _ASCII_ "word characters", as [explicitly stated in the docs](https://pkg.go.dev/regexp/syntax). I'd also say that `\w` is not a good way to match "letters", because that token was invented by Perl to mean "word character", and there it meant the symbols that can be used in identifiers of the C programming language, so it includes `_` as well as ASCII letters and digits. Thus I think the OP actually wants the "Letters" Unicode character class. The regexp will then read `[^\pL]`. — kostix, Dec 13 '21 at 17:54
More info on [Unicode character classes](https://en.wikipedia.org/wiki/Unicode_character_property). — kostix, Dec 13 '21 at 18:01
It's not clear what you mean by _`accented letters`_. Extended ASCII, or do you intend to manipulate Unicode? — sln, Dec 13 '21 at 19:25
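To make kostix's point about `\w` concrete, here is a minimal sketch comparing the pattern from the question with a Unicode-aware one. The variable names are made up for illustration, and adding `\p{N}` (Unicode numbers) to the suggested `\pL` is an assumption based on the asker also wanting to keep digits.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWord is the pattern from the question. In Go's RE2 syntax,
// \w is ASCII-only: it is equivalent to [0-9A-Za-z_].
var asciiWord = regexp.MustCompile(`[^\w]`)

// lettersDigits is the Unicode-aware variant (name is hypothetical):
// \p{L} matches any Unicode letter, \p{N} any Unicode number.
var lettersDigits = regexp.MustCompile(`[^\p{L}\p{N}]`)

func main() {
	s := "caf\u00e8!?" // "cafè!?" with a precomposed è

	fmt.Println(asciiWord.ReplaceAllString(s, ""))     // prints: caf
	fmt.Println(lettersDigits.ReplaceAllString(s, "")) // prints: cafè
}
```

Note that the second pattern also drops the underscore, which `\w` would have kept.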

score 1 · Answer 1 · answered Dec 13 '21 at 17:17

To include è, é, à, ò, ì, just add them to the regex: `[^\wèéàòìÈÉÀÒÌ]`

You might also use `[^\d\p{Latin}]`, but that will keep more characters than the explicit list does.

`\d` matches digits, and `\p{Latin}` is the Unicode class for all characters of the Latin script, including precomposed accented letters.

For example:

re := regexp.MustCompile(`[^\d\p{Latin}]`)
fmt.Println(re.ReplaceAllString(`Test123éËà-ŞŨğБла通用`, ""))

Will print:

Test123éËàŞŨğ
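For comparison, here is a small sketch of the first suggestion, the explicit list. The variable name `keepListed` and the sample input are made up for illustration. It shows the two things to watch for with this approach: any accented letter not in the list (here ù) is still removed, and `\w` keeps the underscore.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepListed removes every character that is neither an ASCII word
// character (\w, i.e. [0-9A-Za-z_]) nor one of the accented letters
// listed explicitly in the pattern.
var keepListed = regexp.MustCompile(`[^\wèéàòìÈÉÀÒÌ]`)

func main() {
	// ù is not in the list, so it is removed; _ is part of \w, so it stays.
	fmt.Println(keepListed.ReplaceAllString("cafè!? più_1", "")) // prints: cafèpi_1
}
```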

answered Dec 13 '21 at 17:17

rustyx


This will not work correctly on decomposed characters. For example, café could be written `"caf\u00e9"` or it could be written `"cafe\u0301"`. – Dietrich Epp Dec 13 '21 at 17:25
Of course not. But that's not the question, is it? – rustyx Dec 13 '21 at 17:27
The string `"e\u0301"` is one character. The way the question is worded, I would expect it to be treated equivalently to the string `"\u00e9"`, which is also one character (the same character). – Dietrich Epp Dec 13 '21 at 17:30
Technically you're correct, but `e\u0301` is extremely rarely used, so much so that it can be safely omitted. The output in the question also demonstrates the `é` being dropped in its entirety, suggesting a single code point `\u00e9`. – rustyx Dec 13 '21 at 17:40
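The difference discussed in these comments can be reproduced with the standard library alone. The following is a rough sketch, not part of the original answer: `latinWithMarks` is a hypothetical variant that also keeps combining marks (`\p{M}`). The combining acute accent U+0301 does not belong to the Latin script, so the answer's pattern strips it from the decomposed form.

```go
package main

import (
	"fmt"
	"regexp"
)

// latinOnly is the pattern from the answer.
var latinOnly = regexp.MustCompile(`[^\d\p{Latin}]`)

// latinWithMarks (hypothetical) additionally keeps combining marks.
// Caveat: it keeps every mark, even one that follows a removed character.
var latinWithMarks = regexp.MustCompile(`[^\d\p{Latin}\p{M}]`)

func main() {
	precomposed := "caf\u00e9!" // é as the single code point U+00E9
	decomposed := "cafe\u0301!" // e followed by combining acute U+0301

	// %+q prints non-ASCII runes as escapes, which makes the code points visible.
	fmt.Printf("%+q\n", latinOnly.ReplaceAllString(precomposed, ""))     // "caf\u00e9"
	fmt.Printf("%+q\n", latinOnly.ReplaceAllString(decomposed, ""))      // "cafe"
	fmt.Printf("%+q\n", latinWithMarks.ReplaceAllString(decomposed, "")) // "cafe\u0301"
}
```

Normalizing the input to NFC before matching, as suggested in the comments on the question, is the other way to handle this; it needs a package outside the standard library, so it is not shown here.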

Wiktor Stribiżew · Answer 2 · 2021-12-15T12:52:58.580

1

All "special" characters here are punctuation (and I assume also symbol) chars, so use

[\p{P}\p{S}]+

If you want to remove every character that is not a letter, use

\P{L}+

See regex demo #1 and regex demo #2. Here,

`\p{P}` - any punctuation character (commas, dots, etc.)
`\p{S}` - any symbol (mathematical, currency, etc.)
`\P{L}` - any character other than a Unicode letter (so digits are removed as well)
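A quick way to see how the two patterns differ in Go is to run both on the same input. This is only a sketch; the variable names and the sample string are made up for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// punctSymbols matches runs of punctuation and symbol characters.
var punctSymbols = regexp.MustCompile(`[\p{P}\p{S}]+`)

// nonLetters matches runs of anything that is not a Unicode letter,
// which includes digits and whitespace.
var nonLetters = regexp.MustCompile(`\P{L}+`)

func main() {
	s := "cafè!? 100€"

	fmt.Println(punctSymbols.ReplaceAllString(s, "")) // prints: cafè 100
	fmt.Println(nonLetters.ReplaceAllString(s, ""))   // prints: cafè
}
```

The first pattern leaves spaces and digits alone, so it is the closer match if digits must survive.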

edited Dec 15 '21 at 12:52

answered Dec 13 '21 at 22:59

Wiktor Stribiżew


addendum: unicode char classes can also be written without curly brackets when the class name is a single letter: `\pP\pS` – blackgreen Dec 14 '21 at 07:01
@blackgreen However, `\p{X}` is a more portable way, e.g. in ECMAScript 2018+ `\pP` would not work. – Wiktor Stribiżew Dec 14 '21 at 07:59

Dietrich Epp · Answer 3 · 2021-12-13T17:33:27.510

You can use a Unicode text segmentation library to iterate over grapheme clusters, and check that the first rune in each grapheme cluster has the right category (letter or digit).

import (
    "strings"
    "unicode"

    "github.com/rivo/uniseg"
)

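// stripSpecial keeps only the grapheme clusters whose first rune is a
// letter or a digit; all other clusters (punctuation, symbols, spaces)
// are dropped.
func stripSpecial(s string) string {
    var b strings.Builder
    gr := uniseg.NewGraphemes(s)
    for gr.Next() {
        // The first rune is the base character; any further runes in
        // the cluster are accent marks or other modifiers.
        r := gr.Runes()[0]
        if unicode.IsLetter(r) || unicode.IsDigit(r) {
            b.WriteString(gr.Str())
        }
    }
    return b.String()
}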

The code works by first breaking the string into grapheme clusters:

"cafè!?" -> ["c", "a", "f", "è", "!", "?"]

Each grapheme cluster may contain multiple Unicode code points. The first code point determines the type of character, and the remaining code points (if any) are accent marks or other modifiers. So we filter and concatenate:

["c", "a", "f", "è"] -> "cafè"

This will pass through any accented or unaccented letters and digits, no matter how they are normalized, and no matter which accents they carry (including z̶̰̬̰͈̅̒̚͝å̷̢̡̦̼̥̘̙̺̩̮̱̟̳̙͂́̇̓̉́͒̎͜ḽ̷̢̣̹̳̊̋ͅg̵̙̞͈̥̳̗͙͚͛̀͘o̴̧̟̞̞̠̯͈͔̽̎͋̅́̈̅̊̒ text). It will, however, exclude certain characters such as zero-width joiners, which causes words in some scripts, such as Devanagari, to get mangled. If you care about an international audience, you may want to check whether your audience uses zero-width joiners.
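If pulling in a segmentation library is not an option, a rough approximation is possible with the standard library only. This is a different, simpler technique than the grapheme-cluster approach above, not a replacement for it: the hypothetical `stripSpecialApprox` below treats every combining mark as belonging to the preceding base rune and keeps it only if that base was a letter or digit. It does not implement the full Unicode segmentation rules, and it drops zero-width joiners just like the original.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripSpecialApprox (hypothetical name) keeps letters and digits plus
// any combining marks that directly follow a kept rune.
func stripSpecialApprox(s string) string {
	var b strings.Builder
	keep := false // whether the most recent base rune was kept
	for _, r := range s {
		switch {
		case unicode.IsMark(r):
			// A combining mark shares the fate of its base rune.
			if keep {
				b.WriteRune(r)
			}
		case unicode.IsLetter(r) || unicode.IsDigit(r):
			keep = true
			b.WriteRune(r)
		default:
			// Punctuation, symbols, spaces, format characters, etc.
			keep = false
		}
	}
	return b.String()
}

func main() {
	// %+q prints non-ASCII runes as escapes so the code points are visible.
	fmt.Printf("%+q\n", stripSpecialApprox("cafe\u0301!?"))   // "cafe\u0301"
	fmt.Printf("%+q\n", stripSpecialApprox("caf\u00e8!? x1")) // "caf\u00e8x1"
}
```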
