You can use a Unicode text segmentation library to iterate over grapheme clusters, and check that the first rune in each grapheme cluster has the right category (letter or digit).
import (
"strings"
"unicode"
"github.com/rivo/uniseg"
)
func stripSpecial(s string) string {
var b strings.Builder
gr := uniseg.NewGraphemes(s)
for gr.Next() {
r := gr.Runes()[0]
if unicode.IsLetter(r) || unicode.IsDigit(r) {
b.WriteString(gr.Str())
}
}
return b.String()
}
The code works by first breaking the string into grapheme clusters,
"cafè!?" -> ["c", "a", "f", "è", "!", "?"]
Each grapheme cluster may contain multiple Unicode code points. The first code point determines the type of character, and the remaining code points (if any) are accent marks or other modifiers. So we filter and concatenate:
["c", "a", "f", "è"] -> "cafè"
This will pass through any accented or unaccented letters and digits, no matter how they are normalized, and no matter what accents (including z̶̰̬̰͈̅̒̚͝å̷̢̡̦̼̥̘̙̺̩̮̱̟̳̙͂́̇̓̉́͒̎͜ḽ̷̢̣̹̳̊̋ͅg̵̙̞͈̥̳̗͙͚͛̀͘o̴̧̟̞̞̠̯͈͔̽̎͋̅́̈̅̊̒ text). It will exclude certain characters like zero-width joiners which will cause words in certain languages to get mangled... so if you care about an international audience, you may want to review if your audience uses zero-width joiners. So, this will mangle certain scripts like Devanagari.