Remove non-alphanumeric chars from a string while preserving accented chars

Question

I need to remove the chars such as "+", "/", "_" and similar from strings in order to perform a search method.

Following another question here, I used the gsub method. The problem is that it also removes accented letters, which I don't want:

string.gsub(/[^0-9A-Za-z]/, '')

EDIT: The languages I need to support are Spanish and Catalan.

Is there any way to adapt the expression to preserve the letters with accents?

Which accents in particular do you wish to preserve? What languages are you dealing with? — Aaron Christiansen, May 28 '18 at 10:40
Spanish and Catalonian, so I need to support both acute and grave accents — Ernesto G, May 28 '18 at 10:49
`string.gsub(/[[:punct:]]/, '')` removes punctuation characters. — Stefan, May 28 '18 at 12:40

Aleksei Matiushkin · Accepted Answer · 2018-05-28T11:43:04.307

4

Both answers given here so far are plain wrong.

Unicode represents accented letters in two ways: precomposed characters and decomposed sequences (a base letter followed by combining diacritics). With Ruby 2.2+, which provides unicode_normalize, handling both is easy:

"Barça".unicode_normalize(:nfc).scan(/\p{L}/)
#⇒ ["B", "a", "r", "ç", "a"]

The above works no matter how “ç” was constructed: as a single precomposed character (U+00E7, also present in Latin-1) or as “c” followed by a combining cedilla (U+0327).

That said, to remove all non letters, one would do:

"Barça".unicode_normalize(:nfc).gsub(/[^\p{L}]/, '')
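Note that the pattern above also strips digits, whereas the pattern in the question kept 0-9. A minimal sketch that keeps ASCII digits as well, assuming that is what the search needs; the helper name strip_non_alphanumeric is hypothetical and not part of the question or of Ruby:

```ruby
# Hypothetical helper: keeps Unicode letters and ASCII digits, drops the rest.
# Normalizing to NFC first turns "letter + combining mark" pairs into single
# precomposed letters wherever Unicode defines one, so \p{L} matches them.
def strip_non_alphanumeric(text)
  text.unicode_normalize(:nfc).gsub(/[^\p{L}0-9]/, '')
end

# "\u00E7" is the precomposed "ç"; "c\u0327" is "c" + combining cedilla.
puts strip_non_alphanumeric("Bar\u00E7a_2018/19+")   # Barça201819
puts strip_non_alphanumeric("Barc\u0327a_2018/19+")  # Barça201819
```

Replacing 0-9 with \p{Nd} would accept decimal digits from other scripts too; which one is appropriate depends on the data being searched.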

Before Ruby 2.2 there was no built-in way to normalize a string to composed form. For the composed “mañana” (ñ as the single code point U+00F1) the simple range À-ÿ would work, but for the decomposed “mañana” (n followed by the combining tilde U+0303) it won't. You can verify that the two differ by copy-pasting both into your irb shell.
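Because the two forms look identical on screen, the difference is easier to see when the strings are written with explicit escapes. The following is only an illustrative sketch; the variable names are made up for the example:

```ruby
composed   = "ma\u00F1ana"   # "ñ" as one code point (U+00F1)
decomposed = "man\u0303ana"  # "n" followed by combining tilde (U+0303)

p composed.length            # => 6
p decomposed.length          # => 7
p composed == decomposed     # => false

# The plain range from the question plus À-ÿ (U+00C0..U+00FF).
range_only = /[^0-9A-Za-z\u00C0-\u00FF]/
p composed.gsub(range_only, '')    # => "mañana"
p decomposed.gsub(range_only, '')  # => "manana" (the combining tilde is dropped)

# Without normalization, \p{L} also drops the combining mark.
p decomposed.gsub(/[^\p{L}]/, '')  # => "manana"

# After NFC normalization both inputs give the same result.
normalized = decomposed.unicode_normalize(:nfc)
p normalized == composed                         # => true
p normalized.gsub(/[^\p{L}]/, '') == composed    # => true
```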

edited May 28 '18 at 11:43

answered May 28 '18 at 11:10

Aleksei Matiushkin


This answer seems much more sensible and comprehensive than mine. I'll have to remember this in future; thanks! – Aaron Christiansen May 28 '18 at 11:38
Thank you, I have to use `"Barça".unicode_normalize(:nfc).gsub(/[^\p{L}]/, '')` since I'm in Ruby 2.3.1, but it works. – Ernesto G May 28 '18 at 11:41
Oh, indeed, I mistakenly thought `unicode_normalize` was introduced in `2.5` only. Will update the answer. – Aleksei Matiushkin May 28 '18 at 11:42

radoAngelov · Answer 2 · 2018-05-28T11:22:25.427

1

You can also use a POSIX bracket expression. You will find all needed documentation in the ruby-docs.

In your case you can use either:

string.gsub(/[^[:alpha:]]/, '')

or:

string.gsub(/[^[:alnum:]]/, '')

From the documentation:

/[[:alnum:]]/ - Alphabetic and numeric character

/[[:alpha:]]/ - Alphabetic character
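A short sketch of how these bracket expressions behave on the two Unicode forms of an accented letter. It assumes UTF-8 strings, and the variable names are hypothetical. The bracket expression keeps a precomposed accented letter but removes a separate combining mark, so normalizing to NFC first appears to be needed here as well:

```ruby
composed   = "t\u00F3rica"   # "ó" as one code point (U+00F3)
decomposed = "to\u0301rica"  # "o" followed by combining acute accent (U+0301)

p composed.gsub(/[^[:alpha:]]/, '')    # => "tórica"
p decomposed.gsub(/[^[:alpha:]]/, '')  # => "torica" (accent lost)

# Normalizing first makes both inputs behave the same way.
fixed = decomposed.unicode_normalize(:nfc).gsub(/[^[:alpha:]]/, '')
p fixed == composed                    # => true

# [:alnum:] additionally keeps digits.
p "t\u00F3rica_2018+".gsub(/[^[:alnum:]]/, '')  # => "tórica2018"
```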

edited May 28 '18 at 11:22

answered May 28 '18 at 11:02

radoAngelov


Looks promising but it doesn't work, `"tórica".gsub(/[^[:alnum]]/, '')` returns `"a"` – Ernesto G May 28 '18 at 11:10
Sorry, I was missing one colon `:` in the example. It is fixed now. – radoAngelov May 28 '18 at 11:21
Thank you, that works. I will accept mudasobwa though since he answered before, still appreciated for your answer. – Ernesto G May 28 '18 at 11:40
@ErnestoG no, this does not work. Try `"mañana".gsub(/[^[:alpha:]]/, '')`. – Aleksei Matiushkin May 28 '18 at 11:41
You are right sorry. It is weird though, if I type "mañana" myself it works, if I copy your "mañana" it doesn't (it outputs "manana") – Ernesto G May 28 '18 at 11:43
@ErnestoG it’s not weird and I explained this behavior in my answer. When you type it using your keyboard, it spits out the _already combined_ value. Even the length of these strings differs: `%w|mañana mañana|.map(&:length) #⇒ [7, 6]`. That’s the reason of using `unicode_normalize(:nfc)` and that’s why I wrote other answers here are _wrong_. – Aleksei Matiushkin May 28 '18 at 11:51
Perfectly clear now, thank you for such an accurate answer – Ernesto G May 28 '18 at 11:54

Aaron Christiansen · Answer 3 · 2018-05-28T10:44:07.497

Borrowing from answers to this question, the regex character range for many, but not all, accented characters is À-ÿ. Therefore, to match these too, you can simply add this to your existing ranges:

string.gsub(/[^0-9A-Za-zÀ-ÿ]/, '')

It largely depends on the accents you're looking for, since there are too many accented characters to easily match all of them. This example regex preserves, for instance, acute and grave accents, but misses carons (háčky):

puts "I went to a café.".gsub(/[^0-9A-Za-zÀ-ÿ]/, '') # Iwenttoacafé
puts "Ahoj, světe!".gsub(/[^0-9A-Za-zÀ-ÿ]/, '')      # Ahojsvte

This might be fine for your use case, but if you're dealing with, say, Czech text, you will need additional character ranges to match letters with carons.

Try `"mañana"`. – Aleksei Matiushkin May 28 '18 at 11:16
