0

I need to remove characters such as "+", "/", "_" and similar from strings before running a search on them.

Following another question here, I did this with the gsub method. The problem is that it also strips accented letters, which I don't want:

string.gsub(/[^0-9A-Za-z]/, '')

EDIT: The languages I need to support are Spanish and Catalan.

Is there any way to adapt the expression to preserve accented letters?

Ernesto G

3 Answers

4

Both answers given here so far are plain wrong.

Unicode has two ways to represent an accented letter: a single precomposed character, or a base letter followed by combining diacritics (the decomposed form). With Ruby 2.2+ this is easy to handle:

"Barça".unicode_normalize(:nfc).scan(/\p{L}/)
#⇒ ["B", "a", "r", "ç", "a"]

The above works no matter how “ç” was encoded: as a single precomposed character, or as “c” followed by a combining cedilla.
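A small sketch of the difference between the two forms. The variable names are illustrative and not from the answer, and the strings are written with `\u` escapes so that the distinction survives copy-pasting:

```ruby
# Two spellings of the same visible text "Barça".
composed   = "Bar\u00E7a"   # ç as one code point (U+00E7)
decomposed = "Barc\u0327a"  # c followed by a combining cedilla (U+0327)

puts composed == decomposed                          # false
puts composed.length                                 # 5
puts decomposed.length                               # 6

# After NFC normalization both spellings are the same string.
puts decomposed.unicode_normalize(:nfc) == composed  # true
p decomposed.unicode_normalize(:nfc).scan(/\p{L}/)
```

Without the normalization step, `\p{L}` would match the bare “c” and skip the combining cedilla, because a combining mark is not a letter.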

That said, to remove all non-letters, one would do:

"Barça".unicode_normalize(:nfc).gsub(/[^\p{L}]/, '')

Before Ruby 2.2 there was no standard way to normalize a string to composed form. The simple range À..ÿ works for “mañana” written with the precomposed ñ (U+00F1), but not for “mañana” written as n followed by a combining tilde (U+0303). The two look identical on screen; you can confirm that they differ by copy-pasting both into your irb shell and comparing their lengths.
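For the search use case in the question, the two steps can be wrapped in one method. This is only a sketch: `search_key` is a hypothetical name, and keeping the ASCII digits `0-9` is an assumption carried over from the regex in the question.

```ruby
# Hypothetical helper: normalize to composed form, then drop everything
# that is neither a letter nor an ASCII digit.
def search_key(text)
  text.unicode_normalize(:nfc).gsub(/[^\p{L}0-9]/, '')
end

puts search_key("Bar\u00E7a+2018/_")   # Barça2018

# Both spellings of "mañana" produce the same key.
puts search_key("man\u0303ana") == search_key("ma\u00F1ana")   # true
```

Applying the same helper to both the stored text and the search term means the two are compared in the same form.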

Aleksei Matiushkin
1

You can also use a POSIX bracket expression. These are described in the Regexp section of the Ruby documentation.

In your case you can use either:

string.gsub(/[^[:alpha:]]/, '')

or:

string.gsub(/[^[:alnum:]]/, '')

From the documentation:

/[[:alnum:]]/ - Alphabetic and numeric character

/[[:alpha:]]/ - Alphabetic character
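A short sketch of how the two bracket expressions behave. The sample strings are made up for illustration and use `\u` escapes so that the composed and decomposed forms are unambiguous:

```ruby
sample = "t\u00F3rica_2018+/"   # "tórica_2018+/" with a precomposed ó

puts sample.gsub(/[^[:alpha:]]/, '')   # tórica
puts sample.gsub(/[^[:alnum:]]/, '')   # tórica2018

# A decomposed accent is a separate combining mark, which [[:alpha:]]
# does not match, so it is stripped unless the string is normalized first.
decomposed = "man\u0303ana"     # n followed by a combining tilde (U+0303)
puts decomposed.gsub(/[^[:alpha:]]/, '')                          # manana
puts decomposed.unicode_normalize(:nfc).gsub(/[^[:alpha:]]/, '')  # mañana
```

So the bracket expressions handle precomposed input on their own, while decomposed input still needs `unicode_normalize(:nfc)` beforehand.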

radoAngelov
  • Looks promising but it doesn't work, `"tórica".gsub(/[^[:alnum]]/, '')` returns `"a"` – Ernesto G May 28 '18 at 11:10
  • Sorry, I was missing one colon `:` in the example. It is edited now. – radoAngelov May 28 '18 at 11:21
  • Thank you, that works. I will accept mudasobwa though since he answered before, still appreciated for your answer. – Ernesto G May 28 '18 at 11:40
  • @ErnestoG no, this does not work. Try `"mañana".gsub(/[^[:alpha:]]/, '')`. – Aleksei Matiushkin May 28 '18 at 11:41
  • You are right sorry. It is weird though, if I type "mañana" myself it works, if I copy your "mañana" it doesn't (it outputs "manana") – Ernesto G May 28 '18 at 11:43
  • @ErnestoG it’s not weird, and I explained this behavior in my answer. When you type it on your keyboard, you get the _already composed_ character. Even the lengths of these strings differ: `["man\u0303ana", "ma\u00F1ana"].map(&:length) #⇒ [7, 6]`. That’s the reason for using `unicode_normalize(:nfc)`, and that’s why I wrote that the other answers here are _wrong_. – Aleksei Matiushkin May 28 '18 at 11:51
  • Perfectly clear now, thank you for such an accurate answer – Ernesto G May 28 '18 at 11:54
0

Borrowing from answers to this question, the regex character range for many, but not all, accented characters is À-ÿ. Therefore, to match these too, you can simply add this to your existing ranges:

string.gsub(/[^0-9A-Za-zÀ-ÿ]/, '')

It largely depends on which accents you need, since there are too many accented characters to match all of them with one simple range. This example regex preserves acute and grave accents, for instance, but misses carons:

puts "I went to a café.".gsub(/[^0-9A-Za-zÀ-ÿ]/, '') # Iwenttoacafé
puts "Ahoj, světe!".gsub(/[^0-9A-Za-zÀ-ÿ]/, '')      # Ahojsvte

This might be fine for your use case, but if you're dealing with, say, Czech text, you will need additional character ranges to match letters with carons.
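As a hedged comparison, the sketch below puts the fixed range next to the Unicode letter property `\p{L}`, a different technique from the range used in this answer. The name `latin1_range` is illustrative; it is the same regex as above, written with `\u` escapes.

```ruby
latin1_range = /[^0-9A-Za-z\u00C0-\u00FF]/   # same as /[^0-9A-Za-zÀ-ÿ]/

text = "Ahoj, sv\u011Bte!"   # "Ahoj, světe!" with a precomposed ě (U+011B)

puts text.gsub(latin1_range, '')                           # Ahojsvte
puts text.unicode_normalize(:nfc).gsub(/[^0-9\p{L}]/, '')  # Ahojsvěte

# The range À-ÿ also contains two non-letters, × (U+00D7) and ÷ (U+00F7),
# so they are kept by the range-based regex.
puts "2\u00D73".gsub(latin1_range, '')                     # 2×3
```

The range needs one more entry for every additional alphabet, whereas `\p{L}` covers letters from any script.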

Aaron Christiansen