Java regex for NonAsciiCharacters

Question

I'm using this little snippet.

string.replaceAll("[^\\p{ASCII}]","")

I want to remove the non-ASCII characters, but I have a problem. For example, the following string is getting stripped:

final String myString = "cada dia es más cercano a Dios.";

The á is getting removed, and it is ASCII character 225. I thought this regex would replace only the non-ASCII characters, and á is an ASCII character, so why is this happening?

Maybe I got it all wrong.

https://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette — Reimeus, Jan 10 '19 at 17:45

Karol Dowbecki · Accepted Answer · 2019-01-10T18:06:24.673


á (a-acute) is not part of the ASCII character set, which only covers code points 0 to 127. It is the Unicode character 'LATIN SMALL LETTER A WITH ACUTE' (U+00E1, decimal 225) and belongs to the Latin-1 Supplement Unicode block.

You can see it by running:

"á".codePoints()
   .mapToObj(Integer::toHexString)
   .forEach(System.out::println); // e1

To keep á you can either white-list this specific character in the pattern:

string.replaceAll("[^\\p{ASCII}á]", "")

or white-list a larger group, e.g. \p{L}, which matches all Unicode letters.
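As a minimal sketch of the second option (the class and method names here are made up for illustration, not taken from the question), the letter category can be added to the negated character class next to \p{ASCII}. The non-ASCII characters are written as \u escapes only so the file compiles regardless of source encoding. Note that \p{L} keeps letters from every script, not just accented Latin ones, so this may keep more than you want.

```java
public class NonAsciiFilter {

    // Hypothetical helper: removes every character that is neither
    // ASCII nor a Unicode letter (general category L).
    static String keepAsciiAndLetters(String input) {
        return input.replaceAll("[^\\p{ASCII}\\p{L}]", "");
    }

    public static void main(String[] args) {
        // \u00e1 is the a-acute from the question, \u2713 is a check mark symbol.
        String myString = "cada dia es m\u00e1s cercano a Dios.\u2713";

        // The accented letter is kept, the check mark is removed.
        System.out.println(keepAsciiAndLetters(myString));
    }
}
```

If only a handful of accented characters should survive, listing them explicitly as in the first option is the narrower choice.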

edited Jan 10 '19 at 18:06

answered Jan 10 '19 at 17:49

Karol Dowbecki


And is there a way to remove all the non-ASCII characters but keep the á? Sorry – chiperortiz Jan 10 '19 at 17:58
@chiperortiz I updated the answer, but the solution will depend on what you want to do – Karol Dowbecki Jan 10 '19 at 18:06
You guessed it right, thanks mate. Best regards from Venezuela – chiperortiz Jan 10 '19 at 18:10
