2

I have a sample String containing characters like á, é, í, ó, ú, ü, ñ, and I want to replace these special characters, for example:
á with a
é with e
and so on.

I have a map with each special character as the key and its corresponding replacement as the value.
Now suppose I pass the String "novás músíc" into a method where a regex validates it; if any of the special characters mentioned above is found, it should be replaced with the mapped character.

Please help me with the regex validation part.

Deduplicator
Sharique
  • Can you show your code? – Sven Hohenstein Feb 15 '15 at 08:59
  • You understand that these are **not** "special" characters, right? And that `novás` is *misspelled* if you change it to `novas` instead? It's 2015, it's completely unnecessary and inappropriate in today's world to force languages to conform to the English alphabet. – T.J. Crowder Feb 15 '15 at 09:04
  • A regex is not the right tool to replace a set of characters one by one in a string. It is more efficient and less complex to iterate over the characters and replace the one character if needed. – vanje Feb 15 '15 at 09:06
  • @T.J.Crowder there are valid use cases for this, for example I've used it when implementing a search tool - the strings I show to users are always the original ones, but internally I normalise both the documents and the queries so a user whose keyboard doesn't do accents can perform a search without accents and find documents with and vice versa. – Ian Roberts Feb 15 '15 at 09:43
  • @IanRoberts: Absolutely, a small number of very limited use cases. But this pervasive belief that these characters are in some way "special" is best refuted barring such a case being cited. – T.J. Crowder Feb 15 '15 at 09:48
  • In Danish, one would (when forced) replace "å" with "aa". Search libraries could match å to aa and aa to å with a higher weight than å to a and a to å. – Tom Blodget Feb 15 '15 at 15:27
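
The character-by-character approach suggested in the comments above could be sketched roughly as follows. This is only an illustration: the class name `CharMapReplacer`, the method `replaceMapped`, and the contents of the map are assumptions, since the question does not show its actual map. Non-ASCII letters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

final class CharMapReplacer {
    // Hypothetical mapping; the asker's real map is not shown in the question.
    static final Map<Character, Character> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put('\u00E1', 'a'); // a with acute
        REPLACEMENTS.put('\u00E9', 'e'); // e with acute
        REPLACEMENTS.put('\u00ED', 'i'); // i with acute
        REPLACEMENTS.put('\u00F3', 'o'); // o with acute
        REPLACEMENTS.put('\u00FA', 'u'); // u with acute
        REPLACEMENTS.put('\u00FC', 'u'); // u with diaeresis
        REPLACEMENTS.put('\u00F1', 'n'); // n with tilde
    }

    // Walks the string once and swaps each mapped character for its replacement;
    // characters without a mapping are copied unchanged.
    static String replaceMapped(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Character mapped = REPLACEMENTS.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }
}
```

With this map, `CharMapReplacer.replaceMapped("novás músíc")` would give `novas music`. A one-to-one `char` map cannot express multi-character replacements such as "å" to "aa"; a `Map<Character, String>` would be needed for that.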

2 Answers

3

You can do this via Unicode normalization (decomposing each accented character into a base letter plus combining marks), followed by a regular expression to remove the combining diacritical marks.

See this question and its accepted answer: "Convert Unicode to ASCII without changing the string length (in Java)"
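
A minimal sketch of that idea, assuming the standard `java.text.Normalizer` class and the `\p{M}` (combining mark) regex category; the class name `AccentStripper` and the method `stripAccents` are made up for illustration. Non-ASCII letters in the comments are avoided so the source compiles regardless of file encoding.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Removes diacritics by decomposing and then deleting the combining marks.
    static String stripAccents(String input) {
        // NFD splits each accented letter into its base letter followed by
        // one or more combining marks (e.g. U+00E1 becomes 'a' + U+0301).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches any Unicode combining mark; deleting them leaves the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Calling `AccentStripper.stripAccents("novás músíc")` would return `novas music`. This needs no hand-maintained map, but it only handles characters that have a canonical decomposition; letters such as "ø" or "ß" are left as they are.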

Community
Jherico
-1

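You can use this regex, which matches any single character outside the ASCII range: [^\x00-\x7F]

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String source = args[0];
// In a Java string literal the backslashes must be doubled; map is assumed to be a Map<String, String>
Pattern p = Pattern.compile("[^\\x00-\\x7F]");
Matcher m = p.matcher(source);
StringBuffer result = new StringBuffer();
// find() must succeed before group() can be called
while (m.find()) {
    if (map.containsKey(m.group())) {
        // Replace with the mapped value
        m.appendReplacement(result, Matcher.quoteReplacement(map.get(m.group())));
    } else {
        // Put a default value here; this version keeps the character unchanged
        m.appendReplacement(result, Matcher.quoteReplacement(m.group()));
    }
}
m.appendTail(result);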

This is based only on the little information provided in your question; you would need to elaborate to get a more detailed answer. Note that this regex only distinguishes ASCII from non-ASCII characters (list here), so it will also match characters that are not in your map.
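
For reference, a self-contained sketch of the regex-plus-map idea from this answer, wrapped in a method. The names `RegexMapReplacer`, `replaceNonAscii`, and the `fallback` parameter are hypothetical, and the map is assumed to use single-character `String` keys.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexMapReplacer {
    // Matches any single character outside the ASCII range 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    // Replaces every non-ASCII character with its mapped value,
    // or with the given fallback when the map has no entry for it.
    static String replaceNonAscii(String source, Map<String, String> map, String fallback) {
        Matcher m = NON_ASCII.matcher(source);
        StringBuffer result = new StringBuffer();
        while (m.find()) {
            String replacement = map.getOrDefault(m.group(), fallback);
            // quoteReplacement guards against '$' or '\' in the replacement text.
            m.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(result);
        return result.toString();
    }
}
```

Given a map containing á→a, ú→u and í→i, `replaceNonAscii("novás músíc", map, "?")` would return `novas music`, while an unmapped character such as "ñ" would become `?`.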

santiago92