I'm generating an XML file to make payments, and I have a constraint on users' full names. That parameter only accepts alphabetic characters (a-zA-Z) plus whitespace to separate names and surnames.

I can't find an easy way to filter this. How can I build a regular expression or a filter to get the desired output?

Example:

'Carmen López-Delina Santos' must be 'Carmen LopezDelina Santos'

I need to replace accented vowels with plain vowels (á > a, à > a, â > a, and so on) and also remove special characters such as dots, hyphens, etc.

Thanks!

EnriMR
  • How is `ó` becoming `o`? By the way, `[a-zA-Z]` doesn't cover `ó`. – anubhava Jun 11 '15 at 11:55
  • I need to transform vowels with decorations into single vowels as follows: á > a, à > a, â > a, and so on. – EnriMR Jun 11 '15 at 11:57
  • That requirement must be part of your question, not in the comments. Also don't forget to show your attempt. – anubhava Jun 11 '15 at 11:58
  • @EnriMR Maybe you can check the ASCII table to get the values of the special characters and then make a range. – Francisco Romero Jun 11 '15 at 11:59
  • This seems to be a decent answer for your first need (I like the Guava part): https://stackoverflow.com/a/4283366/4167384 And this one for the special character replacement: https://stackoverflow.com/a/1453284/4167384 – Akah Jun 11 '15 at 12:05

2 Answers

You can first use a Normalizer and then remove the undesired characters:

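import java.text.Normalizer;

String input = "Carmen López-Delina Santos";
// NFD splits each accented letter into a base letter plus combining marks
String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
// Keep only ASCII letters and spaces; this also drops the combining marks
String output = decomposed.replaceAll("[^a-zA-Z ]", "");
System.out.println(output); // prints Carmen LopezDelina Santos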

Note that this does not work for every non-ASCII letter in every language: a letter that has no decomposition into a base letter plus combining marks is simply deleted. One such example is the Turkish dotless ı.

The alternative in that situation is probably to list all the possible letters and their replacement...
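As a rough sketch of that alternative, the snippet below applies a small replacement table before normalizing. The class name `NameSanitizer`, the method `sanitize`, and the entries in the table are all hypothetical and only illustrative; a real table would have to be built for the languages you expect. It assumes Java 9 or later for `Map.of`.

```java
import java.text.Normalizer;
import java.util.Map;

public class NameSanitizer {
    // Hypothetical, non-exhaustive table of letters that NFD leaves untouched.
    private static final Map<Character, String> REPLACEMENTS = Map.of(
        '\u0131', "i",   // Turkish dotless i
        '\u00f8', "o",   // o with stroke
        '\u00d8', "O",   // O with stroke
        '\u0142', "l",   // l with stroke
        '\u0141', "L",   // L with stroke
        '\u00df', "ss",  // sharp s
        '\u00e6', "ae",  // ae ligature
        '\u00c6', "AE"   // AE ligature
    );

    public static String sanitize(String input) {
        // Step 1: substitute the letters that have no canonical decomposition
        StringBuilder mapped = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                mapped.append(replacement);
            } else {
                mapped.append(c);
            }
        }
        // Step 2: decompose the remaining accented letters
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        // Step 3: keep only ASCII letters and spaces
        return decomposed.replaceAll("[^a-zA-Z ]", "");
    }

    public static void main(String[] args) {
        // "Carmen López-Delina Santos"
        System.out.println(sanitize("Carmen L\u00f3pez-Delina Santos"));
        // A name containing s-cedilla, dotless i and O with stroke
        System.out.println(sanitize("I\u015f\u0131k \u00d8ster"));
    }
}
```

Any letter that is neither in the table nor decomposable is still deleted by the final `replaceAll`, so the table only narrows the problem rather than solving it for every script.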

assylias
  • This is exactly what I need, because the system that receives my XML file doesn't allow any other characters in the name field. – EnriMR Jun 11 '15 at 13:16

You can use this removeAccents method, followed by a replaceAll with [^A-Za-z ]:

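import java.text.Normalizer;
import java.text.Normalizer.Form;

// Decompose accented letters (NFD), then strip the combining diacritical marks
public static String removeAccents(String text) {
  return text == null ? null :
    Normalizer.normalize(text, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}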

The Normalizer decomposes each original character into a base character followed by one or more combining diacritical marks. For example, á, é and í all use the same mark: U+0301, the combining acute accent.

The \p{InCombiningDiacriticalMarks}+ regular expression will match all such diacritic codes and we will replace them with an empty string.

And in the caller:

String original = "Carmen López-Delina Santos";
String res = removeAccents(original).replaceAll("[^A-Za-z ]", "");
System.out.println(res);

See IDEONE demo
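To see the decomposition described above, a small sketch like the following prints the code points of the NFD form. The class `DecompositionDemo` and its `codePoints` helper are hypothetical names introduced only for this illustration.

```java
import java.text.Normalizer;

public class DecompositionDemo {
    // Returns the code points of the NFD form of the text, separated by spaces
    public static String codePoints(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePoints("\u00e1")); // a with acute: U+0061 U+0301
        System.out.println(codePoints("\u00e9")); // e with acute: U+0065 U+0301
        System.out.println(codePoints("\u00ed")); // i with acute: U+0069 U+0301
    }
}
```

Each accented vowel becomes its plain base letter followed by U+0301, which is why removing the combining marks leaves only the unaccented letter.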

Wiktor Stribiżew