12

I want to detect and remove high-ASCII characters like ®, ©, ™ from a String in Java. Is there any open-source library that can do this?

Michael Petrotta
RandomQuestion
  • 1
    possible duplicate of [replace special characters in string in java](http://stackoverflow.com/questions/2608205/replace-special-characters-in-string-in-java) – Woot4Moo Feb 15 '11 at 19:23

4 Answers

35

If you need to remove all non-US-ASCII (i.e. outside 0x0-0x7F) characters, you can do something like this:

s = s.replaceAll("[^\\x00-\\x7f]", "");

If you need to filter many strings, it would be better to use a precompiled pattern:

private static final Pattern nonASCII = Pattern.compile("[^\\x00-\\x7f]");
...
s = nonASCII.matcher(s).replaceAll("");

And if it's really performance-critical, perhaps Alex Nikolaenkov's suggestion would be better.
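For reference, the precompiled-pattern approach above could be packaged roughly as follows. This is only a minimal sketch: the class name `AsciiFilter` and the method name `stripNonAscii` are made up for illustration, and non-ASCII characters are written as Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.regex.Pattern;

final class AsciiFilter {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    private AsciiFilter() {}

    // Hypothetical helper: removes every char outside the range 0x00-0x7F.
    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Input contains the registered, copyright and trademark signs.
        System.out.println(stripNonAscii("Java\u00AE \u00A9 2011 Acme\u2122"));
    }
}
```

Note that only the symbols are removed; the spaces around them stay, so you may want to collapse repeated whitespace afterwards.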

axtavt
  • Are Type 1 high-ASCII characters the same as high-ASCII characters? Would the above regex also remove symbols like $ and the pound sign? – RandomQuestion Feb 15 '11 at 19:20
  • Be careful if you want to filter a lot of strings with this pattern. `String.replaceAll` compiles the pattern on every call and creates a new `String` object behind the scenes. – Alex Nikolaenkov Feb 15 '11 at 19:21
  • @Jitendra: It removes all characters that are not in the [ASCII table](http://en.wikipedia.org/wiki/ASCII). – axtavt Feb 15 '11 at 19:27
  • @axtavt Is it possible to modify the above regex so that certain characters are retained? For example, I want to keep the £ sign in the string. – RandomQuestion Feb 15 '11 at 19:36
  • I am really new to regexes. I found it after a little experimenting: `s.replaceAll("[^\\x00-\\x7f£]", "");` should work. Thanks all! – RandomQuestion Feb 15 '11 at 19:48
  • If you use Guava like I suggest: `CharMatcher.ASCII.or(CharMatcher.is('£')).retainFrom(string);` – sjr Feb 15 '11 at 20:35
  • You can even replace "®" with "(R)", "©" with "(C)" and "™" with "TM" with replaceAll if you wish. "£" with "#" or "(pound)" – Peter Lawrey Feb 15 '11 at 20:39
  • @axtavt Shouldn't this be `replaceAll("[^\\x20-\\x7f]", "")`? – rk2010 Apr 17 '12 at 18:11
  • @axtavt Can we also replace the " symbol with this? – Akash Sep 28 '16 at 06:54
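
The suggestions in these comments (keep £, turn ®, © and ™ into ASCII stand-ins, strip the rest) can be combined along these lines. Treat it as a sketch rather than a definitive implementation: the class name `SymbolCleaner` and the method name `clean` are hypothetical, and the replacement strings are simply the ones proposed in the comments.

```java
import java.util.regex.Pattern;

final class SymbolCleaner {
    // Matches everything outside ASCII except the pound sign (U+00A3).
    private static final Pattern UNWANTED =
            Pattern.compile("[^\\x00-\\x7F\u00A3]");

    private SymbolCleaner() {}

    // Hypothetical helper: substitute known symbols first, then strip the rest.
    static String clean(String s) {
        String replaced = s
                .replace("\u00AE", "(R)")  // registered sign
                .replace("\u00A9", "(C)")  // copyright sign
                .replace("\u2122", "TM");  // trademark sign
        return UNWANTED.matcher(replaced).replaceAll("");
    }
}
```

The substitutions have to run before the strip step, otherwise the symbols are already gone by the time you try to replace them.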
16

I think you can easily filter the string by hand, checking the code of each character. If a character fits your requirements, append it to a StringBuilder and call toString() on it at the end.

public static String filter(String str) {
    StringBuilder filtered = new StringBuilder(str.length());
    for (int i = 0; i < str.length(); i++) {
        char current = str.charAt(i);
        if (current >= 0x20 && current <= 0x7e) {
            filtered.append(current);
        }
    }

    return filtered.toString();
}
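
Note that the range 0x20-0x7e covers printable ASCII only, so control characters such as line breaks and tabs are dropped as well. If those should survive, a variant of the loop could look like this. It is only a sketch: the class name `PrintableAsciiFilter` is invented here, and which control characters to keep is an assumption you should adapt.

```java
final class PrintableAsciiFilter {
    private PrintableAsciiFilter() {}

    // Keeps printable ASCII (0x20-0x7E) plus newline, carriage return and tab.
    static String filter(String str) {
        StringBuilder filtered = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            boolean printable = c >= 0x20 && c <= 0x7E;
            boolean keptWhitespace = c == '\n' || c == '\r' || c == '\t';
            if (printable || keptWhitespace) {
                filtered.append(c);
            }
        }
        return filtered.toString();
    }
}
```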
Alex Nikolaenkov
  • Could you please explain in a little more detail what you mean by filtering the string by hand and checking the code of a particular character? Did you mean the way of filtering shown above? – RandomQuestion Feb 15 '11 at 19:33
  • This seems to work great, except that it removes newlines for me, and neither of these works: `if (current >= 0x00 && current <= 0x7e)` or `if (current == '\n' || (...))`, which is super weird! – user1443778 Jul 25 '12 at 08:19
5

I understand that you need to delete characters such as ç, ã, Ã, but for anybody who needs to convert ç, ã, Ã to c, a, A, please have a look at this piece of code:

Example Code:

// requires: import java.text.Normalizer;
final String input = "Tĥïŝ ĩš â fůňķŷ Šťŕĭńġ";
System.out.println(
    Normalizer
        .normalize(input, Normalizer.Form.NFD)
        .replaceAll("[^\\p{ASCII}]", "")
);

Output:

This is a funky String
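
As a self-contained version of the same idea, something like the following should work. The class name `AccentStripper` and the method name `toAscii` are made up for this sketch. One caveat: NFD only helps for characters that decompose into a base letter plus combining marks; characters without such a decomposition (ß or ø, for example) are simply removed by the second step.

```java
import java.text.Normalizer;

final class AccentStripper {
    private AccentStripper() {}

    // Hypothetical helper: decompose accented letters, then drop all non-ASCII chars.
    static String toAscii(String input) {
        // NFD splits e.g. U+00E7 into 'c' followed by the combining cedilla U+0327.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // The combining marks (and anything else outside ASCII) are removed here.
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }
}
```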

molu2008
5

A nice way to do this is to use Google Guava CharMatcher:

String newString = CharMatcher.ASCII.retainFrom(string);

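newString will contain only the ASCII characters (code point < 128) from the original string. (Newer Guava releases expose the same matcher through the CharMatcher.ascii() factory method instead of the ASCII constant.)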

This reads more naturally than a regular expression. Regular expressions can take more effort to understand for subsequent readers of your code.

sjr
  • They can, but axtavt's answer above is simple and can be made readable with a short comment explaining what's happening. The regex in his answer isn't hard to decode at all. Your answer relies on a library that needs to be downloaded and set up as a dependency, which is much more work than axtavt's answer. – jluzwick Feb 15 '11 at 19:27
  • 1
    Any Java project should include this library anyway. It will save you a lot of work in the long run. Sometimes you have to do a bit of work up front to save more effort later. :) – sjr Feb 15 '11 at 19:29
  • You may be right about this Java library being useful (it looks pretty good), but alas it does not answer the question as well as the Pattern answer. – jluzwick Feb 15 '11 at 19:36
  • 3
    That depends on your definition of "best". Anyway, I can't convince you; you should use Google Guava wherever you can and let it convince you. – sjr Feb 15 '11 at 20:25