How can non-ASCII characters be removed from a string?

Question

I have the strings "A função" and "Ãugent", in which I need to replace characters like ç, ã, and Ã with empty strings.

How can I remove those non-ASCII characters from my string?

I have attempted to implement this using the following function, but it is not working properly. One problem is that the unwanted characters are getting replaced by the space character.

public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
    String newsrcdta = null;
    char array[] = Arrays.stringToCharArray(tmpsrcdta);
    if (array == null)
        return newsrcdta;

    for (int i = 0; i < array.length; i++) {
        int nVal = (int) array[i];
        boolean bISO =
                // Is character ISO control
                Character.isISOControl(array[i]);
        boolean bIgnorable =
                // Is Ignorable identifier
                Character.isIdentifierIgnorable(array[i]);
        // Remove tab and other unwanted characters..
        if (nVal == 9 || bISO || bIgnorable)
            array[i] = ' ';
        else if (nVal > 255)
            array[i] = ' ';
    }
    newsrcdta = Arrays.charArrayToString(array);

    return newsrcdta;
}

Possible duplicate of [Fastest way to strip all non-printable characters from a Java String](http://stackoverflow.com/questions/7161534/fastest-way-to-strip-all-non-printable-characters-from-a-java-string) — Stewart, Oct 14 '16 at 17:34

score 180 · Accepted Answer · answered Dec 15 '11 at 12:05


This will find and remove all non-ASCII characters:

String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");


FailedDev


thanks for response.. but this "A" is still not able to replace with empty string. – rahulsri Dec 15 '11 at 12:31

@rahulsri A is a perfectly valid ASCII character. Why should it be replaced? – FailedDev Dec 15 '11 at 12:39
@Dev i think it is not visible but this is a Latin character whose Unicode value is "\u00c3". – rahulsri Dec 15 '11 at 12:51
@rahulsri Can you please edit your question to include the text that cannot be replaced? – FailedDev Dec 15 '11 at 12:59
@rahulsri **\u00c3** == `Ã` and yes, it is replaced. You have something wrong elsewhere. – FailedDev Dec 15 '11 at 13:08

Most likely you want to strip non-printable and control characters, too. In that case you would use the following regexp: `"[^\\x20-\\x7E]"` Or simply: `"[^ -~]"` – Zouppen Dec 19 '12 at 11:43

`"[^\\p{ASCII}]"` is an equivalent alternative to `"[^\\x00-\\x7F]"`. – M. Justin Dec 06 '20 at 08:02
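The difference between the patterns mentioned in this answer and its comments can be sketched as follows. This is only an illustration: the class and method names are made up for the example, and the input uses Unicode escapes so that it does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any answer above.
final class AsciiFilterSketch {
    // Matches anything outside U+0000..U+007F (same set as "[^\\x00-\\x7F]").
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");
    // Matches anything outside the printable range, space (0x20) to tilde (0x7E).
    private static final Pattern NON_PRINTABLE_ASCII = Pattern.compile("[^\\x20-\\x7E]");

    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    static String stripNonPrintableAscii(String s) {
        return NON_PRINTABLE_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "A\tfun\u00e7\u00e3o"; // "A<TAB>função"
        // The tab is an ASCII character, so only ç and ã are removed.
        System.out.println(stripNonAscii(input));
        // The tab is a control character, so it is removed as well: prints "Afuno".
        System.out.println(stripNonPrintableAscii(input));
    }
}
```

Which variant is appropriate depends on whether tabs, line breaks, and other control characters should survive.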

score 97 · Answer 2 · Michael Böckling · 2014-04-23T14:18:59.487

FailedDev's answer is good, but it can be improved. If you want to preserve the ASCII equivalents of accented characters, you need to normalize first:

import java.text.Normalizer;

String subjectString = "öäü";
subjectString = Normalizer.normalize(subjectString, Normalizer.Form.NFD);
String resultString = subjectString.replaceAll("[^\\x00-\\x7F]", "");

=> will produce "oau"

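NFD decomposes each accented letter into its base letter followed by a combining mark, and the regex then removes only the mark. That way, characters like "öäü" are mapped to "oau", which at least preserves some information. Without normalization, the resulting String would be empty.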


answered Jul 22 '13 at 11:07



Your answer is good, but can be improved. Removing the usage of Regex in your code and replacing it with a for loop is incredibly faster (20-40x). More here: http://stackoverflow.com/a/15191508/2511884 – Saket Dec 28 '14 at 12:32
Thanks for the hint. The extent of the difference in performance was unexpected. – Michael Böckling Dec 28 '14 at 16:49

You probably want to use Normalizer.Form.NFKD rather than NFD - NFKD will convert things like ligatures into ascii characters (eg ﬁ to fi), NFD will not do this. – chesterm8 Feb 15 '17 at 00:56
`Normalizer.normalize("ãéío – o áá", Normalizer.Form.NFD).replaceAll("[^\\x00-\\x7F]", "");` yields "aeio o aa" but `echo "ãéío – o áá" | iconv -f utf8 -t ascii//TRANSLIT` yields "aeio - o aa". Is there a way to make java replace "–" with "-" like with iconv? – dvlcube Sep 20 '17 at 20:36
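Two of the comments above (a plain loop instead of a regex, and NFKD instead of NFD) can be combined into one small sketch. The class and method names are hypothetical, and no claim is made here about the speedup reported in the comment; the example only shows how the two normalization forms differ on a ligature.

```java
import java.text.Normalizer;

// Hypothetical helper class for illustration only.
final class FoldToAsciiSketch {
    // Decomposes the input with the given form, then keeps only chars below 128.
    static String fold(String s, Normalizer.Form form) {
        String decomposed = Normalizer.normalize(s, form);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); i++) {
            char c = decomposed.charAt(i);
            if (c < 128) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String umlauts = "\u00f6\u00e4\u00fc"; // "öäü"
        String ligature = "\ufb01";            // "ﬁ" as a single ligature character
        System.out.println(fold(umlauts, Normalizer.Form.NFD));   // prints "oau"
        System.out.println(fold(ligature, Normalizer.Form.NFD));  // prints an empty line
        System.out.println(fold(ligature, Normalizer.Form.NFKD)); // prints "fi"
    }
}
```

The ligature has only a compatibility decomposition, so NFD leaves it intact and the filter then drops it, while NFKD splits it into "f" and "i" first.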

score 27 · Answer 3 · answered Dec 15 '11 at 12:09


This would be the Unicode solution

String s = "A função, Ãugent";
String r = s.replaceAll("\\P{InBasic_Latin}", "");

\p{InBasic_Latin} is the Unicode block that contains all characters in the Unicode range U+0000..U+007F (see regular-expressions.info)

\P{InBasic_Latin} is the negated \p{InBasic_Latin}


stema



(Note to anyone confused like me: the uppercase \P is negation.) – ShreevatsaR Dec 31 '13 at 04:31

@user1187719, you could be more precise than "This does not work". This answer has already received some upvotes, so it cannot be completely useless. Of course, if you have a Java version before [Java 7](http://docs.oracle.com/javase/tutorial/essential/regex/unicode.html), then I agree. Unicode in regex is not working there. – stema Dec 24 '14 at 12:43
@stema - I ran it in Java 6, so your Java 7 theory holds water. – Entropy Jan 05 '15 at 12:41
it removes the special characters and does "not" replace them with their ASCII equivalents – AL̲̳I Aug 05 '16 at 11:41
@Ali, yes, you understood my answer exactly. This is what was asked for 5 years ago. If it is not what you need, go with Michael Böckling's answer. – stema Aug 05 '16 at 11:50

score 3 · Answer 4 · answered Dec 15 '11 at 12:19


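You can try something like this. In Unicode (and ISO-8859-1), accented letters start at code point 192 (À), so you can drop every character at or above that value. Note that this still keeps the non-ASCII characters in the range 128 to 191.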

String name = "A função";

StringBuilder result = new StringBuilder();
for(char val : name.toCharArray()) {
    if(val < 192) result.append(val);
}
System.out.println("Result "+result.toString());


mmodi


Why do you check against 192 and not 128 (which would be the end of the ASCII table)? You are assuming a certain encoding (I think ISO-8859-1), but what if the encoding is ISO-8859-2/3/4/5/7... ? There are letters in those areas of the table. – stema Dec 15 '11 at 12:43
Yes, It depends upon the number of characters we want to allow as well as the encoding. This is just the example. We can add condition based on required characters and encoding. – mmodi Dec 16 '11 at 12:03
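The effect of the threshold discussed in these comments can be seen in a short sketch. The class and method names are invented for the example. Since a Java String holds UTF-16 code units, the comparison is made against Unicode values, so 128 is the cutoff that corresponds to "non-ASCII".

```java
// Hypothetical helper class for illustration only.
final class ThresholdSketch {
    // Keeps only chars strictly below the given threshold.
    static String keepBelow(String s, int threshold) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            if (c < threshold) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "\u00a9 A fun\u00e7\u00e3o"; // "© A função"
        // The copyright sign is code point 169, so a threshold of 192 keeps it.
        System.out.println(keepBelow(input, 192)); // prints "© A funo"
        // A threshold of 128 removes every non-ASCII character.
        System.out.println(keepBelow(input, 128)); // prints " A funo"
    }
}
```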

score 2 · Answer 5 · edited Dec 20 '20 at 04:29

Alternatively, you can use the function below to remove non-ASCII characters from the string. It also shows how the filtering works internally.

private static String removeNonASCIIChar(String str) {
    // StringBuilder is enough here; no synchronization is needed
    StringBuilder buff = new StringBuilder();
    char[] chars = str.toCharArray();

    for (int i = 0; i < chars.length; i++) {
        // ASCII covers code points 0 through 127 inclusive
        if (chars[i] < 128) {
            buff.append(chars[i]);
        }
    }
    return buff.toString();
}

score 2 · Answer 6 · answered Jul 25 '20 at 22:17

[Updated solution]

You can combine Normalizer (canonical decomposition, NFD) with replaceAll to strip the diacritical marks, leaving the base characters:

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public final class NormalizeUtils {

    public static String normalizeASCII(final String string) {
        final String normalize = Normalizer.normalize(string, Form.NFD);

        return Pattern.compile("\\p{InCombiningDiacriticalMarks}+")
                      .matcher(normalize)
                      .replaceAll("");
    } ...
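Since the snippet above is cut off, here is a separate, self-contained sketch of the same idea. It is not the remainder of the original class: the names are hypothetical, and the second method is an added assumption that chains an ASCII filter after the mark removal, because stripping combining marks alone leaves other non-ASCII characters (such as an en dash) in place.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper class, written for illustration only.
final class DiacriticsSketch {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    // Decomposes accented letters and removes only the combining marks.
    static String stripMarks(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // Additionally removes whatever non-ASCII characters remain.
    static String toAscii(String s) {
        return NON_ASCII.matcher(stripMarks(s)).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "fun\u00e7\u00e3o \u2013 ok"; // "função – ok"
        System.out.println(stripMarks(input)); // prints "funcao – ok"
        System.out.println(toAscii(input));    // prints "funcao  ok" (two spaces)
    }
}
```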

M. Justin · Answer 7 · 2021-11-28T05:38:41.270

String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"

or

private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");

public static String matchAndReplaceNonEnglishChar(String tmpsrcdta) {
    return NON_ASCII_PATTERN.matcher(tmpsrcdta).replaceAll("");
}

public static void main(String[] args) {
    System.out.println(matchAndReplaceNonEnglishChar("A função")); // Prints "A funo"
}

Explanation

The method String.replaceAll(String regex, String replacement) replaces all matches of a given regular expression (regex) with a given replacement string. Its Javadoc describes it as follows:

"Replaces each substring of this string that matches the given regular expression with the given replacement."

Java has the "\p{ASCII}" regular expression construct which matches any ASCII character, and its inverse, "\P{ASCII}", which matches any non-ASCII character. The matched characters can then be replaced with the empty string, effectively removing them from the resulting string.

String s = "A função";
String stripped = s.replaceAll("\\P{ASCII}", "");
System.out.println(stripped); // Prints "A funo"

The full list of valid regex constructs is documented in the Pattern class.

Note: If you are going to be calling this pattern multiple times within a run, it will be more efficient to use a compiled Pattern directly, rather than String.replaceAll. This way the pattern is compiled only once and reused, rather than each time replaceAll is called:

import java.util.regex.Pattern;

public class AsciiStripper {
    private static final Pattern NON_ASCII_PATTERN = Pattern.compile("\\P{ASCII}");

    public static String stripNonAscii(String s) {
        return NON_ASCII_PATTERN.matcher(s).replaceAll("");
    }
}

score 1 · Answer 8 · 2020-12-22T05:35:52.113

The ASCII table contains 128 codes, with a total of 95 printable characters, of which only 52 characters are letters:

[0-127] ASCII codes
- [32-126] printable characters
  - [48-57] digits [0-9]
  - [65-90] uppercase letters [A-Z]
  - [97-122] lowercase letters [a-z]

You can use the String.codePoints method to get a stream of the string's code points as int values and filter out the non-ASCII ones:

String str1 = "A função, Ãugent";

String str2 = str1.codePoints()
        .filter(ch -> ch < 128)
        .mapToObj(Character::toString)
        .collect(Collectors.joining());

System.out.println(str2); // A funo, ugent

Or you can explicitly specify character ranges. For example filter out everything except letters:

String str3 = str1.codePoints()
        .filter(ch -> ch >= 'A' && ch <= 'Z'
                || ch >= 'a' && ch <= 'z')
        .mapToObj(Character::toString)
        .collect(Collectors.joining());

System.out.println(str3); // Afunougent

See also: How do I not take Special Characters in my Password Validation (without Regex)?
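In the snippets of this answer, the method reference Character::toString is applied to an int, which relies on Character.toString(int), available from Java 11. A possible variant that also compiles on Java 8 collects the code points into a StringBuilder instead. The class name below is hypothetical.

```java
// Hypothetical helper class for illustration only.
final class CodePointFilterSketch {
    static String stripNonAscii(String s) {
        return s.codePoints()
                .filter(cp -> cp < 128)
                // Append each surviving code point; the third argument only
                // matters for parallel streams, where partial results are merged.
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // "A função 😀!" with the emoji written as a surrogate pair
        String input = "A fun\u00e7\u00e3o \uD83D\uDE00!";
        System.out.println(stripNonAscii(input)); // prints "A funo !"
    }
}
```

Working on code points rather than chars means a character outside the Basic Multilingual Plane is handled as one unit instead of two surrogates.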

score 1 · Answer 9 · answered Mar 11 '22 at 23:37

An easily-readable, ascii-printable, streams solution:

String result = str.chars()
    .filter(c -> isAsciiPrintable((char) c))
    .mapToObj(c -> String.valueOf((char) c))
    .collect(Collectors.joining());

private static boolean isAsciiPrintable(char ch) {
    return ch >= 32 && ch < 127;
}

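To replace non-printable characters with "_" instead of removing them, swap the filter for: .map(c -> isAsciiPrintable((char) c) ? c : '_')

The range 32 to 126 (ch >= 32 && ch < 127) is the set of characters that the regex [^\\x20-\\x7E] leaves in place (from a comment on the regex solution)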

Source for isAsciiPrintable: http://www.java2s.com/Code/Java/Data-Type/ChecksifthestringcontainsonlyASCIIprintablecharacters.htm
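Put together, the removing and the replacing variant of this answer might look like the following sketch. The class name and the two wrapper method names are assumptions made for the example; isAsciiPrintable is the check from the answer.

```java
import java.util.stream.Collectors;

// Hypothetical wrapper class for illustration only.
final class PrintableAsciiSketch {
    static boolean isAsciiPrintable(char ch) {
        return ch >= 32 && ch < 127;
    }

    // Drops every char that is not printable ASCII.
    static String removeNonPrintable(String str) {
        return str.chars()
                .filter(c -> isAsciiPrintable((char) c))
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // Replaces every char that is not printable ASCII with an underscore.
    static String maskNonPrintable(String str) {
        return str.chars()
                .map(c -> isAsciiPrintable((char) c) ? c : '_')
                .mapToObj(c -> String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String input = "A fun\u00e7\u00e3o"; // "A função"
        System.out.println(removeNonPrintable(input)); // prints "A funo"
        System.out.println(maskNonPrintable(input));   // prints "A fun__o"
    }
}
```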

score 0 · Answer 10 · answered Dec 07 '20 at 15:38


CharMatcher.retainFrom can be used, if you're using the Google Guava library:

String s = "A função";
String stripped = CharMatcher.ascii().retainFrom(s);
System.out.println(stripped); // Prints "A funo"


M. Justin

