How to check if a String contains only ASCII?

Question

The call Character.isLetter(c) returns true if the character is a letter. But is there a way to quickly find if a String only contains the base characters of ASCII?

score 137 · Accepted Answer · edited Nov 13 '18 at 18:45

From Guava 19.0 onward, you may use:

boolean isAscii = CharMatcher.ascii().matchesAllOf(someString);

This calls matchesAllOf(someString) on the matcher returned by the factory method ascii(), rather than on the now-deprecated ASCII singleton.

Here ASCII means the full range 0x00 to 0x7F, including the non-printable characters below 0x20 (space), such as tab, line feed and carriage return, as well as BEL (0x07) and DEL (0x7F).

This code works on chars (UTF-16 code units) rather than code points, even though the comments of earlier versions mention code points. Fortunately, a code point of U+10000 or above is represented by two surrogate chars, both of which have values outside the ASCII range. So the method still tests for ASCII correctly, even for strings containing emoji.

For earlier Guava versions without the ascii() method you may write:

boolean isAscii = CharMatcher.ASCII.matchesAllOf(someString);
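
To illustrate the surrogate argument with the JDK alone (no Guava), here is a minimal sketch. The class and method names are hypothetical, and the char loop is only a stand-in for a char-based check of this kind, not Guava's actual implementation. It shows that a char-based test and a code-point-based test give the same answer for a string containing an emoji.

```java
class SurrogateAsciiDemo {

    // Char-based check: looks at each UTF-16 code unit.
    static boolean isAsciiByChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Code-point-based check: looks at each full Unicode code point.
    static boolean isAsciiByCodePoint(String s) {
        return s.codePoints().allMatch(cp -> cp <= 0x7F);
    }

    public static void main(String[] args) {
        // U+1F600 is one code point but two chars (a surrogate pair).
        String emoji = new String(Character.toChars(0x1F600));
        String mixed = "ok " + emoji;

        // Both halves of the pair are in 0xD800..0xDFFF, far above 0x7F.
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(emoji.charAt(1)));  // true

        System.out.println(isAsciiByChar(mixed));      // false
        System.out.println(isAsciiByCodePoint(mixed)); // false
        System.out.println(isAsciiByChar("ok"));       // true
    }
}
```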

edited Nov 13 '18 at 18:45

Maarten Bodewes

answered Aug 27 '10 at 14:22

ColinD

34

+1 Although it's good if you don't need another third-party library, Colin's answer is much shorter and much more readable. Suggesting third-party libraries is perfectly OK and should not be punished with a negative vote. – Jesper Aug 27 '10 at 20:46
1

I should also point out that CharMatchers are really incredibly powerful and can do waaaay more than this. In addition there are many more predefined CharMatchers besides ASCII, and great factory methods for creating custom ones. – ColinD Aug 28 '10 at 02:49
7

`CharMatcher.ASCII` is deprecated now and about to be removed in June 2018. – thisarattr Jun 29 '17 at 00:56

score 129 · Answer 2 · edited May 04 '23 at 01:50

You can do it with java.nio.charset.Charset.

import java.nio.charset.Charset;

public class StringUtils {
  
  public static boolean isPureAscii(String v) {
    return Charset.forName("US-ASCII").newEncoder().canEncode(v);
    // or "ISO-8859-1" for ISO Latin 1
    // or StandardCharsets.US_ASCII with JDK1.7+
  }

  public static void main (String args[])
    throws Exception {

     String test = "Réal";
     System.out.println(test + " isPureAscii() : " + StringUtils.isPureAscii(test));
     test = "Real";
     System.out.println(test + " isPureAscii() : " + StringUtils.isPureAscii(test));
     
     /*
      * output :
      *   Réal isPureAscii() : false
      *   Real isPureAscii() : true
      */
  }
}
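
As the comments on this answer suggest, on Java 7 or later the same check can be written with StandardCharsets. The following is a minimal sketch of that variant; the class name is hypothetical. A new encoder is created on every call because CharsetEncoder instances are not safe for use by multiple concurrent threads.

```java
import java.nio.charset.StandardCharsets;

class AsciiEncoderCheck {

    // Creates a fresh encoder per call; CharsetEncoder is not thread-safe,
    // so the encoder is deliberately not cached in a static field.
    static boolean isPureAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("R\u00e9al")); // false ("Réal")
        System.out.println(isPureAscii("Real"));      // true
    }
}
```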

edited May 04 '23 at 01:50

sideshowbarker

answered Aug 27 '10 at 14:37

RealHowTo

11

I don't think it's a good idea to make the CharsetEncoder static since according to docs "Instances of this class are not safe for use by multiple concurrent threads." – pm_labs Mar 27 '12 at 04:46
@paul_sns, you are right CharsetEncoder is not thread-safe (but Charset is) so it's not a good idea to make it static. – RealHowTo Mar 27 '12 at 11:25
17

With Java 1.7 or greater one can use `StandardCharsets.US_ASCII` instead of `Charset.forName("US-ASCII")`. – Julian Lettner Sep 01 '14 at 08:42
@RealHowTo Correct solutions should not have to rely on comments, care to fix this issue and maybe use a oneliner method based on `StandardCharsets`? I could post another answer but I'd rather fix this highly appreciated answer. – Maarten Bodewes Nov 13 '18 at 18:39

score 84 · Answer 3 · edited Jan 24 '22 at 15:36

Here is another way not depending on a library but using a regex.

You can use this single line:

text.matches("\\A\\p{ASCII}*\\z")

Whole example program:

public class Main {
    public static void main(String[] args) {
        char nonAscii = 0x00FF;
        String asciiText = "Hello";
        String nonAsciiText = "Buy: " + nonAscii;
        System.out.println(asciiText.matches("\\A\\p{ASCII}*\\z"));
        System.out.println(nonAsciiText.matches("\\A\\p{ASCII}*\\z"));
    }
}

Understanding the regex:

\\A : Beginning of input
\\p{ASCII} : Any ASCII character
* : Zero or more repetitions
\\z : End of input
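
If the check runs often, the pattern can be compiled once and reused. The sketch below is only an illustration with hypothetical class, field and method names. It relies on the fact that Matcher.matches() (like String.matches) only succeeds when the whole input matches, so the \\A and \\z anchors can be left out here. The second pattern is an assumption about what "printable" should mean: the range 0x20 to 0x7E.

```java
import java.util.regex.Pattern;

class AsciiRegexCheck {

    // Pattern objects are immutable, so they can be shared between threads.
    private static final Pattern ASCII_ONLY = Pattern.compile("\\p{ASCII}*");

    // Printable ASCII only: space (0x20) up to tilde (0x7E).
    private static final Pattern PRINTABLE_ASCII_ONLY = Pattern.compile("[\\x20-\\x7E]*");

    static boolean isAscii(String s) {
        // matches() requires the entire input to match the pattern.
        return ASCII_ONLY.matcher(s).matches();
    }

    static boolean isPrintableAscii(String s) {
        return PRINTABLE_ASCII_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Hello\n"));          // true
        System.out.println(isAscii("Buy: \u00FF"));      // false
        System.out.println(isPrintableAscii("Hello\n")); // false
        System.out.println(isPrintableAscii("Hello"));   // true
    }
}
```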

edited Jan 24 '22 at 15:36

sud007

answered Aug 27 '10 at 14:40

Arne Deutsch

17

\\A - Beginning of input ... \\p{ASCII}* - Any ASCII character any times ...\\z - End of input – Arne Deutsch Jan 28 '15 at 12:26
@ArneDeutsch Do you mind if I improve the answer and include references to `\P{Print}` and `\P{Graph}` + a description? Why do you need `\A` and `\z`? – Maarten Bodewes Nov 13 '18 at 17:41
What is that regex? I know that $ is end of string, ^ is start, never heard of either of \\A \\p \\z, could you please attach the reference to javadoc? – deathangel908 Feb 01 '19 at 09:41
@deathangel908 \A is start of input. \z is end of input. ^ and $ behave differently in MULTILINE mode, and DOTALL changes behavior of \A and \z. See https://stackoverflow.com/a/3652402/1003157 – Raymond Naseef Feb 22 '20 at 19:53

score 63 · Answer 4 · answered Aug 27 '10 at 15:37

Iterate through the string and make sure all the characters have a value less than 128.

Java Strings are conceptually encoded as UTF-16. In UTF-16, the ASCII character set is encoded as the values 0 - 127, and the encoding of any non-ASCII character (which may consist of more than one Java char) is guaranteed not to include the values 0 - 127.
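
A minimal sketch of that loop might look as follows. The class and method names are hypothetical. The second method is an optional variant, based on the printable range mentioned in the comments (0x20 up to but not including 0x7F), for when control characters and DEL should be rejected as well.

```java
class AsciiLoopCheck {

    // True if every char is in the ASCII range 0..127.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) >= 128) {
                return false; // first non-ASCII char found, stop early
            }
        }
        return true;
    }

    // True if every char is printable ASCII: 0x20 (space) to 0x7E ('~').
    static boolean isPrintableAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 || c >= 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("tab\there"));          // true
        System.out.println(isPrintableAscii("tab\there")); // false
        System.out.println(isAscii("caf\u00e9"));          // false
    }
}
```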

answered Aug 27 '10 at 15:37

JeremyP

32

With Java 1.8 you can do: `str.chars().allMatch(c -> c < 128)` – Julian Lettner Sep 01 '14 at 08:58
9

If you want printable characters you may want to test for `c >= 0x20 && c < 0x7F` as the first 32 values of the 7 bit encoding are control characters and the final value (0x7F) is `DEL`. – Maarten Bodewes Apr 07 '15 at 19:18

score 17 · Answer 5 · answered Dec 28 '12 at 08:14

Or you can copy the code from the IDN class.

// to check if a string only contains US-ASCII code point
//
private static boolean isAllASCII(String input) {
    boolean isASCII = true;
    for (int i = 0; i < input.length(); i++) {
        int c = input.charAt(i);
        if (c > 0x7F) {
            isASCII = false;
            break;
        }
    }
    return isASCII;
}

answered Dec 28 '12 at 08:14

Zarathustra

1

This even works with 2-char-unicode because the 1st-char is >= U+D800 – k3b Apr 11 '17 at 04:42
But note that it includes non-printable characters in ASCII (which is correct, but it may not be expected). It is of course possible to directly use `return false` instead of using `isASCII = false` and `break`. – Maarten Bodewes Nov 13 '18 at 18:42
1

This is code from Oracle JDK. Copying might cause legal issues. – Arne Deutsch Nov 20 '18 at 08:24

score 11 · Answer 6 · edited Jul 13 '15 at 21:06

commons-lang3 from Apache contains valuable utility/convenience methods for all kinds of 'problems', including this one.

System.out.println(StringUtils.isAsciiPrintable("!@£$%^&!@£$%^"));

edited Jul 13 '15 at 21:06

Kukeltje

answered Jul 13 '15 at 18:44

fjkjava

1

Be aware that isAsciiPrintable returns false if the string contains tab or line feed characters (\t \r \n). – TampaHaze Apr 26 '18 at 16:57
@TampaHaze that's because internally, it's checking for every character value to be between 32 and 127. I think that's wrong. We should check from 0 to 127 – therealprashant Jul 17 '19 at 07:15
2

@therealprashant if the method name was isAscii I would agree with you. But the method being named isAsciiPrintable implies that they may have purposely excluded characters 0 to 31. – TampaHaze Aug 01 '19 at 13:26

score 4 · Answer 7 · answered Oct 28 '13 at 22:02

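Try this:

// Returns true if every char in the string is in the ASCII range (0 to 127).
static boolean isAscii(String string) {
  for (char c : string.toCharArray()) {
    if (c > 127) {
      return false;
    }
  }
  return true;
}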

answered Oct 28 '13 at 22:02

pforyogurt

1

"Try this" always gets a downvote. What does this *do*? What is included and what isn't? Would get a downvote because you double the size in memory too, by the way. – Maarten Bodewes Nov 13 '18 at 17:33

score 2 · Answer 8 · answered Sep 26 '16 at 14:13

private static boolean isASCII(String s) 
{
    for (int i = 0; i < s.length(); i++) 
        if (s.charAt(i) > 127) 
            return false;
    return true;
}

answered Sep 26 '16 at 14:13

Phil

1

Code only answer, please indicate what this does, i.e. that it includes non-printable characters and an undefined character (0x7F) if you perform this check. – Maarten Bodewes Nov 13 '18 at 17:31
This one may have bit me after my long-running program failed to find any characters of interest. `charAt` returns a `char`. Can you directly test if a type `char` is greater than an int without converting to an int, first, or does your test automatically do the conversion? Maybe you can and maybe it does? I went ahead and converted this to an int like so: `if ((int)s.charAt(i) > 127)`. Not sure if my results are any different but I feel better about letting it run. We'll see :-\ – harperville Feb 19 '20 at 19:34
This seems to work and was the fastest way for me in a quick series of rather unscientific local micro-benchmarks. The similar approach with "toCharArray" allocates an array and thus performs worse than this one. One further small optimization seems to be to extract the length() into a local variable. – centic Mar 29 '23 at 09:17

score 2 · Answer 9 · answered Jun 12 '19 at 23:36

This returns true if the String contains only ASCII characters and false otherwise:

Charset.forName("US-ASCII").newEncoder().canEncode(str)

If you want to remove the non-ASCII characters, here is a snippet:

if (!Charset.forName("US-ASCII").newEncoder().canEncode(str)) {
    str = str.replaceAll("[^\\p{ASCII}]", "");
}
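
For the removal step on its own, a minimal sketch could look like this. The class and method names are hypothetical. It uses the shorter \\P{ASCII} form mentioned in the comments, and it skips the canEncode guard, since replaceAll leaves a string without non-ASCII characters unchanged anyway.

```java
class NonAsciiStripper {

    // Removes every character that is not in the ASCII range.
    // \P{ASCII} (capital P) is the complement of \p{ASCII}.
    static String stripNonAscii(String s) {
        return s.replaceAll("\\P{ASCII}", "");
    }

    public static void main(String[] args) {
        System.out.println(stripNonAscii("R\u00e9al")); // Ral
        System.out.println(stripNonAscii("Real"));      // Real
    }
}
```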

answered Jun 12 '19 at 23:36

mike oganyan

Vanilla Java, simple to read, what's not to like with this answer? Although, to avoid typos in "US-ASCII": `StandardCharsets.US_ASCII.newEncoder().canEncode(str)` – user2077221 Aug 26 '21 at 23:50
Instead of `[^\\p{ASCII}]`, you can simplify it: `\\P{ASCII}`. Capital \P is complement of lowercase \p. – Ahmet Sep 29 '22 at 14:01

score 2 · Answer 10 · answered May 11 '21 at 20:34

In Java 8 and above, one can use String#codePoints in conjunction with IntStream#allMatch.

boolean allASCII = str.codePoints().allMatch(c -> c < 128);

answered May 11 '21 at 20:34

Unmitigated

score 2 · Answer 11 · answered Oct 22 '21 at 21:21

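In Kotlin (note that this accepts only printable ASCII, from space to '~', so control characters such as tab and newline are rejected):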

fun String.isAsciiString() : Boolean =
    this.toCharArray().none { it < ' ' || it > '~' }

answered Oct 22 '21 at 21:21

steven smith

score 1 · Answer 12 · answered Aug 27 '10 at 14:21

Iterate through the string and use charAt() to get each char. Then treat it as an int and check whether its Unicode value (Unicode is a superset of ASCII) is one you accept.

Break at the first one you don't.
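
One way to sketch this idea in Java 8 or later is to pass the acceptance rule in as a predicate. The class and method names below are hypothetical, and the predicate parameter is an assumption about how "a value you like" might be expressed.

```java
import java.util.function.IntPredicate;

class CharFilterCheck {

    // Returns true if every char of s, treated as an int, is accepted.
    static boolean allCharsMatch(String s, IntPredicate accepted) {
        for (int i = 0; i < s.length(); i++) {
            if (!accepted.test(s.charAt(i))) {
                return false; // stop at the first rejected char
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Full ASCII range.
        System.out.println(allCharsMatch("Hello", c -> c < 128));      // true
        System.out.println(allCharsMatch("H\u00e9llo", c -> c < 128)); // false
        // A stricter rule: uppercase ASCII letters only.
        System.out.println(allCharsMatch("A1", c -> c >= 'A' && c <= 'Z')); // false
    }
}
```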

answered Aug 27 '10 at 14:21

Thorbjørn Ravn Andersen

score 0 · Answer 13 · edited Feb 14 '15 at 23:47

It is possible. Nice problem.

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingTest {

    static CharsetEncoder asciiEncoder = Charset.forName("US-ASCII")
            .newEncoder();

    public static void main(String[] args) {

        String testStr = "¤EÀsÆW°ê»Ú®i¶T¤¤¤ß3¼Ó®i¶TÆU2~~KITEC 3/F Rotunda 2";
        String[] strArr = testStr.split("~~", 2);
        int count = 0;
        boolean encodeFlag = false;

        do {
            encodeFlag = asciiEncoderTest(strArr[count]);
            System.out.println(encodeFlag);
            count++;
        } while (count < strArr.length);
    }

    public static boolean asciiEncoderTest(String test) {
        boolean encodeFlag = false;
        try {
            encodeFlag = asciiEncoder.canEncode(new String(test
                    .getBytes("ISO8859_1"), "BIG5"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return encodeFlag;
    }
}

score -2 · Answer 14 · answered Feb 14 '15 at 23:13

// Returns true if c is an uppercase or lowercase ASCII letter (A-Z or a-z).
public boolean isASCIILetter(char c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

answered Feb 14 '15 at 23:13

Lukas Greblikas

A code only answer with 4 magics, and no explanation what it *does*. Please adjust. – Maarten Bodewes Nov 13 '18 at 17:32
