144

The call Character.isLetter(c) returns true if the character is a letter. But is there a way to quickly find if a String only contains the base characters of ASCII?

isapir
TambourineMan

14 Answers

137

From Guava 19.0 onward, you may use:

boolean isAscii = CharMatcher.ascii().matchesAllOf(someString);

This calls matchesAllOf(someString) on the matcher returned by the factory method ascii(), rather than on the now-deprecated ASCII singleton.

Here ASCII includes all ASCII characters including the non-printable characters lower than 0x20 (space) such as tabs, line-feed / return but also BEL with code 0x07 and DEL with code 0x7F.

This code incorrectly operates on characters (Java chars) rather than code points, even though code points are mentioned in the comments of earlier versions. Fortunately, a code point with a value of U+10000 or higher is represented by two surrogate characters, both of which have values outside the ASCII range. So the method still succeeds in testing for ASCII, even for strings containing emoji.
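The surrogate argument above can be checked with plain JDK calls. The following is only a sketch: the class and method names are made up for illustration and have nothing to do with Guava's implementation. It compares a char-based test with a code-point-based test on a string containing U+1F600.

```java
class SurrogateDemo {
    // Char-based test: every UTF-16 code unit must be below 0x80.
    static boolean isAsciiByChars(String s) {
        return s.chars().allMatch(c -> c < 0x80);
    }

    // Code-point-based test: every Unicode code point must be below 0x80.
    static boolean isAsciiByCodePoints(String s) {
        return s.codePoints().allMatch(cp -> cp < 0x80);
    }

    public static void main(String[] args) {
        // U+1F600 is a single code point stored as two surrogate chars.
        String emoji = new String(Character.toChars(0x1F600));
        for (int i = 0; i < emoji.length(); i++) {
            // Each surrogate lies in the range 0xD800..0xDFFF, far above 0x7F.
            System.out.printf("char %d = U+%04X%n", i, (int) emoji.charAt(i));
        }
        String mixed = "ok " + emoji;
        System.out.println(isAsciiByChars(mixed) + " " + isAsciiByCodePoints(mixed));
    }
}
```

Because neither surrogate can fall in the range 0 to 127, the two tests should always agree.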

For earlier Guava versions without the ascii() method you may write:

boolean isAscii = CharMatcher.ASCII.matchesAllOf(someString);
Maarten Bodewes
ColinD
  • 34
    +1 Although it's good if you don't need another third-party library, Colin's answer is much shorter and much more readable. Suggesting third-party libraries is perfectly OK and should not be punished with a negative vote. – Jesper Aug 27 '10 at 20:46
  • 1
    I should also point out that CharMatchers are really incredibly powerful and can do waaaay more than this. In addition there are many more predefined CharMatchers besides ASCII, and great factory methods for creating custom ones. – ColinD Aug 28 '10 at 02:49
  • 7
    `CharMatcher.ASCII` is deprecated now and about to be removed in June 2018. – thisarattr Jun 29 '17 at 00:56
129

You can do it with java.nio.charset.Charset.

import java.nio.charset.Charset;

public class StringUtils {
  
  public static boolean isPureAscii(String v) {
    return Charset.forName("US-ASCII").newEncoder().canEncode(v);
    // or "ISO-8859-1" for ISO Latin 1
    // or StandardCharsets.US_ASCII with JDK1.7+
  }

  public static void main (String args[])
    throws Exception {

     String test = "Réal";
     System.out.println(test + " isPureAscii() : " + StringUtils.isPureAscii(test));
     test = "Real";
     System.out.println(test + " isPureAscii() : " + StringUtils.isPureAscii(test));
     
     /*
      * output :
      *   Réal isPureAscii() : false
      *   Real isPureAscii() : true
      */
  }
}
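A shorter variant, assuming Java 7 or later, uses the StandardCharsets.US_ASCII constant and creates a fresh encoder on every call, since CharsetEncoder instances are not safe for concurrent use. The class name below is hypothetical; this is a sketch, not a replacement for the code above.

```java
import java.nio.charset.StandardCharsets;

class AsciiEncoderCheck {
    // A new encoder per call avoids sharing a non-thread-safe CharsetEncoder.
    static boolean isPureAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println("R\u00E9al isPureAscii() : " + isPureAscii("R\u00E9al"));
        System.out.println("Real isPureAscii() : " + isPureAscii("Real"));
    }
}
```

Using the constant also removes the risk of a typo in the charset name.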
sideshowbarker
RealHowTo
  • 11
    I don't think it's a good idea to make the CharsetEncoder static since according to docs "Instances of this class are not safe for use by multiple concurrent threads." – pm_labs Mar 27 '12 at 04:46
  • @paul_sns, you are right CharsetEncoder is not thread-safe (but Charset is) so it's not a good idea to make it static. – RealHowTo Mar 27 '12 at 11:25
  • 17
    With Java 1.7 or greater one can use `StandardCharsets.US_ASCII` instead of `Charset.forName("US-ASCII")`. – Julian Lettner Sep 01 '14 at 08:42
  • @RealHowTo Correct solutions should not have to rely on comments, care to fix this issue and maybe use a oneliner method based on `StandardCharsets`? I could post another answer but I'd rather fix this highly appreciated answer. – Maarten Bodewes Nov 13 '18 at 18:39
84

Here is another way not depending on a library but using a regex.

You can use this single line:

text.matches("\\A\\p{ASCII}*\\z")

Whole example program:

public class Main {
    public static void main(String[] args) {
        char nonAscii = 0x00FF;
        String asciiText = "Hello";
        String nonAsciiText = "Buy: " + nonAscii;
        System.out.println(asciiText.matches("\\A\\p{ASCII}*\\z"));
        System.out.println(nonAsciiText.matches("\\A\\p{ASCII}*\\z"));
    }
}

Understanding the regex:

  • \\A : Beginning of input
  • \\p{ASCII} : Any ASCII character
  • * : Zero or more repetitions
  • \\z : End of input
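If the check runs often, the pattern can be compiled once and reused. Both String.matches and Matcher.matches only succeed when the entire input matches, so in the sketch below the \A and \z anchors are left out. The class and field names are illustrative assumptions.

```java
import java.util.regex.Pattern;

class AsciiRegexCheck {
    // Compiled once; Pattern instances are immutable and safe to share.
    private static final Pattern ASCII_ONLY = Pattern.compile("\\p{ASCII}*");

    static boolean isAscii(String s) {
        // matches() requires the whole input to match, so no explicit anchors are needed.
        return ASCII_ONLY.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAscii("Hello"));
        System.out.println(isAscii("Buy: \u00FF"));
    }
}
```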
sud007
Arne Deutsch
  • 17
    \\A - Beginning of input ... \\p{ASCII}* - Any ASCII character any times ...\\z - End of input – Arne Deutsch Jan 28 '15 at 12:26
  • @ArneDeutsch Do you mind if I improve the answer and include references to `\P{Print}` and `\P{Graph}` + a description? Why do you need `\A` and `\z`? – Maarten Bodewes Nov 13 '18 at 17:41
  • What is that regex? I know that $ is end of string, ^ is start, never heard of either of \\A \\p \\z, could you please attach the reference to javadoc? – deathangel908 Feb 01 '19 at 09:41
  • @deathangel908 \A is start of input. \z is end of input. ^ and $ behave differently in MULTILINE mode, and DOTALL changes behavior of \A and \z. See https://stackoverflow.com/a/3652402/1003157 – Raymond Naseef Feb 22 '20 at 19:53
63

Iterate through the string and make sure all the characters have a value less than 128.

Java Strings are conceptually encoded as UTF-16. In UTF-16, the ASCII character set is encoded as the values 0 - 127, and the encoding of any non-ASCII character (which may consist of more than one Java char) is guaranteed not to include the values 0 - 127.
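A minimal sketch of that loop, with a hypothetical class name, could look like this:

```java
class AsciiLoopCheck {
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            // char is an unsigned 16-bit value, so comparing against 127 is enough.
            if (s.charAt(i) > 127) {
                return false; // stop at the first non-ASCII char
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("plain text"));
        System.out.println(isAscii("na\u00EFve"));
    }
}
```

Note that this accepts control characters such as tab and DEL, since they are part of ASCII.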

JeremyP
  • 32
    With Java 1.8 you can do: `str.chars().allMatch(c -> c < 128)` – Julian Lettner Sep 01 '14 at 08:58
  • 9
    If you want printable characters you may want to test for `c >= 0x20 && c < 0x7F` as the first 32 values of the 7 bit encoding are control characters and the final value (0x7F) is `DEL`. – Maarten Bodewes Apr 07 '15 at 19:18
17

Or you copy the code from the IDN class.

// to check if a string only contains US-ASCII code point
//
private static boolean isAllASCII(String input) {
    boolean isASCII = true;
    for (int i = 0; i < input.length(); i++) {
        int c = input.charAt(i);
        if (c > 0x7F) {
            isASCII = false;
            break;
        }
    }
    return isASCII;
}
Zarathustra
  • 1
    This even works with 2-char-unicode because the 1st-char is >= U+D800 – k3b Apr 11 '17 at 04:42
  • But note that it includes non-printable characters in ASCII (which is correct, but it may not be expected). It is of course possible to directly use `return false` instead of using `isASCII = false` and `break`. – Maarten Bodewes Nov 13 '18 at 18:42
  • 1
    This is code from Oracle JDK. Copying might cause legal issues. – Arne Deutsch Nov 20 '18 at 08:24
11

commons-lang3 from Apache contains valuable utility/convenience methods for all kinds of 'problems', including this one.

System.out.println(StringUtils.isAsciiPrintable("!@£$%^&!@£$%^"));
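For readers who do not want the dependency, a rough stand-alone equivalent can be sketched as below. It assumes "printable" means the range 0x20 (space) through 0x7E (tilde), which excludes tab, line feed and DEL; the class name is hypothetical and this is not the library's source.

```java
class PrintableAsciiCheck {
    static boolean isAsciiPrintable(String s) {
        if (s == null) {
            return false; // assumption: treat null as not printable
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Reject control characters (< 0x20), DEL (0x7F) and everything above.
            if (c < 0x20 || c > 0x7E) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // \u00A3 is the pound sign, which is not an ASCII character.
        System.out.println(isAsciiPrintable("!@\u00A3$%^&!@\u00A3$%^"));
        System.out.println(isAsciiPrintable("Hello, world!"));
    }
}
```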
Kukeltje
fjkjava
  • 1
    Be aware that isAsciiPrintable returns false if the string contains tab or line feed characters (\t \r \n). – TampaHaze Apr 26 '18 at 16:57
  • @TampaHaze that's because internally, it's checking for every character value to be between 32 and 127. I think that's wrong. We should check from 0 to 127 – therealprashant Jul 17 '19 at 07:15
  • 2
    @therealprashant if the method name was isAscii I would agree with you. But the method being named isAsciiPrintable implies that they may have purposely excluded characters 0 to 31. – TampaHaze Aug 01 '19 at 13:26
4

try this:

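static boolean isAscii(String string) {
  // Note: toCharArray() copies the string's characters into a new array.
  for (char c : string.toCharArray()) {
    if (c > 127) {
      return false; // found a char outside the ASCII range
    }
  }
  return true;
}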
pforyogurt
  • 1
    "Try this" always gets a downvote. What does this *do*? What is included and what isn't? Would get a downvote because you double the size in memory too, by the way. – Maarten Bodewes Nov 13 '18 at 17:33
2
private static boolean isASCII(String s) 
{
    for (int i = 0; i < s.length(); i++) 
        if (s.charAt(i) > 127) 
            return false;
    return true;
}
Phil
  • 1
    Code only answer, please indicate what this does, i.e. that it includes non-printable characters and an undefined character (0x7F) if you perform this check. – Maarten Bodewes Nov 13 '18 at 17:31
  • This one may have bit me after my long-running program failed to find any characters of interest. `charAt` returns a `char`. Can you directly test if a type `char` is greater than an int without converting to an int, first, or does your test automatically do the conversion? Maybe you can and maybe it does? I went ahead and converted this to an int like so: `if ((int)s.charAt(i) > 127)`. Not sure if my results are any different but I feel better about letting it run. We'll see :-\ – harperville Feb 19 '20 at 19:34
  • This seems to work and was the fastest way for me in a quick series of rather unscientific local micro-benchmarks. The similar approach with "toCharArray" allocates an array and thus performs worse than this one. One further small optimization seems to be to extract the length() into a local variable. – centic Mar 29 '23 at 09:17
2

This returns true if the String contains only ASCII characters and false otherwise:

Charset.forName("US-ASCII").newEncoder().canEncode(str)

If you want to remove non-ASCII characters, here is the snippet:

if (!Charset.forName("US-ASCII").newEncoder().canEncode(str)) {
    str = str.replaceAll("[^\\p{ASCII}]", "");
}
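The removal step can also be sketched on its own. Replacing on a string that is already pure ASCII simply returns the same text, so the canEncode guard appears to be optional. The version below uses the \P{ASCII} complement class with a precompiled pattern; the class and method names are assumptions for illustration.

```java
import java.util.regex.Pattern;

class AsciiStripper {
    // \P{ASCII} (capital P) matches any character that is NOT in the ASCII range.
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripNonAscii("R\u00E9al"));
        System.out.println(stripNonAscii("Real"));
    }
}
```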
mike oganyan
  • Vanilla Java, simple to read, what's not to like with this answer? Although, to avoid typos in "US-ASCII": `StandardCharsets.US_ASCII.newEncoder().canEncode(str)` – user2077221 Aug 26 '21 at 23:50
  • Instead of `[^\\p{ASCII}]`, you can simplify it: `\\P{ASCII}`. Capital \P is complement of lowercase \p. – Ahmet Sep 29 '22 at 14:01
2

In Java 8 and above, one can use String#codePoints in conjunction with IntStream#allMatch.

boolean allASCII = str.codePoints().allMatch(c -> c < 128);
Unmitigated
2

In Kotlin:

fun String.isAsciiString() : Boolean =
    this.toCharArray().none { it < ' ' || it > '~' }
steven smith
1

Iterate through the string, and use charAt() to get each char. Then treat it as an int, and check whether its Unicode value (Unicode is a superset of ASCII) is one you accept.

Break at the first you don't like.

Thorbjørn Ravn Andersen
0

It is possible. Nice problem.

import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingTest {

    static CharsetEncoder asciiEncoder = Charset.forName("US-ASCII")
            .newEncoder();

    public static void main(String[] args) {

        String testStr = "¤EÀsÆW°ê»Ú®i¶T¤¤¤ß3¼Ó®i¶TÆU2~~KITEC 3/F Rotunda 2";
        String[] strArr = testStr.split("~~", 2);
        int count = 0;
        boolean encodeFlag = false;

        do {
            encodeFlag = asciiEncoderTest(strArr[count]);
            System.out.println(encodeFlag);
            count++;
        } while (count < strArr.length);
    }

    public static boolean asciiEncoderTest(String test) {
        boolean encodeFlag = false;
        try {
            encodeFlag = asciiEncoder.canEncode(new String(test
                    .getBytes("ISO8859_1"), "BIG5"));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return encodeFlag;
    }
}
RealHowTo
-2
// Returns true if c is an uppercase or lowercase ASCII letter
public boolean isASCIILetter(char c) {
  return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
Lukas Greblikas