Remove "empty" character from String

Question

I'm using a framework which returns malformed Strings with "empty" characters from time to time.

"foobar" for example is represented by: [,f,o,o,b,a,r]

The first character is NOT a whitespace (' '), so System.out.println() prints "foobar" and not " foobar". Yet the length of the String is 7 instead of 6. Obviously this makes most String methods (equals, split, substring, ...) useless. Is there a way to remove empty characters from a String?

I tried to build a new String like this:

StringBuilder sb = new StringBuilder();
for (final char character : malformedString.toCharArray()) {
  if (Character.isDefined(character)) {
    sb.append(character);
  }
}
sb.toString();

Unfortunately this doesn't work. Same with the following code:

StringBuilder sb = new StringBuilder();
for (final Character character : malformedString.toCharArray()) {
  if (character != null) {
    sb.append(character);
  }
}
sb.toString();

I also can't check for an empty character like this:

   if (character == ''){
     //
   }

Obviously there is something wrong with the String, but I can't change the framework I'm using or wait for them to fix it (if it is a bug within their framework). I need to handle this String and sanitize it.

Any ideas?

What is it then? Try writing out the unicode number of each character (just cast the char to an int). — Thorbjørn Ravn Andersen, Aug 03 '10 at 12:44
Is that a U+FEFF character maybe? Then it might be the byte order mark from a file stored as UTF-* — Joey, Aug 03 '10 at 12:45

score 20 · Answer 1 · answered Aug 03 '10 at 13:13

Regex would be an appropriate way to sanitize the string from unwanted Unicode characters in this case.

String sanitized = dirty.replaceAll("[\uFEFF-\uFFFF]", "");

This replaces every char in the \uFEFF-\uFFFF range with the empty string.

The [...] construct is called a character class, e.g. [aeiou] matches any one of the lowercase vowels, [^aeiou] matches anything but.

You can do one of these two approaches:

replaceAll("[_blacklist]", "")
replaceAll("[^_whitelist]", "")
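As a rough sketch of the two approaches above: the class and method names below are made up for illustration, and the choice of "printable ASCII" as the whitelist is only an assumption about what counts as valid. Adjust the character classes to whatever your data actually needs.

```java
public class RegexSanitizer {

    // Blacklist (hypothetical helper): remove only the characters known to cause trouble,
    // here the byte order mark U+FEFF and the zero-width space U+200B.
    static String dropInvisible(String input) {
        return input.replaceAll("[\\uFEFF\\u200B]", "");
    }

    // Whitelist (hypothetical helper): keep printable ASCII (space through tilde)
    // and remove everything else. This is an assumed definition of "valid".
    static String keepPrintableAscii(String input) {
        return input.replaceAll("[^\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        String dirty = "\uFEFFfoobar";
        System.out.println(dirty.length());                     // 7
        System.out.println(dropInvisible(dirty).length());      // 6
        System.out.println(keepPrintableAscii(dirty).length()); // 6
    }
}
```

Note that the whitelist variant also strips accented letters and any other non-ASCII text, so it only fits input that is expected to be plain ASCII.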

References

regular-expressions.info

score 16 · Accepted Answer · edited May 23 '17 at 12:25

It's probably the NULL character which is represented by \0. You can get rid of it by String#trim().
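A small sketch of what trim() does and does not remove, in case it helps with the diagnosis. String#trim() strips leading and trailing chars whose code is U+0020 or lower, so it handles NUL but leaves a byte order mark in place. The class and helper names are hypothetical.

```java
public class TrimCheck {

    // Hypothetical helper: reports whether trim() removes the given leading character.
    static boolean removedByTrim(char c) {
        return (c + "foobar").trim().equals("foobar");
    }

    public static void main(String[] args) {
        System.out.println(removedByTrim(' '));      // true
        System.out.println(removedByTrim('\0'));     // true: NUL is below U+0020
        System.out.println(removedByTrim('\uFEFF')); // false: the BOM is not touched by trim()
    }
}
```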

To nail down the exact codepoint, print each char as a hex value:

for (char c : string.toCharArray()) {
    System.out.printf("U+%04x ", (int) c);
}

Then you can find the exact character here.

Update: as per the update:

Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?

You can do that with help of regex. See the answer of @polygenelubricants here and this answer.

On the other hand, you can also fix the problem at its root instead of working around it. Either update the files to get rid of the BOM (a legacy way to distinguish UTF-8 files from others, which is nowadays worthless), or use a Reader which recognizes and skips the BOM. Also see this question.
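One possible way to skip the BOM at read time, sketched with only java.io classes. The class and method names are invented for this example, and it assumes the input has already been decoded to chars, so the BOM shows up as a single U+FEFF at the very start.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class BomSkippingReader {

    // Hypothetical helper: wraps the source and consumes a leading U+FEFF if present.
    static BufferedReader skipBom(Reader source) throws IOException {
        BufferedReader reader = new BufferedReader(source);
        reader.mark(1);                   // remember the start of the stream
        if (reader.read() != '\uFEFF') {
            reader.reset();               // first char was not a BOM (or the stream is empty): rewind
        }
        return reader;
    }

    // Convenience wrapper for the demo: returns the first line with any leading BOM skipped.
    static String firstLineWithoutBom(String raw) {
        try (BufferedReader reader = skipBom(new StringReader(raw))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLineWithoutBom("\uFEFFfoobar").length()); // 6
        System.out.println(firstLineWithoutBom("foobar").length());       // 6
    }
}
```

In real code the StringReader would be replaced by whatever Reader the file or HTTP response provides.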

score 7 · Answer 3 · answered Sep 24 '13 at 17:24

A very simple way to remove the UTF-8 BOM from a string, using substring as Denis Tulskiy suggested. No looping needed. Just checks the first character for the mark and skips it if needed.

public static String removeUTF8BOM(String s) {
    if (s.startsWith("\uFEFF")) {
        s = s.substring(1);
    }
    return s;
}

I needed to add this to my code when using the Apache HttpClient EntityUtils to read from a webserver. The webserver was not sending the blank mark but it was getting pulled in while reading the input stream. Original article can be found here.

Thank you for pointing this out, you saved me a lot of time :-) — slodeveloper, Dec 02 '17 at 21:03

score 2 · Answer 4 · answered Aug 03 '10 at 13:09

Thank you Johannes Rössel. It actually was '\uFEFF'

The following code works:

final StringBuilder sb = new StringBuilder();
for (final char character : body.toCharArray()) {
    // Skip the byte order mark (U+FEFF); keep everything else.
    if (character != '\uFEFF') {
        sb.append(character);
    }
}
final String sanitizedString = sb.toString();

Anyone know of a way to just include a range of valid characters instead of excluding 95% of the UTF8 range?

black666

You should then define "valid characters" more precisely. – BalusC Aug 03 '10 at 13:18
this is inefficient, just check if the first character is FEFF and use substring, `String.trim()` will do the rest. – Denis Tulskiy Aug 03 '10 at 16:39

score 1 · Answer 5 · edited Aug 03 '10 at 12:55

trim() removes whitespace from the left and right ends. Does the string have a colon before the space?

Even more: casting the first character to a number, e.g. (int) string.charAt(0), will show you the char code, and then you can use replace() or substring().

edited Aug 03 '10 at 12:55

answered Aug 03 '10 at 12:44

ESP

score 0 · Answer 6 · answered Apr 25 '17 at 10:25

This is what worked for me:-

    StringBuilder sb = new StringBuilder();
    for (char character : myString.toCharArray()) {
        int i = (int) character;
        if (i > 0 && i <= 256) {
            sb.append(character);
        }
    }  
    return sb.toString();

The int value of my NULL characters was in the region of 8103 or something.

score 0 · Answer 7 · answered Aug 04 '21 at 13:52

You can try replace:

s.replace("\u200B", "")

or

s.replace("\uFEFF", "")

Kotlin:

s.filter { it != '\u200B' }

Denis Rybnikov

score -1 · Answer 8 · edited Apr 30 '13 at 20:58

for (int i = 0; i < s.length(); i++)
    if (s.charAt(i) == ' ') {
        your code....
    }

acdcjunior

answered Apr 30 '13 at 20:36

Ilia Altshuler

score -1 · Answer 9 · answered Jan 22 '18 at 11:41

Simply malformedString.trim() will solve the issue.

Lalji Gajera

No, it doesn't: `"\uFEFFTYPE".trim().equals("\uFEFFTYPE")` – Kariem Sep 24 '18 at 10:09

score -3 · Answer 10 · answered Aug 03 '10 at 12:45

You could check for the whitespace like this:

if (character.equals(' ')){ // }

The question already establishes that the character is not a space. – Nick Aug 03 '10 at 12:57
The question does say that it is not whitespace; however, in the three code examples given he is using comparison operators to check for the character, and if I am not mistaken you cannot use comparison operators to check for a certain character because they are checking if you are referencing the same place in memory not the character code. It was just a helpful suggestion / option based on the code provided. – Aug 03 '10 at 13:32
I see where you're coming from - for a Character object, using equals() is the right thing to do. I tend to keep to chars when dealing with characters, and with a char you *can* use == since it's a primitive type. – Nick Aug 04 '10 at 20:36
