36

I have a string that contains the character �, and I haven't been able to replace it correctly.

String.replace("�", "");

doesn't work. Does anyone know how to remove or replace the � in the string?

smci
Thizzer
  • 4
    What is the Unicode code point(s) for what you want to replace? – Kathy Van Stone Sep 28 '09 at 19:30
  • 4
    As per the answer from Gunslinger47, the character that MrThys wants to replace is almost certainly "�", as this has the UTF-8 sequence of 0xEF 0xBF 0xBD, which is the sequence given to us by McDowell – Paul Wagland Sep 28 '09 at 22:52
  • 8
    For anyone who has tripped on this, and does not understand why the characters `"�` are produced during processing, there is a write-up ( **disclaimer:** I wrote it) that explains why it happens, at [this StackOverflow question](http://stackoverflow.com/questions/6366912/reading-file-from-windows-and-linux-yields-different-results-character-encoding/6367675#6367675). – Vineet Reynolds Jun 16 '11 at 06:14

10 Answers

42

That's the Unicode Replacement Character, \uFFFD. (info)

Something like this should work:

String strImport = "For some reason my �double quotes� were lost.";
strImport = strImport.replaceAll("\uFFFD", "\"");
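A self-contained version of the snippet above, as a minimal sketch. The class name `ReplacementCharDemo` and the helpers `substitute` and `stripReplacementChar` are made up for illustration. It uses `String.replace`, which treats its argument as a literal rather than a regular expression, and it returns the result because Java strings are immutable: calling `replace` without using the return value changes nothing.

```java
public class ReplacementCharDemo {
    // U+FFFD, written as a Unicode escape so the encoding of the source file does not matter.
    private static final String REPLACEMENT_CHAR = "\uFFFD";

    // Hypothetical helper: swaps every U+FFFD for the given text.
    static String substitute(String input, String replacement) {
        return input.replace(REPLACEMENT_CHAR, replacement);
    }

    // Hypothetical helper: removes every U+FFFD.
    static String stripReplacementChar(String input) {
        return input.replace(REPLACEMENT_CHAR, "");
    }

    public static void main(String[] args) {
        String strImport = "For some reason my \uFFFDdouble quotes\uFFFD were lost.";
        System.out.println(substitute(strImport, "\""));
        System.out.println(stripReplacementChar(strImport));
    }
}
```

Note that the original characters cannot be recovered at this point; the replacement text is a guess about what was lost.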
Gunslinger47
  • 15
    in this case you know it was the double quotes but technically those missing chars can be some other chars, correct? – Elzo Valugi Jan 04 '11 at 15:44
  • 3
    @Elzo: Yes. Looking at my string, the two characters were likely “ and ” to begin with, but they could have been any number of other things. – Gunslinger47 Jan 04 '11 at 18:59
  • I also fixed this by opening the DB in TextWrangler and doing a find and replace – owen gerig May 14 '12 at 20:01
17

Character issues like this are difficult to diagnose because information is easily lost through misinterpretation of characters via application bugs, misconfiguration, cut'n'paste, etc.

As I (and apparently others) see it, you've pasted three characters:

codepoint   glyph   escaped    windows-1252    info
=======================================================================
U+00ef      ï       \u00ef     ef,             LATIN_1_SUPPLEMENT, LOWERCASE_LETTER
U+00bf      ¿       \u00bf     bf,             LATIN_1_SUPPLEMENT, OTHER_PUNCTUATION
U+00bd      ½       \u00bd     bd,             LATIN_1_SUPPLEMENT, OTHER_NUMBER

To identify the character, download and run the program from this page. Paste your character into the text field and select the glyph mode; paste the report into your question. It'll help people identify the problematic character.
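The three characters in the table are what the UTF-8 bytes of U+FFFD (0xEF 0xBF 0xBD) look like when decoded with a single-byte charset. The sketch below reproduces that; the class and method names are hypothetical. It uses ISO-8859-1 rather than windows-1252 because it is guaranteed to be available on every JVM, and the two charsets agree on these three byte values.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // The UTF-8 encoding of U+FFFD, the Unicode replacement character.
    static final byte[] UTF8_REPLACEMENT = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

    // Decoded correctly: a single U+FFFD character.
    static String decodeAsUtf8() {
        return new String(UTF8_REPLACEMENT, StandardCharsets.UTF_8);
    }

    // Decoded with a single-byte charset: three separate characters (U+00EF, U+00BF, U+00BD).
    static String decodeAsLatin1() {
        return new String(UTF8_REPLACEMENT, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(decodeAsUtf8().length());   // 1
        System.out.println(decodeAsLatin1().length()); // 3
    }
}
```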

McDowell
12

You are asking to replace the character "�", but for me it comes through as three characters: 'ï', '¿' and '½'. This might be your problem... If you are using a Java version prior to 1.5, then you only get the UCS-2 characters, that is, only the first 65,536 Unicode code points (the Basic Multilingual Plane). Based on other comments, the character you are looking for is most likely '�', the Unicode replacement character. This is the character that is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".

Actually, looking at the comment from Kathy, the other issue that you might be having is that javac is not interpreting your .java file as UTF-8, assuming that you are writing it in UTF-8. Try using:

javac -encoding UTF-8 xx.java

Or, modify your source code to do:

str = str.replaceAll("\uFFFD", "");
Paul Wagland
  • For you it might be seen as one character, the rest of us are not so lucky ;-) Please tell us the code point of the character that you are trying to replace. – Paul Wagland Sep 28 '09 at 19:53
6

As others have said, you posted 3 characters instead of one. I suggest you run this little snippet of code to see what's actually in your string:

public static void dumpString(String text)
{
    for (int i=0; i < text.length(); i++)
    {
        System.out.println("U+" + Integer.toString(text.charAt(i), 16) 
                           + " " + text.charAt(i));
    }
}

If you post the results of that, it'll be easier to work out what's going on. (I haven't bothered padding the string - we can do that by inspection...)
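For reference, a padded variant of the dump above might look like the following sketch. The class name `DumpStringDemo` and the method `dump` are hypothetical. It assumes Java 8 or later (`String.codePoints`) and iterates over code points rather than `char` values, so supplementary characters appear as one entry instead of two surrogates. `Character.getName` adds the Unicode name of each code point.

```java
import java.util.stream.Collectors;

public class DumpStringDemo {
    // Returns one line per code point: zero-padded hex value followed by its Unicode name.
    static String dump(String text) {
        return text.codePoints()
                   .mapToObj(cp -> String.format("U+%04X %s", cp, Character.getName(cp)))
                   .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // A string containing the replacement character, written as an escape.
        System.out.println(dump("A\uFFFD"));
    }
}
```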

Jon Skeet
2

Change the encoding to UTF-8 while parsing. This will remove the special characters.
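One reading of this advice is that the charset used for decoding has to match the charset the bytes were written in. The sketch below (hypothetical class and method names, sample text chosen for illustration) shows that decoding ISO-8859-1 bytes as UTF-8 produces U+FFFD, while decoding them with the matching charset does not. If the data really is UTF-8, the same idea applies with the charsets swapped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Decodes raw bytes with an explicitly chosen charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "basé sur" encoded as ISO-8859-1: the accented letter is the single byte 0xE9.
        byte[] latin1Bytes = "bas\u00E9 sur".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong charset: 0xE9 is not valid UTF-8 here, so the decoder substitutes U+FFFD.
        String wrong = decode(latin1Bytes, StandardCharsets.UTF_8);
        // Matching charset: the accented letter survives.
        String right = decode(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(wrong.indexOf('\uFFFD')); // 3
        System.out.println(right.indexOf('\uFFFD')); // -1
    }
}
```

Choosing the right charset prevents the replacement character from appearing in the first place; once it is in the string, the original byte is gone.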

Arjun
0

Use the unicode escape sequence. First you'll have to find the codepoint for the character you seek to replace (let's just say it is ABCD in hex):

str = str.replaceAll("\uABCD", "");
matt b
0

For details, see this example:

import java.io.UnsupportedEncodingException;

/**
 * File: BOM.java
 *
 * Checks whether the BOM is present in the given string, prints the string
 * after skipping the UTF-8 BOM bytes, then prints it decoded as UTF-8 (for a
 * UTF-8 console).
 */

public class BOM
{
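    // The UTF-8 BOM bytes (EF BB BF) as they appear when read as ISO-8859-1, followed by the text.
    private final static String BOM_STRING = "\u00EF\u00BB\u00BFHello World";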
    private final static String ISO_ENCODING = "ISO-8859-1";
    private final static String UTF8_ENCODING = "UTF-8";
    private final static int UTF8_BOM_LENGTH = 3;

    public static void main(String[] args) throws UnsupportedEncodingException {
        final byte[] bytes = BOM_STRING.getBytes(ISO_ENCODING);
        if (isUTF8(bytes)) {
            printSkippedBomString(bytes);
            printUTF8String(bytes);
        }
    }

    private static void printSkippedBomString(final byte[] bytes) throws UnsupportedEncodingException {
        int length = bytes.length - UTF8_BOM_LENGTH;
        byte[] barray = new byte[length];
        System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
        System.out.println(new String(barray, ISO_ENCODING));
    }

    private static void printUTF8String(final byte[] bytes) throws UnsupportedEncodingException {
        System.out.println(new String(bytes, UTF8_ENCODING));
    }

    private static boolean isUTF8(byte[] bytes) {
        // True when the array starts with the UTF-8 BOM bytes EF BB BF.
        return bytes.length >= UTF8_BOM_LENGTH
            && (bytes[0] & 0xFF) == 0xEF
            && (bytes[1] & 0xFF) == 0xBB
            && (bytes[2] & 0xFF) == 0xBF;
    }
}
Zar E Ahmer
  • It looks like this is about removing the BOM ("``") from the start of a string? This question is about "`�`", which is the Unicode Replacement Character. – mwfearnley Jun 29 '22 at 09:13
0

Dissect the URL encoding and the Unicode error. This symbol also showed up for me in Google Translate, in Armenian text and sometimes in broken Burmese.

0

profilage bas� sur l'analyse de l'esprit (french)

should be translated as:

profilage basé sur l'analyse de l'esprit

so, in this case � = é

Just Me
  • 1
    In the general case, `�` could be a stand-in for any non-ASCII character. (When it comes to UTF8->Latin1 Mojibake, I think `é` would be mangled as `Ã©`.) – mwfearnley Jun 29 '22 at 09:22
-3

None of the answers above resolved my issue. When I download the XML, it appends <xml to my XML. I simply do:

xml = parser.getXmlFromUrl(url);

xml = xml.substring(3); // removes the first three characters from the string

Now it runs correctly.
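An unconditional `substring(3)` also cuts real content whenever the prefix is absent. A more defensive sketch, assuming the unwanted prefix is a byte order mark, checks for it first. The class name `BomStripDemo` and the method `stripBom` are hypothetical. It handles both forms a UTF-8 BOM can take in a Java string: the single character U+FEFF when the bytes were decoded as UTF-8, and three characters when they were decoded as ISO-8859-1 or windows-1252.

```java
public class BomStripDemo {
    // The BOM as one character, as seen after correct UTF-8 decoding.
    private static final String BOM = "\uFEFF";
    // The UTF-8 BOM bytes EF BB BF as seen after decoding with a single-byte charset.
    private static final String MISDECODED_UTF8_BOM = "\u00EF\u00BB\u00BF";

    // Removes a leading BOM in either form; returns the input unchanged if none is present.
    static String stripBom(String text) {
        if (text.startsWith(BOM)) {
            return text.substring(BOM.length());
        }
        if (text.startsWith(MISDECODED_UTF8_BOM)) {
            return text.substring(MISDECODED_UTF8_BOM.length());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFF<xml/>"));
        System.out.println(stripBom("<xml/>"));
    }
}
```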

Zar E Ahmer
  • Glad you resolved your own issue, but what you're seeing is a different character - the BOM (byte order mark) - PREpended to your XML. Whatever XML parser you're using is evidently not designed to detect or work with different character encodings. – mwfearnley Jun 29 '22 at 09:10