Java reads a weird character at the beginning of the file which doesn't exist

Question

I have a simple XML file on my hard drive. When I open it with Notepad++ this is what I see:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>
... more stuff here ...
</content>

But when I read it using a FileInputStream I get:

?<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>...

I'm using JAXB to parse XML files, and it throws a "content not allowed in prolog" exception because of that "?" sign.

What is this extra "?" sign? Why is it there, and how do I get rid of it?

possible duplicate of [Byte order mark screws up file reading in Java](http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java) — Edward Thomson, Feb 06 '12 at 15:54
You could also try deleting the first few characters and resaving. — Dave Newton, Feb 06 '12 at 15:56

score 7 · Accepted Answer · answered Feb 06 '12 at 16:02

That extra character is a byte order mark (BOM), a special Unicode character (U+FEFF) that lets the XML parser know the byte order (little endian or big endian) of the bytes in the file.

Normally, your XML parser should be able to understand this. (If it doesn't, I would regard that as a bug in the XML parser.)

As a workaround, make sure that the program that produces this XML leaves off the BOM.
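
If the producer of the file cannot be changed, the BOM can also be dropped on the reading side. The following is a minimal sketch, assuming the file is UTF-8 (as its XML declaration says) and so the BOM is the three bytes EF BB BF; the class and method names are hypothetical and not part of JAXB.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomSkipper {

    /**
     * Wraps the stream so that a leading UTF-8 BOM (bytes EF BB BF) is consumed.
     * Any other leading bytes are pushed back and remain readable.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until 3 bytes or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory sample standing in for the file on disk.
        byte[] xml = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(xml));
        System.out.println((char) clean.read()); // first byte after the BOM
    }
}
```

The returned stream could then be handed to the unmarshaller in place of the raw FileInputStream.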

score 2 · Answer 2 · answered Feb 06 '12 at 15:57

Check the encoding of the file. I've seen a similar thing: the file looked fine when opened in most editors, but it turned out to be encoded as UTF-8 without BOM (or with; I can't recall off the top of my head). Notepad++ should let you switch between the two.

Susam Pal · Answer 3 · 2012-02-06T16:03:49.897

You can use Notepad++ to show all symbols from the View > Show Symbol > Show All Characters menu. It would show you any extra bytes present at the beginning. It is possible that the extra bytes are a byte order mark; if so, this approach will not help. In that case, you will need to download a hex editor or, if you have Cygwin installed, follow the steps in the last paragraph of this response. Once you can see the file as hex codes, look at the first few bytes. Do they match one of the codes listed at http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

If they are indeed a byte order mark, or if you are unable to determine the cause of the error, just try this:

From the menu, select Encoding > Encoding in UTF-8 without BOM, and then save the file.

(On Linux, one can use command-line tools to check what is at the beginning, e.g. xxd -g1 filename | head or od -t cx1 filename | head.)
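
The same check can be done from Java without a hex editor. This is a rough sketch that prints the first bytes of a file in hex and compares them against the common BOM byte sequences; `BomInspector` and its methods are hypothetical names, and the file path is taken from the first command-line argument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

final class BomInspector {

    /** Formats bytes as space-separated hex pairs, e.g. "EF BB BF". */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Names the encoding whose BOM the bytes start with, or "none". */
    static String describeBom(byte[] head) {
        // UTF-32 is checked first because its little-endian BOM begins with the UTF-16LE one.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return "none";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[4];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int r;
            // Read up to four bytes; stop early at end of file.
            while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
                n += r;
            }
        }
        head = Arrays.copyOf(head, n);
        System.out.println(toHex(head) + " -> BOM: " + describeBom(head));
    }
}
```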

score 0 · Answer 4 · answered Apr 08 '13 at 13:46

Besides a FileInputStream, a ByteArrayInputStream also worked for me:

JAXB.unmarshal(new ByteArrayInputStream(string.getBytes("UTF-8")), Delivery.class);

=> No unmarshaling error anymore.
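
When the XML is already held as a String, as `string` is above, a BOM that survived decoding shows up as a leading U+FEFF character. A small sketch, assuming that is the situation, of removing it before any further processing; `BomStrings.stripLeadingBom` is a hypothetical helper, not a JAXB or JDK method.

```java
final class BomStrings {

    /** Returns the text without a leading U+FEFF, or the text unchanged if there is none. */
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        String raw = "\uFEFF<content/>"; // hypothetical sample input
        System.out.println(stripLeadingBom(raw)); // prints "<content/>"
    }
}
```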

user1474357

score 0 · Answer 5 · answered Feb 06 '12 at 15:54

You might have a stray newline at the start of the file. Delete it.

Select View > Show Symbol > Show All Characters in Notepad++ to see what's happening.

adarshr

score 0 · Answer 6 · answered Feb 06 '12 at 16:05

This is not a JAXB problem; the problem lies in the way you read the XML. Try using an InputStream:

...
Unmarshaller u = jaxbContext.createUnmarshaller();
XmlDataObject xmlDataObject = (XmlDataObject) u.unmarshal(new FileInputStream("foo.xml"));
...

A4L

You're right, it works using a FileInputStream. I'm working on a servlet that receives XML documents and reads them in memory without writing them to a file first. So I was reading the data into something temporary and only then passing it on to the XML parser. The XML parser wouldn't accept the "temporary" input stream. – samz Feb 06 '12 at 16:41
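
The difference described in this comment can be illustrated with a small experiment. The sketch below uses the JDK's built-in DOM parser instead of JAXB so that it runs without extra dependencies, and it assumes the "temporary" form was decoded text; the class and method names are hypothetical. With the JDK's default parser I would expect the byte-stream parse to succeed and the character-based parse to fail on the leading U+FEFF.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class BomParseDemo {

    /** Parses raw bytes; the parser does its own encoding detection. */
    static boolean parsesFromBytes(byte[] xml) {
        try {
            newBuilder().parse(new ByteArrayInputStream(xml));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    /** Parses already-decoded text; a BOM here is just another character. */
    static boolean parsesFromChars(String xml) {
        try {
            newBuilder().parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    private static DocumentBuilder newBuilder() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    public static void main(String[] args) {
        String text = "\uFEFF<content/>"; // hypothetical document with a BOM
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8); // starts with EF BB BF
        System.out.println("from bytes: " + parsesFromBytes(bytes));
        System.out.println("from chars: " + parsesFromChars(text));
    }
}
```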
