4

I have a simple XML file on my hard drive. When I open it with Notepad++, this is what I see:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>
... more stuff here ...
</content>

But when I read it using a FileInputStream I get:

?<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<content>...

I'm using JAXB to parse XML files, and it throws a "content not allowed in prolog" exception because of that "?" sign.

What is this extra "?" sign? Why is it there, and how do I get rid of it?

Buhake Sindi
samz

6 Answers

7

That extra character is a byte order mark (BOM), the special Unicode character U+FEFF placed at the start of a file. It lets the XML parser know the byte order (little endian or big endian) of the bytes in the file. In UTF-8 it appears as the three bytes EF BB BF and serves only as an encoding signature.

Normally, your XML parser should be able to understand this. (If it doesn't, I would regard that as a bug in the XML parser.)

As a workaround, make sure that the program that produces this XML leaves off the BOM.
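If the producer cannot be changed, another option is to drop the BOM while reading. The following is a minimal sketch, not the parser's own mechanism: the class name BomSkipper and the method skipUtf8Bom are hypothetical, and it assumes the input is UTF-8 (it does not handle UTF-16 or UTF-32 marks).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if one is present. */
    public static InputStream skipUtf8Bom(InputStream in) throws IOException {
        // A pushback buffer of 3 bytes lets us put the bytes back if they are not a BOM.
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pushback.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream is shorter than 3 bytes
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: restore what was read
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<content/>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[xml.length + 3];
        withBom[0] = (byte) 0xEF;
        withBom[1] = (byte) 0xBB;
        withBom[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, withBom, 3, xml.length);

        InputStream cleaned = skipUtf8Bom(new ByteArrayInputStream(withBom));
        // readAllBytes requires Java 9 or later
        System.out.println(new String(cleaned.readAllBytes(), StandardCharsets.UTF_8)); // prints <content/>
    }
}
```

The returned stream could then be handed to the unmarshaller in place of the raw one.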

Jesper
2

Check the encoding of the file. I've seen a similar thing: the file looked fine when opened in most editors, but it turned out to be encoded as UTF-8 without BOM (or with, I can't recall off the top of my head). Notepad++ should let you switch between the two.

Daniel Morritt
1

You can use Notepad++ to show all symbols via the View > Show Symbols > Show All Characters menu. It would show you any extra characters present at the beginning. The extra bytes may be a byte order mark; if they are, this approach will not help. In that case, you will need to download a hex editor or, if you have Cygwin installed, follow the steps in the last paragraph of this response. Once you can see the file as hex codes, look at the first two or three bytes. Do they match one of the codes listed at http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding

If they are indeed a byte order mark, or if you are unable to determine the cause of the error, just try this:

From the menu, select Encoding > Encoding in UTF-8 without BOM, and then save the file.

(On Linux, one can use command line tools to check what's at the beginning, e.g. xxd -g1 filename | head or od -t cx1 filename | head.)
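The same check can be done from Java itself, which shows the bytes exactly as the program receives them. This is a minimal sketch; the class name HeadDump and the method hexHead are hypothetical, and the file path is taken from the first command line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class HeadDump {

    /** Formats up to max leading bytes of the stream as space-separated two-digit hex values. */
    public static String hexHead(InputStream in, int max) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        for (int i = 0; i < max && (b = in.read()) != -1; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HeadDump <path-to-xml-file>
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(hexHead(in, 16));
        }
    }
}
```

A UTF-8 file with a BOM would start with EF BB BF, followed by 3C 3F 78 6D 6C for "<?xml".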

Susam Pal
0

Besides a FileInputStream, a ByteArrayInputStream also worked for me:

JAXB.unmarshal(new ByteArrayInputStream(string.getBytes("UTF-8")), Delivery.class);

=> No unmarshaling error anymore.
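When the XML is already held in a String, as in the line above, a BOM that survived decoding would show up as the character U+FEFF at index 0. A minimal sketch for removing it before calling getBytes follows; the class name StringBom and the method stripBom are hypothetical, and whether the string actually carries a BOM depends on how it was read.

```java
import java.nio.charset.StandardCharsets;

public class StringBom {

    /** Removes a single leading U+FEFF character, if present; otherwise returns the input unchanged. */
    public static String stripBom(String s) {
        if (s != null && !s.isEmpty() && s.charAt(0) == '\uFEFF') {
            return s.substring(1);
        }
        return s;
    }

    public static void main(String[] args) {
        String raw = "\uFEFF<content/>";
        String clean = stripBom(raw);
        byte[] bytes = clean.getBytes(StandardCharsets.UTF_8);
        // The bytes could now be wrapped in a ByteArrayInputStream for the unmarshaller.
        System.out.println(clean + " (" + bytes.length + " bytes)"); // prints <content/> (10 bytes)
    }
}
```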

0

You might be having a newline. Delete that.

Select View > Show Symbol > Show All Characters in Notepad++ to see what's happening.

adarshr
0

This is not a JAXB problem; the problem lies in the way you read the XML. Try using an InputStream:

...
Unmarshaller u = jaxbContext.createUnmarshaller();
XmlDataObject xmlDataObject = (XmlDataObject) u.unmarshal(new FileInputStream("foo.xml"));
...
A4L
  • You're right, it works using a FileInputStream. I'm working on a servlet that receives XML files and reads them in memory without writing them to a file first. So I was reading the file into something temporary and only then passing it on to the XML parser. The XML parser wouldn't accept the "temporary" input stream. – samz Feb 06 '12 at 16:41