MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence

Question

I have an XML file that contains Arabic characters. When I try to parse the file, it throws MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence. I use POI DOM to parse the document.

The log is:

2012-03-19 11:30:00,433 [ERROR] (com.infomindz.remitglobe.bll.remittance.BlackListBean) - Error 

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence.

    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)

    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)

    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)

    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(Unknown Source)

    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)

    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)

    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)

    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)

    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)

    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)

    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)

    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)

    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)

    at com.infomindz.remitglobe.bll.remittance.BlackListBean.updateGeneralBlackListDetail(Unknown Source)

    at com.infomindz.remitglobe.bll.remittance.schedulers.BlackListUpdateScheduler.executeInternal(Unknown Source)

    at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)

    at org.quartz.core.JobRunShell.run(JobRunShell.java:216)

    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)

The exception occurs only on a Windows machine, not on a Linux machine. How can I resolve this issue? Any suggestions would be appreciated.

In my case there were no Arabic characters, but I did need to include the XML encoding declaration `` – levolutionniste, Apr 29 '21 at 09:35

score 14 · Accepted Answer · answered Apr 03 '12 at 01:28

I resolved the problem by creating the XML file in UTF-8 format.

OutputStreamWriter bufferedWriter = new OutputStreamWriter(
        new FileOutputStream(filePath + System.getProperty("file.separator") + fileName), "UTF8");

After creating the file with the code above, the encoding problem was resolved. Thanks to everyone who put in effort here.
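For readers who want a complete version of this idea, here is a minimal sketch that writes an XML file with an explicit UTF-8 encoding, so the result does not depend on the platform default charset. The class name Utf8XmlWriter, the temp file name, and the sample element are assumptions made for illustration; they are not from the original code.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original application.
class Utf8XmlWriter {

    // Writes an XML declaration followed by the given content,
    // always encoding the characters as UTF-8.
    static void write(Path target, String content) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            writer.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed file name and sample content with Arabic letters.
        Path target = Files.createTempFile("blacklist", ".xml");
        write(target, "<name>\u0645\u062D\u0645\u062F</name>\n");
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Passing the charset explicitly is the point here: a writer created without one uses the JVM default, which may differ between machines.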

Muneeswaran Balasubramanian

This is the solution that worked for me, but I had to make a little change: OutputStream os = new FileOutputStream(file); and OutputStreamWriter bufferedWriter = new OutputStreamWriter(os, "UTF8"); – maxivis Sep 02 '13 at 14:26

score 12 · Answer 2 · answered Jun 19 '15 at 06:49

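You can add the JVM parameter -Dfile.encoding=UTF-8 when starting the JVM. This sets the default charset that Java uses when code reads or writes text without naming an encoding.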
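To see whether this parameter is relevant on a given machine, a small check like the following prints the charset the JVM is actually using. This is only a sketch; the class name DefaultCharsetCheck is made up for illustration. A differing default between the Windows and Linux machines would be consistent with the behaviour described in the question, though the thread does not confirm it.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class, not part of the original application.
class DefaultCharsetCheck {

    // Reports the file.encoding system property and the default charset in effect.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once as `java DefaultCharsetCheck` and once as `java -Dfile.encoding=UTF-8 DefaultCharsetCheck` shows the effect of the flag. On Java 18 and later the default charset is already UTF-8 unless it is overridden.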

Hsin

Raaam · Answer 3 · 2015-08-18T06:35:30.930

Quite simple solution:

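File file = new File("c:\\file-utf.xml");
InputStream inputStream = new FileInputStream(file);
// Decode the bytes explicitly as UTF-8 instead of relying on the platform default.
Reader reader = new InputStreamReader(inputStream, "UTF-8");

InputSource is = new InputSource(reader);
// is.setEncoding("UTF-8"); -> This line causes error! Content is not allowed in prolog

// saxParser (a javax.xml.parsers.SAXParser) and handler (a DefaultHandler) are created elsewhere.
saxParser.parse(is, handler);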

Ref: http://www.mkyong.com/java/sax-error-malformedbytesequenceexception-invalid-byte-1-of-1-byte-utf-8-sequence/
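Since the stack trace in the question goes through javax.xml.parsers.DocumentBuilder rather than a SAX parser, the same idea can be sketched for DOM parsing. The class name Utf8DomParse and the in-memory sample document are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original application.
class Utf8DomParse {

    // Parses the stream as XML, decoding the bytes as UTF-8 through a Reader
    // so the parser receives characters rather than raw bytes.
    static Document parse(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return builder.parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // Sample document held in memory; a real caller would pass a FileInputStream.
        byte[] xml = "<name>\u0645\u062D\u0645\u062F</name>".getBytes(StandardCharsets.UTF_8);
        Document doc = parse(new ByteArrayInputStream(xml));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

One caveat: InputStreamReader replaces malformed bytes with U+FFFD instead of failing, so if the file is not really UTF-8 this approach hides the exception rather than repairing the data.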

score 3 · Answer 4 · answered Mar 29 '12 at 11:25

All we can tell from the message is that the file is not properly encoded in UTF-8. To work out why, you will need to trace the history of how the file was created. It may (or may not) be helpful to study the file contents at the binary level to see what the actual encoding is. For example, it may be useful to know whether the whole file is in the wrong encoding, or whether it just contains a couple of stray characters in the wrong encoding.
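One way to study the file at the binary level is to run its bytes through a strict UTF-8 decoder and note where it first fails. The following is a minimal sketch under that assumption; the class name Utf8Check and the method firstInvalidOffset are hypothetical names chosen for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic class, not part of the original application.
class Utf8Check {

    // Returns the offset of the first byte sequence that is not valid UTF-8,
    // or -1 if the whole array decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // decoder stops at the start of the bad sequence
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // output buffer full: discard decoded chars and continue
        }
    }

    public static void main(String[] args) throws Exception {
        // Pass the path of the XML file to inspect as the first argument.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("First invalid UTF-8 offset: " + firstInvalidOffset(data));
    }
}
```

If the first failure is near the start and the failures are frequent, the whole file is probably in another encoding. A single isolated offset suggests a few stray characters.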

score 0 · Answer 5 · answered Mar 29 '12 at 07:23

I think your parser expects bytes encoded in UTF-8 and receives them in a different encoding. Check the file's encoding.

A possible solution is to convert the file to UTF-8.

If you are on a Unix system, you can use this tool:

iconv -f original_charset -t utf-8 your_file > new_file
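Where iconv is not available, for example on the Windows machine from the question, the same conversion can be sketched in Java. The class name ToUtf8 is made up for this example, and the source charset has to be supplied by the caller because the thread does not say what encoding the file actually uses.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original application.
class ToUtf8 {

    // Reads the source file using the given charset and writes it back out as UTF-8.
    // The whole file is held in memory, which is fine for small documents.
    static void convert(Path source, Charset sourceCharset, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Arguments mirror the iconv call: original_charset your_file new_file
        convert(Paths.get(args[1]), Charset.forName(args[0]), Paths.get(args[2]));
    }
}
```

This only re-encodes the bytes. If the XML declaration names another encoding, it still needs to be changed to UTF-8 by hand.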

score 0 · Answer 6 · answered Mar 29 '12 at 07:29

This is an OS-dependent character at the start of the document. You should open the file in a byte viewer and delete that character from your document. You can also try a tool such as unix2dos to convert control characters.

Alex Stybaev
