21

I have an XML file that contains Arabic characters. When I try to parse the file, it throws MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence. I use POI DOM to parse the document.

The log is:

2012-03-19 11:30:00,433 [ERROR] (com.infomindz.remitglobe.bll.remittance.BlackListBean) - Error
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence.
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.invalidByte(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.io.UTF8Reader.read(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.load(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.skipChar(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.infomindz.remitglobe.bll.remittance.BlackListBean.updateGeneralBlackListDetail(Unknown Source)
    at com.infomindz.remitglobe.bll.remittance.schedulers.BlackListUpdateScheduler.executeInternal(Unknown Source)
    at org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
    at org.quartz.core.JobRunShell.run(JobRunShell.java:216)
    at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)

The exception occurs only on a Windows machine, not on a Linux machine. How can I resolve the issue? Any suggestions would be appreciated.

6 Answers

14

I resolved the problem by creating the XML file with UTF-8 encoding.

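// Wrap a FileOutputStream so the charset is stated explicitly instead of using the platform default
OutputStreamWriter bufferedWriter = new OutputStreamWriter(
        new FileOutputStream(filePath + System.getProperty("file.separator") + fileName), "UTF8");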

After creating the file with the above code, the encoding problem was resolved. Thanks to everyone who put in the effort here.
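To illustrate why writing with an explicit charset helps, here is a minimal sketch that encodes XML containing Arabic text as UTF-8 and parses it back with the JAXP DOM parser. The class name Utf8XmlRoundTrip and its methods are hypothetical, and an in-memory stream stands in for the FileOutputStream used above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the original poster's code.
class Utf8XmlRoundTrip {
    // Encodes the XML text as UTF-8 bytes regardless of the platform default charset.
    static byte[] writeUtf8(String xml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Parses the bytes with the JAXP DOM parser and returns the root element's text.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }
}
```

Because the bytes on disk then match the UTF-8 encoding that the parser assumes, the same file should parse the same way on Windows and Linux.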

  • 3
    This is the solution that worked for me, but I had to make a little change: OutputStream os = new FileOutputStream(file); and OutputStreamWriter bufferedWriter = new OutputStreamWriter(os, "UTF8"); – maxivis Sep 02 '13 at 14:26
12

You can add the JVM parameter -Dfile.encoding=utf-8 when launching your JVM.
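To see whether the parameter has an effect, it may help to print the charset the JVM falls back to when code does not name one. The sketch below uses a hypothetical class name, DefaultCharsetCheck. On Windows the default is often a regional code page, while on Linux it is commonly UTF-8, which would be consistent with a failure that shows up only on Windows.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for illustration only.
class DefaultCharsetCheck {
    // Returns the name of the charset used when no charset is given explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        System.out.println("default charset: " + defaultCharsetName());
    }
}
```

Running it once with and once without -Dfile.encoding=utf-8 shows whether the flag changes the default on your machine.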

Hsin
3

Quite simple solution:

File file = new File("c:\\file-utf.xml");
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");

InputSource is = new InputSource(reader);
// is.setEncoding("UTF-8"); -> This line causes error! Content is not allowed in prolog

saxParser.parse(is, handler);

Ref: http://www.mkyong.com/java/sax-error-malformedbytesequenceexception-invalid-byte-1-of-1-byte-utf-8-sequence/

Raaam
3

All we can tell from the message is that the file is not properly encoded in UTF-8. To work out why, you will need to trace the history of how the file was created. It may (or may not) be helpful to study the file contents at the binary level to see what the actual encoding is. For example, it may be useful to know whether the whole file is in the wrong encoding, or whether it just contains a couple of stray characters in the wrong encoding.
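One way to study the file at the byte level is to run a strict UTF-8 decoder over it and note where it first fails. This is only a sketch, and the class name Utf8Validator is hypothetical. It returns the offset of the first malformed byte sequence, which tells you where in the file to look.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for locating bytes that are not valid UTF-8.
class Utf8Validator {
    // Returns the offset where the first malformed sequence starts, or -1 if all bytes decode.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                return in.position(); // the malformed bytes begin at the buffer's position
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard the decoded chars and keep going
        }
    }
}
```

You could feed it the result of Files.readAllBytes on the XML file. A single failure near one record suggests a few stray characters, whereas a failure at the first non-ASCII character suggests the whole file is in another encoding.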

Michael Kay
0

I think your parser expects bytes encoded in UTF-8 but receives them in a different encoding. Check the file's encoding.

A possible solution may be converting the file to UTF-8.

If you are on a Unix system, you can use iconv:

iconv -f original_charset -t utf-8 your_file > new_file
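Where iconv is not available (for example on the Windows machine in the question), the same conversion can be sketched in Java. The class name Transcoder is hypothetical, and you still need to know the original charset yourself, just as with the -f option.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper mirroring "iconv -f original_charset -t utf-8".
class Transcoder {
    // Decodes the bytes with the charset they were written in, then re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] original, Charset originalCharset) {
        String text = new String(original, originalCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For a file, the input could come from Files.readAllBytes and the result could be written with Files.write. If the XML declaration names the old encoding, it should be updated to UTF-8 as well.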
user219882
0

This is an OS-dependent character at the start of the document. Open the document in a byte viewer and delete that character. You can also try a tool such as unix2dos to convert control characters.
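The answer does not name the character. If it turns out to be a UTF-8 byte order mark (an assumption on my part), it can be removed in code instead of by hand. The class name BomStripper below is hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper; assumes the unwanted leading character is a UTF-8 byte order mark.
class BomStripper {
    // Removes a leading UTF-8 BOM (EF BB BF) if present; otherwise returns the input unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```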

Alex Stybaev