I'm trying to read some XML files from a zip file using java.util.zip.ZipFile. I was hoping to get an InputStream that I could then parse with a SAX parser, but I keep getting SAXExceptions due to faulty prologs, meaning that I'm not getting what I expect out of the InputStream.

What am I missing?

    if (path.endsWith(".zip")) {
        ZipFile file = new ZipFile(path);
        Enumeration<? extends ZipEntry> entries = file.entries();
        while (entries.hasMoreElements()) {
            methodThatHandlesXmlInputStream(file.getInputStream(entries.nextElement()));
        }
    }

    void methodThatHandlesXmlInputStream(InputStream input) {
        doSomethingToTheInput(input);
        tryToParseXMLFromInput(input); // This is where the exception was thrown
    }

Revisited Solution: The problem was that the method that handled the InputStream consumed it and then attempted to read from it again. I've learned that it is better to obtain a separate InputStream from the zip file for each consumer and handle each one separately.

    ZipFile zipFile = new ZipFile(path);
    Enumeration<? extends ZipEntry> entries = zipFile.entries();
    while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        // Each call to getInputStream() returns a fresh stream for the entry
        methodConsumingInput(zipFile.getInputStream(entry));
        anotherMethodConsumingSameInput(zipFile.getInputStream(entry));
    }
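
An alternative to opening the entry twice, offered only as a sketch: read the entry into a byte array once and give each consumer its own ByteArrayInputStream. The class name `EntryBuffer` and the helper `readFully` are hypothetical names for illustration, and the in-memory input in `main` stands in for `zipFile.getInputStream(entry)`. This assumes the entries are small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class EntryBuffer {
    /** Reads the whole stream into memory so its content can be replayed. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for zipFile.getInputStream(entry)
        InputStream entryStream =
                new ByteArrayInputStream("<root/>".getBytes(StandardCharsets.UTF_8));
        byte[] data = readFully(entryStream);

        // Each consumer gets an independent stream positioned at the start
        InputStream first = new ByteArrayInputStream(data);
        InputStream second = new ByteArrayInputStream(data);
        System.out.println(first.read() == second.read());
    }
}
```

On Java 9 and later, `InputStream.readAllBytes()` can replace the hand-written loop.
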
LeedMx
  • *"Sax Exceptions due to faulty prologs"* - Then I would suggest your issue isn't with the unzipping, but the XML file itself. Maybe [SAX Error – Content is not allowed in prolog](https://www.mkyong.com/java/sax-error-content-is-not-allowed-in-prolog/) or [org.xml.sax.SAXParseException: Content is not allowed in prolog](https://stackoverflow.com/questions/5138696/org-xml-sax-saxparseexception-content-is-not-allowed-in-prolog) can help – MadProgrammer Jul 02 '19 at 00:13
  • The XML is valid, I'm sure, I will read on about zipfile... Perhaps `ZipInputFileStream` – LeedMx Jul 02 '19 at 00:20
  • Will provide a minimal verifiable example first thing in the morning, really tired T_T – LeedMx Jul 02 '19 at 00:44
  • If you're getting a stack trace, please add it to your question. – Slaw Jul 02 '19 at 00:48
  • Make up your mind. Your title says the error happens in `ZipEntry.getInputStream()`. However, a `SAXException` cannot possibly do so. Please post the actual stack trace, which will show this clearly. – user207421 Jul 02 '19 at 04:19
  • @user207421 please refrain from downvoting if you are not going to read even the title, which does not say there is an error at `getInputStream()` it says that I'm getting an Input Stream that generates a Sax Exception. I cannot "make up my mind" on deciding where is the source of the problem. – LeedMx Jul 02 '19 at 16:28

1 Answer

> My guess is that getInputStream() returns a stream to the compressed xml file which would be unreadable.

If you are reading an entry that has been compressed by ZIP, that should not happen. The ZipFile classes will take care of the uncompression.

If the compression was done by something else before adding the entry to the ZIP file, then ZipFile won't be aware that it is compressed. You will need to:

  1. Figure out what compression scheme was used.
  2. Uncompress the stream yourself before you attempt to parse it. For example, wrap the result of getInputStream() with an InflaterInputStream, a GZIPInputStream, or similar, depending on the scheme.
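
The two steps above can be sketched for one specific case. This assumes the entry was gzipped before being added to the ZIP file, which is only a guess about the scheme. It checks for the gzip magic bytes (0x1f 0x8b) and wraps the stream in a GZIPInputStream only when they are present. The class name `MaybeGzip` and the method `unwrap` are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

class MaybeGzip {
    /**
     * Returns a stream that gunzips the data if it starts with the
     * gzip magic bytes, and otherwise passes the data through unchanged.
     */
    static InputStream unwrap(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);            // remember the start so the peek can be undone
        int b0 = in.read();
        int b1 = in.read();
        in.reset();            // rewind to the start of the data
        boolean gzip = (b0 == 0x1f && b1 == 0x8b);
        return gzip ? new GZIPInputStream(in) : in;
    }
}
```

A call site would look like `MaybeGzip.unwrap(zipFile.getInputStream(entry))`. Other schemes need their own detection and their own decompressing stream.
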

A third possibility is that the stream is not well-formed XML ... or not XML at all.


Suggestion: Use a ZIP tool to extract the offending ZIP entry to a local file in the file system, then use a utility like the UNIX / Linux file command to figure out what the real file type is. (Don't trust the file suffix. It might be misleading you.)

Stephen C
  • The XML is valid and well formed. The compression was done using the Windows default compressor, which might have something to do with it; I'll try with a file compressed in a different way. I don't rely solely on the extension; this is code from a unit test with controlled and verified input. – LeedMx Jul 02 '19 at 00:00
  • After decompressing, the file seems to be OK: `XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines`... I'm sure there's something I'm missing. – LeedMx Jul 02 '19 at 00:09
  • So did you manually decompress after extracting using the command line ZIP tool? If yes, then you need to modify your Java code to manually decompress the inputstream too. See 2nd paragraph of my answer. – Stephen C Jul 02 '19 at 00:19
  • Yes... After decompressing with Windows, gunzip and 7z the file was OK. I also tried files compressed with each of them, but got the same issue. – LeedMx Jul 02 '19 at 00:23
  • @LeedMx That “(with BOM)” is your problem. See the question MadProgrammer linked to, in the comment above. Short version: BOM is a non-printing character at the very start of your XML file. The first character in your XML file is not, in fact, `<`, but rather `\ufeff`. – VGR Jul 02 '19 at 01:15
  • @VGR I see, will give it a try with a different encoding and let you know... But that will be tomorrow... Need to shut off – LeedMx Jul 02 '19 at 01:18
  • @VGR Same error with a different encoding. While making a minimal verifiable example I was not able to reproduce the exception, which suggests that some of my classes must be mishandling the ZipFile$ZipFileInflaterInputStream – LeedMx Jul 02 '19 at 16:30
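
The BOM raised in the comments turned out not to be the cause here, but it can be ruled out in code. Below is a minimal sketch, assuming UTF-8 input, that drops a leading byte order mark (EF BB BF) before the stream reaches the parser. The class name `BomSkipper` and the method `skipUtf8Bom` are hypothetical. Whether a BOM actually trips a given parser depends on how the input is handed to it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

class BomSkipper {
    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        // read() may return fewer bytes than requested, so loop up to 3 bytes
        while (n < 3) {
            int r = in.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            in.unread(head, 0, n); // not a BOM: push the bytes back untouched
        }
        return in;
    }
}
```
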