java.io.IOException: Error: End-of-File, expected line Issue with PDFBox

Question

I am trying to read the text of a PDF that is opened in the browser.

After clicking the 'Print' button, the URL below opens in a new tab.

https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW

I have run the same program against other web URLs and it works fine. I have used the same code that is used here (Extract PDF text).

I am using the following versions of PDFBox and FontBox.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.9</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.9</version>
</dependency>

Below is the code, which works fine with other URLs:

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {

    boolean flag = false;

    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;

    try {
        URL url = new URL(strURL);
        BufferedInputStream file = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(file);

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);

        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed "+e2.getMessage());
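    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
    } finally {
        // Release the document whether or not parsing succeeded.
        try {
            if (pdDoc != null)
                pdDoc.close();   // also closes the underlying COSDocument
            else if (cosDoc != null)
                cosDoc.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }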

    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");

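    // parsedText stays null when parsing fails, so guard against a NullPointerException.
    if (parsedText != null && parsedText.contains(reqTextInPDF)) {
        flag = true;
    }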

    return flag;
}

Below is the stack trace of the exception that I am getting:

java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)

I have added a screenshot that I took while debugging, showing the URL and File values. Please help me out. Is this something to do with 'https'?

Are you sure that the input file is a PDF created using PDF creation software? It is common for PDFs to be just a converted image, in which case you need an OCR implementation. — Ya Wang, Apr 13 '15 at 18:44
The correct code is PDDocument doc = PDDocument.load() or (better) .loadNonSeq(). I can't tell if that is the cause of the problem. The error message indicates that %PDF is missing. You should verify that url.openStream() really returns a PDF file content. — Tilman Hausherr, Apr 13 '15 at 18:50
@Invexity That is opened as a PDF. I was able to download to local machine and read it. But I was not able to read it. — Dev Raj, Apr 14 '15 at 01:30
@TilmanHausherr Exactly, I get the error at `parser.parse();`. I have now added an image from my debugging session with the details, in case it helps. — Dev Raj, Apr 14 '15 at 02:26
The image also indicates that the stream is empty. To check this, read your https stream into a byte array and see what size is read. Downloading with a browser may not be the same as reading with java. (proxy ?) — Tilman Hausherr, Apr 14 '15 at 06:11
https://stackoverflow.com/questions/34871270/merge-files-gives-error-end-of-file-expected-line - Try this one. — Sudha Velan, Jun 23 '17 at 11:58
Nothing was wrong in my code. I resolved it by finding that the PDFs I was merging were corrupted/unable to open. — Sanket Mehta, Dec 01 '17 at 06:35
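
The check suggested in the comments above (read the stream into a byte array, see how many bytes arrive, and confirm that they start with the PDF header) can be sketched with the standard library alone. This is a minimal sketch, not part of the original thread: the class name `PdfStreamCheck` and its helper methods are hypothetical, and the URL is passed as a command-line argument. It also prints the HTTP status and content type, on the assumption that a login page or an empty body would explain why the browser and Java see different content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; the names are illustrative only.
class PdfStreamCheck {

    /** Reads the whole stream into memory so its size and content can be inspected. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    /** Returns true if the data begins with the "%PDF-" marker. */
    static boolean startsWithPdfHeader(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PdfStreamCheck <http-or-https-url>");
            return;
        }
        // Assumes an http(s) URL, so the connection can be cast to HttpURLConnection.
        HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
        int status = conn.getResponseCode();
        System.out.println("HTTP status:  " + status);
        System.out.println("Content-Type: " + conn.getContentType());

        // For error statuses the body, if any, is on the error stream.
        InputStream raw = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (raw == null) {
            System.out.println("No response body.");
            return;
        }
        try (InputStream in = raw) {
            byte[] data = readAll(in);
            System.out.println("Bytes read:   " + data.length);
            System.out.println("PDF header:   " + startsWithPdfHeader(data));
        }
    }
}
```

If this reports zero bytes, or a body that does not begin with `%PDF-` (for example an HTML login page), then the parser is failing because of what the server returns to Java, not because of the parsing code.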

score 0 · Answer 1 · answered Aug 04 '22 at 08:34

An input stream is like a pipe: once the data has flowed past, it cannot be read again. To read the same data more than once, you can:

1. Save the input stream to a file.

public void useInputStreamTwiceBySaveToDisk(InputStream inputStream) { 
    String desPath = "test001.bin";
    try (BufferedInputStream is = new BufferedInputStream(inputStream);
         BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream(desPath))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            os.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    
    File file = new File(desPath);
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}

2. Read the input stream into a byte array.

public void useInputStreamTwiceSaveToByteArrayOutputStream(InputStream inputStream) { 
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try { 
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) { 
            outputStream.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    // first read InputStream
    InputStream inputStream1 = new ByteArrayInputStream(outputStream.toByteArray());
    printInputStreamData(inputStream1);
    // second read InputStream
    InputStream inputStream2 = new ByteArrayInputStream(outputStream.toByteArray());
    printInputStreamData(inputStream2);
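}

// printInputStreamData is not shown in the answer; this is an assumed minimal
// version that prints the stream content as text.
private void printInputStreamData(InputStream inputStream) {
    StringBuilder sb = new StringBuilder();
    try {
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}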

3. Mark and reset the input stream.

public void useInputStreamTwiceByUseMarkAndReset(InputStream inputStream) { 
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, 10)) { 
        byte[] buffer = new byte[1024];
        // Mark the current position so that reset() can return to it.
        // The read limit is the maximum int value, so the mark stays valid however much is read.
        // available() only reports the bytes readable without blocking, not the stream length.
        bufferedInputStream.mark(Integer.MAX_VALUE);
        int len;
        while ((len = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
        // After the first call, explicitly call the reset method to reset the flow
        bufferedInputStream.reset();
        // Read the second stream
        sb = new StringBuilder();
        int len1;
        while ((len1 = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len1));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}

With any of these approaches, you can read the same data as many times as needed.

This does not solve the issue of the question and is only loosely related to its topic. — mkl, Sep 04 '22 at 06:25
