I am trying to read the text of a PDF that is opened in the browser.

After clicking the 'Print' button, the URL below opens in a new tab.

https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW

I have run the same program against other web URLs and it works fine. I used the same code as the one given here (Extract PDF text).

I am using the following version of PDFBox:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.9</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.9</version>
</dependency>

Below is the code that works fine with other URLs:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {

    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;

    try {
        URL url = new URL(strURL);
        BufferedInputStream file = new BufferedInputStream(url.openStream());
        PDFParser parser = new PDFParser(file);

        parser.parse();
        cosDoc = parser.getDocument();
        PDFTextStripper pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);

        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed " + e2.getMessage());
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
    } finally {
        // Release the document on every path, not only when parsing fails.
        try {
            if (pdDoc != null) {
                pdDoc.close(); // also closes the wrapped COSDocument
            } else if (cosDoc != null) {
                cosDoc.close();
            }
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");

    // parsedText is still null when the download or the parse failed.
    return parsedText != null && parsedText.contains(reqTextInPDF);
}

And below is the stack trace of the exception that I am getting:

java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)

I have added an image that I took while debugging the URL and the file stream. Please help me out. Is this something to do with 'https'?

Dev Raj
  • Are you sure that the input file is a PDF created with PDF creation software? It is common for PDFs to be just a converted image, in which case you need an OCR implementation. – Ya Wang Apr 13 '15 at 18:44
  • The correct code is PDDocument doc = PDDocument.load() or (better) .loadNonSeq(). I can't tell if that is the cause of the problem. The error message indicates that %PDF is missing. You should verify that url.openStream() really returns a PDF file content. – Tilman Hausherr Apr 13 '15 at 18:50
  • @Invexity It is opened as a PDF. I was able to download it to my local machine and read it, but I was not able to read it from the URL. – Dev Raj Apr 14 '15 at 01:30
  • @TilmanHausherr I get the error exactly at `parser.parse();`. I have now updated the question with an image from my debugging session, in case it helps. – Dev Raj Apr 14 '15 at 02:26
  • The image also indicates that the stream is empty. To check this, read your https stream into a byte array and see what size is read. Downloading with a browser may not be the same as reading with java. (proxy ?) – Tilman Hausherr Apr 14 '15 at 06:11
  • @Dev Raj Did you find the solution to your problem? – beterthanlife Dec 04 '15 at 15:50
  • @DevRaj Did you find the solution? – Ayush Mishra May 12 '16 at 12:08
  • @DevRaj Did you find the solution? – Benj Mar 01 '17 at 10:06
  • https://stackoverflow.com/questions/34871270/merge-files-gives-error-end-of-file-expected-line - Try this one. – Sudha Velan Jun 23 '17 at 11:58
  • Nothing was wrong in my code. I resolved it by finding that the PDFs I was merging were corrupted/unable to open. – Sanket Mehta Dec 01 '17 at 06:35
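
The check suggested in the comments above (read the https stream into a byte array, look at its size, and confirm that it really starts with %PDF) could be sketched roughly as follows. This is only a minimal sketch using the plain JDK: the class name PdfStreamCheck and its method names are made up for illustration, and the header test assumes the marker sits at the very first byte, which is stricter than what a lenient parser may accept.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of PDFBox.
public class PdfStreamCheck {

    /** Reads the whole stream into memory so that its size can be inspected. */
    public static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    /** Returns true when the data begins with the "%PDF-" header marker. */
    public static boolean startsWithPdfHeader(byte[] data) {
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < marker.length) {
            return false;
        }
        for (int i = 0; i < marker.length; i++) {
            if (data[i] != marker[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("Usage: PdfStreamCheck <url-of-pdf>");
            return;
        }
        // Same kind of call as in the question: url.openStream().
        try (InputStream in = new URL(args[0]).openStream()) {
            byte[] data = readAll(in);
            System.out.println("Bytes read: " + data.length);
            System.out.println("Starts with %PDF-: " + startsWithPdfHeader(data));
        }
    }
}
```

If this reports zero bytes, or the data does not start with %PDF-, then Java is not receiving the same response as the browser, and the parser error would follow from that rather than from the parsing code.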

1 Answer

An input stream is like a pipe: once the data has flowed past, it cannot be read again. To read the same data more than once, you can:

1. Save the input stream to a file.

public void useInputStreamTwiceBySaveToDisk(InputStream inputStream) { 
    String desPath = "test001.bin";
    try (BufferedInputStream is = new BufferedInputStream(inputStream);
         BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream(desPath))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            os.write(buffer, 0, len);
        }
    } catch (IOException e) { 
        e.printStackTrace();
    }
    
    File file = new File(desPath);
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file))) { 
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}

2. Copy the input stream into a byte array.

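public void useInputStreamTwiceSaveToByteArrayOutputStream(InputStream inputStream) {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try {
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, len);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Copy the bytes once; each ByteArrayInputStream below reads them independently.
    byte[] data = outputStream.toByteArray();
    // first read
    printInputStreamData(new ByteArrayInputStream(data));
    // second read
    printInputStreamData(new ByteArrayInputStream(data));
}

// Assumed helper: the answer calls printInputStreamData without showing it.
private void printInputStreamData(InputStream in) {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}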

3. Use mark and reset on the input stream.

public void useInputStreamTwiceByUseMarkAndReset(InputStream inputStream) { 
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream, 10)) {
        byte[] buffer = new byte[1024];
        // Mark the current position so that the stream can be rewound later.
        // The read limit is the maximum value of an integer, so the mark stays valid
        // however much is read; available() is only an estimate and is not a safe limit.
        bufferedInputStream.mark(Integer.MAX_VALUE);
        int len;
        while ((len = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
        // After the first call, explicitly call the reset method to reset the flow
        bufferedInputStream.reset();
        // Read the second stream
        sb = new StringBuilder();
        int len1;
        while ((len1 = bufferedInputStream.read(buffer)) != -1) { 
            sb.append(new String(buffer, 0, len1));
        }
        System.out.println(sb.toString());
    } catch (IOException e) { 
        e.printStackTrace();
    }
}

With any of these approaches, you can read the data of the same input stream as many times as needed.

sophy
  • This does not solve the issue of the question and is only loosely related to its topic. – mkl Sep 04 '22 at 06:25