How to read PDF contents in selenium

Question

I'm trying to verify the contents of a PDF. I get the URL from an href attribute and pass it to the code below. The URL uses HTTPS, and I'm running into the issue shown below. Can anyone help me read the PDF data? Thanks in advance.

The retrieved URL is https://XXXXXXXXXXXXXXXXX/XXXX/XXXXXXXXXXX?docType=pdf&docid=2229123

        URL PDFUrl = new URL(url);
        BufferedInputStream TestFile = new BufferedInputStream(PDFUrl.openStream());
        PDFParser TestPDF = new PDFParser((RandomAccessRead) TestFile);
        TestPDF.parse();
        String TestText = new PDFTextStripper().getText(TestPDF.getPDDocument());
        System.out.println("Document Text is   "+   TestText);

The error is:

java.net.ConnectException: Connection timed out: connect
    at java.net.DualStackPlainSocketImpl.connect0(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
    at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
    at java.net.PlainSocketImpl.connect(Unknown Source)
    at java.net.SocksSocketImpl.connect(Unknown Source)
    at java.net.Socket.connect(Unknown Source)
    at sun.security.ssl.SSLSocketImpl.connect(Unknown Source)
    at sun.security.ssl.BaseSSLSocketImpl.connect(Unknown Source)
    at sun.net.NetworkClient.doConnect(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.http.HttpClient.openServer(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.<init>(Unknown Source)
    at sun.net.www.protocol.https.HttpsClient.New(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(Unknown Source)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at java.net.URL.openStream(Unknown Source)

I found a similar issue here: https://stackoverflow.com/questions/4784825/how-to-read-pdf-files-using-java Hopefully this will help. — SeleniumTech, Mar 11 '20 at 12:10

score 0 · Answer 1 · answered Mar 11 '20 at 07:06


Are you setting the Accept SSL certs in the desired capabilities of the driver?

DesiredCapabilities dc = DesiredCapabilities.chrome();
dc.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true);
WebDriver driver = new ChromeDriver(dc);

rrgirish


Hi @rrgirish, I'm not using it. – Deepak_Mahalingam Mar 11 '20 at 07:57
Your stack trace shows an SSL exception, so you will need to accept SSL certs in your WebDriver. The code above is for Chrome. This article seems to have the code for other browsers: https://www.guru99.com/ssl-certificate-error-handling-selenium.html – rrgirish Mar 11 '20 at 14:55
Don't know if this is helpful, but my first thought is whether proxy settings are needed (corporate environment?) – Tilman Hausherr Mar 28 '20 at 13:20
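
Following up on the proxy suggestion in the last comment: a connect timeout from `URL.openStream()` may point at the network path rather than at PDF parsing. Below is a minimal sketch, using only the JDK, that opens the connection through an explicit HTTP proxy with connect and read timeouts so the call fails fast. The class name `PdfFetchSketch`, its method names, and whatever proxy host and port you pass in are hypothetical, and whether a proxy is needed at all is an assumption about the OP's environment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper class; not part of Selenium or PDFBox.
class PdfFetchSketch {

    // Builds an HTTP proxy descriptor. Host and port are placeholders
    // that depend on your own network setup.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, new InetSocketAddress(host, port));
    }

    // Downloads the resource at pdfUrl into memory through the given proxy.
    // Pass Proxy.NO_PROXY for a direct connection.
    static byte[] fetch(String pdfUrl, Proxy proxy, int timeoutMillis) throws IOException {
        URLConnection conn = new URL(pdfUrl).openConnection(proxy);
        conn.setConnectTimeout(timeoutMillis); // limit time spent establishing the connection
        conn.setReadTimeout(timeoutMillis);    // limit time spent waiting for data
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        }
    }

    // Copies the whole stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

For example, `PdfFetchSketch.fetch(url, PdfFetchSketch.httpProxy("proxy.example.com", 8080), 15000)` (placeholder host and port) would return the raw bytes, which PDFBox 2.x can load with `PDDocument.load(byte[])`. If the call still times out with the correct proxy, the host is probably not reachable from the test machine at all.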

score 0 · Answer 2 · answered Mar 12 '20 at 06:33

First, download the PDFBox 2.0.13 JAR with all its dependencies and import it. Then read the PDF file from the URL.

public String readPDFInURL(String text) throws EmptyFileException, IOException {
        System.out.println("Enters into READ PDF");
        String output = "";
        URL url = new URL(driver.getCurrentUrl());
        System.out.println("url :  " + url);
        InputStream is = url.openStream();
        BufferedInputStream fileToParse = new BufferedInputStream(is);
        PDDocument document = null;
        try {
            document = PDDocument.load(fileToParse);
            output = new PDFTextStripper().getText(document);
            if (output.contains(text)) {
                System.out.println("Element is matched in PDF is : " + text);
                test.log(LogStatus.INFO, "Element is displayed in PDF " + text);
            } else {
                System.out.println("Element is not  matched in PDF");
                test.log(LogStatus.ERROR, "Element is not displayed in PDF :: " + text);
                throw new AssertionError("Element is not displayed" + text);
            }
        } finally {
            if (document != null) {
                document.close();
            }
            fileToParse.close();
            is.close();
        }
        return output;
    }

This answer is not helpful, the OP has a communication problem (timeout) which you don't attempt to solve. And don't use 2.0.13. The current version is 2.0.19. — Tilman Hausherr, Mar 28 '20 at 13:18

score 0 · Answer 3 · answered Mar 19 '20 at 13:16

You can add the PDFBox JAR as a Maven dependency and then read either a PDF downloaded through Selenium or an existing PDF document.

For example:

  File file = new File("C:/PdfBox_Examples/new.pdf");
  PDDocument document = PDDocument.load(file);

  //Instantiate PDFTextStripper class
  PDFTextStripper pdfStripper = new PDFTextStripper();

  //Retrieving text from PDF document
  String text = pdfStripper.getText(document);
  System.out.println(text);

  //Closing the document
  document.close();

This answer is not helpful, the OP has a communication problem (timeout) which you don't attempt to solve. — Tilman Hausherr, Mar 28 '20 at 13:19
