
I'm trying to read the text of a PDF using Selenium WebDriver and the PDFBox API. If possible, I don't want to download the file; I only want to read the PDF from the web and get its text into a string.

I've found code examples that download the PDF and compare it using the downloaded file, but no working example that extracts the text of the PDF directly from the URL. The code I'm using is below, but I can't make it work:

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import javax.swing.JDialog;
import javax.swing.JOptionPane;
import javax.swing.Timer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class PDFextract {

    public static void main(String[] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("THE URL OF SITE I CANT SHARE"); // THE URL OF SITE I CAN'T SHARE
        System.out.println(driver.getTitle());

        // Collect the href of every "Click to open file" link.
        // The hrefs are concatenated, so this yields a valid URL only when one link matches.
        List<WebElement> links = driver.findElements(By.xpath("//a[@title='Click to open file']"));
        String fLinks = "";
        for (WebElement link : links) {
            fLinks = fLinks + link.getAttribute("href");
        }
        fLinks = fLinks.trim();
        System.out.println(fLinks); // up to here the code works fine; I get a valid URL

        // the code below doesn't work
        URL url = new URL(fLinks);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        InputStream is = connection.getInputStream();
        PDDocument pdd = PDDocument.load(is);
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(pdd);
        pdd.close();
        is.close();
        System.out.println(text);
    }
}

I get the error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: ***AS TOLD ABOVE, I CANT SHARE THE URL***
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at PDFextract.main(PDFextract.java:106)
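
The `IOException` above hides whatever the server wrote in its 500 response. As a diagnostic sketch only (the names `HttpErrorDump`, `readAll`, `describeFailure`, and `check` are made up for illustration and are not part of the code above), the response body can be read from `getErrorStream()` instead of letting `getInputStream()` throw:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class HttpErrorDump {

    // Reads a stream fully into a String; a null stream yields "".
    static String readAll(InputStream in) throws IOException {
        if (in == null) {
            return "";
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Formats the status code and the error body into one message.
    static String describeFailure(int status, InputStream errorStream) throws IOException {
        return "HTTP " + status + ": " + readAll(errorStream);
    }

    // Returns null on a 2xx response, otherwise a description of the failure.
    static String check(String link) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(link).openConnection();
        try {
            int status = connection.getResponseCode();
            if (status >= 200 && status < 300) {
                return null;
            }
            // getErrorStream() may return null; try-with-resources tolerates that.
            try (InputStream err = connection.getErrorStream()) {
                return describeFailure(status, err);
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

Printing `HttpErrorDump.check(fLinks)` before calling `PDDocument.load` may show an error page that hints at what the request is missing, though that depends on how the server reports failures.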

Edited on 07.05.2020: @TilmanHausherr, I've done more research. This helped with the first part, how to read a PDF from a link: Selenium Tutorial: Read PDF Content using Selenium WebDriver

This method works:

String pdfContent = readPDFContent(driver.getCurrentUrl());

// requires: import java.io.BufferedInputStream;
public String readPDFContent(String appUrl) throws Exception {
    URL url = new URL(appUrl);
    InputStream is = url.openStream();
    BufferedInputStream fileToParse = new BufferedInputStream(is);
    PDDocument document = null;
    String output = null;
    try {
        document = PDDocument.load(fileToParse);
        output = new PDFTextStripper().getText(document);
        System.out.println(output);
    } finally {
        if (document != null) {
            document.close();
        }
        fileToParse.close();
        is.close();
    }
    return output;
}

It seems my problem is the link itself: the HTML element is an '<embed>', and in my case there is also a 'stream-URL':

<embed id="plugin" type="application/x-google-chrome-pdf"
src="https://"SITE I CAN'T TELL"/file.do?_tr=4d51599fead209bc4ef42c6e5c4839c9bebc2fc46addb11a"
stream-URL="chrome-extension://mhjfbmdgcfjojefgiehjai/6958a80-4342-43fc-838a-1dbd07fa2fc1"
headers="accept-ranges: bytes
content-disposition: inline;filename=&quot;online.pdf&quot;
content-length: 71488
content-security-policy: frame-ancestors 'self' https://*"SITE I CAN'T TELL" https://*"DOMAIN I CAN'T TELL".net
content-type: application/pdf

Found this:
1. Download the File which has stream-url is the chrome extension in the embed tag using selenium
2. Handling contents of Embed tag in selenium python

But I still haven't managed to read the PDF with PDFBox, because the element is an '<embed>' and I might have to access the stream-URL.
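
Since Chrome's viewer sits between the page and the file, it may be worth confirming that the `src` URL really returns PDF bytes, and not an HTML wrapper or a login page, before handing them to PDFBox. The sketch below checks for the `%PDF-` signature that PDF files start with. `PdfSniffer` and `looksLikePdf` are hypothetical names, and the check only inspects the first five bytes, so it is a sanity test and not a validation of the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class PdfSniffer {

    private static final byte[] SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True when the data starts with the "%PDF-" signature.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (data[i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}
```

If the downloaded bytes fail this check, printing them as text will usually show what the server sent instead of the PDF.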

Alex
  • This is not a PDFBox problem, this is your http server returning error 500. Check your server logs. – Tilman Hausherr May 03 '20 at 17:50
  • If you enter the same URL into your browser, does it also bring error 500? If not, then it's likely you need to set the user agent. – Tilman Hausherr May 04 '20 at 03:43
  • If I enter the same URL I get no error at all. Can you explain how to set the user agent with Selenium Java for Chrome? – Alex May 04 '20 at 10:15
  • I've tried: "DesiredCapabilities handlSSLErr = DesiredCapabilities.chrome(); handlSSLErr.setCapability(CapabilityType.ACCEPT_SSL_CERTS, true); driver = new ChromeDriver(handlSSLErr);" AND "ChromeOptions options = new ChromeOptions(); options.addArguments("--user-agent=Mozilla/5.0 (Linux; Android 6.0; HTC One M9 Build/MRA58K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.98 Mobile Safari/537.36"); options.addArguments("--start-maximized"); WebDriver driver = new ChromeDriver(options);" But neither of the above worked. – Alex May 04 '20 at 11:17
  • https://stackoverflow.com/questions/2529682/ – Tilman Hausherr May 04 '20 at 11:49
  • I tried to fix it with the code: `System.setProperty("http.agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36");` but it didn't work; I still get the same error as before. By the way, it's an "https:" domain. @TilmanHausherr – Alex May 04 '20 at 16:02
  • Then you should really check the server logs to see what comes there, and what is the real reason. Try also setting the property from the command line. (see the linked answer why) – Tilman Hausherr May 04 '20 at 17:51
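
The attempts in these comments target two different HTTP clients: `ChromeOptions` changes what Chrome sends, while `HttpURLConnection` builds its own request and shares neither Chrome's headers nor its cookies. The sketch below sets both per request. It assumes, without confirmation from this thread, that the server also expects the browser's session cookies; `BrowserLikeRequest`, `cookieHeader`, and `open` are illustrative names.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only.
final class BrowserLikeRequest {

    // Joins cookies into a Cookie header value such as "a=1; b=2".
    static String cookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Prepares a connection carrying the given User-Agent and cookies.
    // Nothing is sent until the caller reads the response.
    static HttpURLConnection open(String link, String userAgent,
                                  Map<String, String> cookies) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(link).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        if (!cookies.isEmpty()) {
            connection.setRequestProperty("Cookie", cookieHeader(cookies));
        }
        return connection;
    }
}
```

With Selenium, the cookie map could be filled from `driver.manage().getCookies()` using each cookie's `getName()` and `getValue()`, and the stream from `open(...).getInputStream()` passed to `PDDocument.load`. Whether that removes the 500 depends on what the server actually checks.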

0 Answers