I'm trying to read the text of a PDF using Selenium WebDriver and the PDFBox API. If possible, I don't want to download the file; I only want to read the PDF from the web and get its text into a string.
I've found examples that download the PDF and compare against the downloaded file, but no working example that extracts the text of a PDF directly from its URL. The code I'm using is below, but I can't get it to work:
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import javax.swing.JDialog;
import javax.swing.JOptionPane;
import javax.swing.Timer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
public class PDFextract {
    public static void main(String[] args) throws Exception {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("THE URL OF SITE I CANT SHARE"); // THE URL OF SITE I CAN'T SHARE
        System.out.println(driver.getTitle());

        // Concatenate the href of every matching link
        // (this yields a usable URL only when exactly one link matches).
        List<WebElement> links = driver.findElements(By.xpath("//a[@title='Click to open file']"));
        String fLinks = "";
        for (WebElement link : links) {
            fLinks = fLinks + link.getAttribute("href");
        }
        fLinks = fLinks.trim();
        System.out.println(fLinks); // up to here the code works fine; I get a valid URL

        // the code below doesn't work
        URL url = new URL(fLinks);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        InputStream is = connection.getInputStream();
        PDDocument pdd = PDDocument.load(is);
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(pdd);
        pdd.close();
        is.close();
        System.out.println(text);
    }
}
I get the error:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: ***AS TOLD ABOVE, I CANT SHARE THE URL***
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at PDFextract.main(PDFextract.java:106)
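The exception above only reports the status code. A minimal sketch of how the server's error body could be inspected, using only JDK classes: `getResponseCode()` returns the status without throwing on a 500, and `getErrorStream()` exposes whatever the server sent back (it may be null). The class name `ErrorBodySketch` and the helpers `readAll` and `describe` are hypothetical names chosen for illustration, and the URL is expected as a command-line argument.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class ErrorBodySketch {

    // Reads a stream fully into a UTF-8 string; returns "" when the stream is null.
    static String readAll(InputStream in) {
        if (in == null) {
            return "";
        }
        try (InputStream stream = in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stream.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the HTTP status plus the error body (if any) for the given address.
    static String describe(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        int status = connection.getResponseCode(); // does not throw on 4xx/5xx
        if (status >= 400) {
            // getErrorStream() may be null when the server sent no body
            return status + ": " + readAll(connection.getErrorStream());
        }
        return status + ": OK";
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ErrorBodySketch <url>");
            return;
        }
        System.out.println(describe(args[0]));
    }
}
```

If the body turns out to be an HTML login or error page, that would suggest the request is being rejected for a reason unrelated to PDFBox.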
Edit (07.05.2020): @TilmanHausherr, I've done more research, and this tutorial helped with the first part, how to read a PDF from a link: Selenium Tutorial: Read PDF Content using Selenium WebDriver
This method works:
String pdfContent = readPDFContent(driver.getCurrentUrl());
public String readPDFContent(String appUrl) throws Exception {
URL url = new URL(appUrl);
InputStream is = url.openStream();
BufferedInputStream fileToParse = new BufferedInputStream(is);
PDDocument document = null;
String output = null;
try {
document = PDDocument.load(fileToParse);
output = new PDFTextStripper().getText(document);
System.out.println(output);
} finally {
if (document != null) {
document.close();
}
fileToParse.close();
is.close();
}
return output;
}
It seems my problem is the link itself: the HTML element is an '<embed>', and in my case there is also a 'stream-URL':
<embed id="plugin" type="application/x-google-chrome-pdf"
src="https://"SITE I CAN'T TELL"/file.do? _tr=4d51599fead209bc4ef42c6e5c4839c9bebc2fc46addb11a"
stream-URL="chrome-extension://mhjfbmdgcfjojefgiehjai/6958a80-4342-43fc-838a-1dbd07fa2fc1"
headers="accept-ranges: bytes
content-disposition: inline;filename="online.pdf"
content-length: 71488
content-security-policy: frame-ancestors 'self' https://*"SITE I CAN'T TELL" https://*"DOMAIN I CAN'T TELL".net
content-type: application/pdf
I found these related questions:
1. Download the File which has stream-url is the chrome extension in the embed tag using selenium
2. Handling contents of Embed tag in selenium python
But I still haven't managed to read the PDF with PDFBox, because the element is an '<embed>' and I might have to access the stream-URL.
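One thing that might be worth trying, as an assumption rather than a confirmed cause: the `chrome-extension://` stream-URL is internal to Chrome's PDF viewer, so the `src` attribute of the '<embed>' is likely the address to fetch, and the server may answer 500 because a plain `HttpURLConnection` does not carry the browser's session cookies. The sketch below uses only JDK classes and shows how cookies could be forwarded. The class name `CookieForwardingSketch`, the helpers `buildCookieHeader` and `openWithCookies`, and the cookie values are hypothetical. With Selenium, the map would be filled from `driver.manage().getCookies()` using each cookie's `getName()` and `getValue()`, and the URL taken from `driver.findElement(By.id("plugin")).getAttribute("src")`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CookieForwardingSketch {

    // Joins name/value pairs into a single Cookie request header value,
    // e.g. "a=1; b=2". Insertion order of the map is preserved.
    static String buildCookieHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Opens the PDF URL with the given cookies attached to the request.
    // The returned stream could then be passed to PDDocument.load(...).
    static InputStream openWithCookies(String pdfUrl, Map<String, String> cookies) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(pdfUrl).openConnection();
        if (!cookies.isEmpty()) {
            connection.setRequestProperty("Cookie", buildCookieHeader(cookies));
        }
        return connection.getInputStream();
    }

    public static void main(String[] args) {
        // Hypothetical cookie values, standing in for the browser session's cookies.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("JSESSIONID", "abc123");
        cookies.put("lang", "en");
        System.out.println(buildCookieHeader(cookies)); // prints "JSESSIONID=abc123; lang=en"
    }
}
```

Whether this resolves the 500 depends on how the site authenticates the `file.do` request; if the `_tr` token is single-use, the request may still fail even with the cookies attached.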