I used Apache Tika for this: read the input stream, parse it, and get the content.
Alternatively, you can:
- capture the response content
- store it in a temp file
- read the file using PDFBox
- read each page individually and process its content
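The first two steps of that alternative could look roughly like the sketch below. It only covers buffering the stream into a temp file and cleaning up afterwards; the class name TempPdfBuffer, the FileProcessor callback, and withTempFile are made-up names for illustration, and the bytes in main are a stand-in for a real PDF body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempPdfBuffer {

    /** Hypothetical callback: whatever reads the buffered file (for example PDFBox code). */
    @FunctionalInterface
    public interface FileProcessor<T> {
        T process(Path file) throws IOException;
    }

    /**
     * Copies the stream into a temp file, runs the processor on that file,
     * and deletes the file afterwards, even if the processor throws.
     */
    public static <T> T withTempFile(InputStream in, FileProcessor<T> processor) throws IOException {
        Path tempFile = Files.createTempFile("download-", ".pdf");
        try {
            // createTempFile already created an empty file, so overwrite it.
            Files.copy(in, tempFile, StandardCopyOption.REPLACE_EXISTING);
            return processor.process(tempFile);
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the endpoint this would be exchange.getBody().getInputStream().
        byte[] fakeBody = "%PDF-1.4 stand-in".getBytes(StandardCharsets.US_ASCII);
        Long size = withTempFile(new ByteArrayInputStream(fakeBody), Files::size);
        System.out.println("Buffered " + size + " bytes");
    }
}
```

With PDFBox 2.x on the classpath, the callback is where the last two steps would go: load the file with PDDocument.load(file.toFile()), then loop from 1 to getNumberOfPages() and call setStartPage(i) and setEndPage(i) on a PDFTextStripper before getText(document), so each page's text comes back separately. I have not included that part here because it depends on which PDFBox version you use (3.x loads documents through Loader.loadPDF instead).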
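Here is the Tika version:
    // PDFParser here is org.apache.tika.parser.pdf.PDFParser, not the PDFBox class of the same name.
    @GetMapping("/pdf-test")
    public String pdfTest() throws IOException, TikaException, SAXException {
        final HttpHeaders headers = new HttpHeaders();
        headers.set("User-Agent", "stack-overflow-server");
        final HttpEntity<String> entity = new HttpEntity<>(headers);
        // Two public sample PDFs; only testPdf2 is fetched below.
        final String testPdf1 = "http://www.africau.edu/images/default/sample.pdf";
        final String testPdf2 = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf";
        // Fetch the PDF as a Resource so the body can be read as a stream.
        ResponseEntity<Resource> exchange = restTemplate.exchange(testPdf2, HttpMethod.GET, entity, Resource.class);
        // try-with-resources closes the stream once parsing is done.
        try (InputStream pdfInputStream = exchange.getBody().getInputStream()) {
            PDFParser pdfParser = new PDFParser();
            // The no-arg handler stops at 100,000 characters; use new BodyContentHandler(-1) to lift the limit.
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext pcontext = new ParseContext();
            pdfParser.parse(pdfInputStream, handler, metadata, pcontext);
            // The handler now holds the extracted plain text.
            return handler.toString();
        }
    }
} // closes the enclosing controller class (declaration and restTemplate field not shown)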
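The Gradle dependencies are as follows (on Gradle 7 and later, replace compile with implementation, since the compile configuration was removed):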
compile group: 'org.apache.tika', name: 'tika-core', version: '1.24.1'
compile group: 'org.apache.tika', name: 'tika-parsers', version: '1.24.1'