How to extract text content from content-type "application/pdf" in Java

Question

I'm calling a REST API using Java's HTTP client; the response's content type is application/pdf. I captured the response as a String, but its contents look something like "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 ".

How do I convert this response into the text from which I can scrape my required data?

https://stackoverflow.com/questions/4784825/how-to-read-pdf-files-using-java — Conffusion, Oct 20 '20 at 11:41
I'm more interested in creating a .pdf file from the above-mentioned response. How can I create a .pdf file so that I can use PDFBox on it later? I'm not able to create a valid .pdf file from "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 " — Rohan pawar, Oct 20 '20 at 11:51
Load the response content as an InputStream and pass it to PDFBox to instantiate a PDF document. — Conffusion, Oct 20 '20 at 11:52
I'm still getting issues when using PDFBox on this PDF. Can you please reproduce some "hello world" PDF data in a format like "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 " and then use PDFBox to fetch the content from it? I tried saving these contents to a .pdf file, but no PDF reader can open it. — Rohan pawar, Oct 20 '20 at 12:40

score 0 · Answer 1 · answered Oct 20 '20 at 12:45

I have used Apache Tika for this: read the response as an input stream, parse it, and get the text content (code below). Alternatively, you can:

capture the response content (as bytes rather than a String)
store it in a temp file
read the file using PDFBox
read each page individually and process its content

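    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.springframework.core.io.Resource;
    import org.springframework.http.*;
    import org.springframework.web.bind.annotation.*;
    import org.springframework.web.client.RestTemplate;
    import org.xml.sax.SAXException;

    @RestController
    public class PdfTestController { // placeholder name; the original snippet omitted the enclosing class
        private final RestTemplate restTemplate = new RestTemplate(); // assumed; could also be injected

        @GetMapping("/pdf-test")
        public String pdfTest() throws IOException, TikaException, SAXException {
            final HttpHeaders headers = new HttpHeaders();
            headers.set("User-Agent", "stack-overflow-server");
            final HttpEntity<String> entity = new HttpEntity<>(headers);
            final String testPdf1 = "http://www.africau.edu/images/default/sample.pdf"; // alternative sample, unused
            final String testPdf2 = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf";
            // Request the body as a Resource so the PDF bytes are never decoded into a String.
            ResponseEntity<Resource> exchange = restTemplate.exchange(testPdf2, HttpMethod.GET, entity, Resource.class);
            try (InputStream pdfInputStream = exchange.getBody().getInputStream()) {
                BodyContentHandler handler = new BodyContentHandler(); // collects the extracted plain text
                new PDFParser().parse(pdfInputStream, handler, new Metadata(), new ParseContext());
                return handler.toString();
            }
        }
    }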

Dependencies are as below:

compile group: 'org.apache.tika', name: 'tika-core', version: '1.24.1'
compile group: 'org.apache.tika', name: 'tika-parsers', version: '1.24.1'
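
A note on the first of the alternative steps listed in this answer: if the body is captured as a String, the binary parts of the PDF are decoded as text and usually cannot be recovered afterwards, which would explain a saved file that no reader can open. Below is a minimal sketch of that effect using only the JDK. The class name `StringRoundTrip` and the sample bytes are made up for illustration, and UTF-8 is assumed as the charset used for decoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringRoundTrip {

    /** Decodes bytes as UTF-8 text and encodes the text back to bytes,
     *  which is what effectively happens when a binary body is read into a String. */
    public static byte[] viaUtf8String(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII header "%PDF-1.5\n" followed by four high-bit bytes of the kind
        // found in real PDF files (hypothetical sample data, not a valid PDF).
        byte[] raw = {
            '%', 'P', 'D', 'F', '-', '1', '.', '5', '\n',
            (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3
        };
        byte[] roundTripped = viaUtf8String(raw);

        System.out.println("original length:      " + raw.length);
        System.out.println("round-tripped length: " + roundTripped.length);
        System.out.println("identical:            " + Arrays.equals(raw, roundTripped));
    }
}
```

The high-bit bytes are not valid UTF-8, so the decoder substitutes replacement characters and the re-encoded array no longer matches the original. Keeping the body as `byte[]` or an `InputStream` from the start avoids the problem.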

Can you please reproduce some "hello world" PDF data in a format like "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 " and then try to fetch the content from it? I tried saving these contents to a .pdf and a .txt file, but I can't parse the result or open it in any PDF reader. — Rohan pawar, Oct 21 '20 at 05:24

score 0 · Answer 2 · answered Nov 03 '20 at 08:57

We can capture the PDF data in a byte array, write this data to a .pdf file, and then use the PDFBox APIs to get the required data from the PDF.
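
A minimal sketch of the first two steps, assuming Java 11+ and the built-in `java.net.http.HttpClient` (the question does not say which client is in use). The class and method names are hypothetical, and the URL is taken from the command line rather than from the original post.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PdfDownload {

    /** Fetches the response body as raw bytes; no charset decoding takes place. */
    public static byte[] fetchBytes(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        return response.body();
    }

    /** Returns true if the data starts with the "%PDF-" marker that opens a PDF file. */
    public static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Writes the bytes unchanged to a temporary .pdf file and returns its path. */
    public static Path saveAsTempPdf(byte[] data) throws IOException {
        Path file = Files.createTempFile("response-", ".pdf");
        Files.write(file, data);
        return file;
    }

    public static void main(String[] args) throws Exception {
        byte[] body = fetchBytes(args[0]); // args[0]: URL of the PDF endpoint
        if (!looksLikePdf(body)) {
            throw new IllegalStateException("Response does not look like a PDF");
        }
        System.out.println("Saved to " + saveAsTempPdf(body));
    }
}
```

The saved file can then be handed to PDFBox. In PDFBox 2.x that would be `PDDocument.load(file.toFile())` followed by `new PDFTextStripper().getText(document)`; PDFBox 3.x loads documents through a different entry point.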

Rohan pawar
