
I'm calling a REST API using Java's HTTP client; the API's content type is application/pdf. I captured the API response as a string, but its contents look something like "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 ".

How do I convert this response into text from which I can scrape the data I need?

Kaviranga
  • https://stackoverflow.com/questions/4784825/how-to-read-pdf-files-using-java – Conffusion Oct 20 '20 at 11:41
  • I'm more interested in creating a .pdf file from the above-mentioned response. How can I create a .pdf file so that I can later use PDFBox on it? I'm not able to create a valid .pdf file from "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 " – Rohan pawar Oct 20 '20 at 11:51
  • Load the response content as an InputStream and pass it to PDFBox to instantiate a PDF document. – Conffusion Oct 20 '20 at 11:52
  • I'm still getting issues when using PDFBox on this PDF. Can you please reproduce some hello world PDF data in a format like "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 " and then use PDFBox to fetch the content from it? I tried saving these contents into a .pdf file but could not open it with any PDF reader. – Rohan pawar Oct 20 '20 at 12:40

2 Answers


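I have used Apache Tika for this: read the input stream, parse it, and get the content. Alternatively, you can:

  • capture the response content (as raw bytes rather than a String, so the binary data is not altered)
  • store it in a temp file
  • read the file using PDFBox
  • read each page individually and process its content

The Tika approach looks like this:

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.springframework.core.io.Resource;
    import org.springframework.http.*;
    import org.springframework.web.bind.annotation.*;
    import org.springframework.web.client.RestTemplate;
    import org.xml.sax.SAXException;

    @RestController
    public class PdfTestController { // class name and field are assumed; the original snippet omitted them
        private final RestTemplate restTemplate = new RestTemplate();

        @GetMapping("/pdf-test")
        public String pdfTest() throws IOException, TikaException, SAXException {
            final HttpHeaders headers = new HttpHeaders();
            headers.set("User-Agent", "stack-overflow-server");
            final HttpEntity<String> entity = new HttpEntity<>(headers);
            final String testPdf1 = "http://www.africau.edu/images/default/sample.pdf"; // alternative sample
            final String testPdf2 = "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf";
            // Request the body as a Resource so the PDF bytes are never decoded into a String.
            ResponseEntity<Resource> exchange = restTemplate.exchange(testPdf2, HttpMethod.GET, entity, Resource.class);
            try (InputStream pdfInputStream = exchange.getBody().getInputStream()) {
                BodyContentHandler handler = new BodyContentHandler(); // collects the extracted text
                new PDFParser().parse(pdfInputStream, handler, new Metadata(), new ParseContext());
                return handler.toString();
            }
        }
    }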

Dependencies are as below:

compile group: 'org.apache.tika', name: 'tika-core', version: '1.24.1'
compile group: 'org.apache.tika', name: 'tika-parsers', version: '1.24.1'
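
For the temp-file alternative listed above, a likely reason the saved file will not open is that the body went through a String first: decoding binary PDF bytes as text is lossy. The sketch below uses only the JDK to illustrate this and to write raw bytes to a temp file instead. The class name `PdfTempFile`, both helper names, and the sample bytes are made up for illustration; a real response body would take the place of `sample`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

final class PdfTempFile {

    /** True if the bytes come back unchanged after a byte[] -> String -> byte[] round trip. */
    static boolean survivesStringRoundTrip(byte[] data) {
        String asText = new String(data, StandardCharsets.UTF_8);
        return Arrays.equals(data, asText.getBytes(StandardCharsets.UTF_8));
    }

    /** Writes the raw bytes to a new temp file ending in .pdf and returns its path. */
    static Path writeToTempPdf(byte[] pdfBytes) throws IOException {
        Path tmp = Files.createTempFile("response-", ".pdf");
        return Files.write(tmp, pdfBytes);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real response body (assumption): the ASCII header plus
        // a comment line of high-bit bytes, which PDF files commonly contain.
        byte[] sample = {'%', 'P', 'D', 'F', '-', '1', '.', '5', '\n',
                         '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'};
        System.out.println("Survives String round trip: " + survivesStringRoundTrip(sample));
        Path tmp = writeToTempPdf(sample);
        System.out.println("Wrote " + Files.size(tmp) + " bytes to " + tmp);
        Files.delete(tmp);
    }
}
```

If the round-trip check reports false for your response, the String version has already lost data, and no amount of saving it to a .pdf file afterwards will produce a file that PDFBox or a PDF reader can open.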
silentsudo
  • Can you please reproduce some hello world PDF data in a format like "%PDF-1.5%1 0 obj<>/Font<>>>/Contents 13 0 " and then try to fetch the content from it? I tried saving these contents into a .pdf and a .txt file but could not parse them or open the PDF with any PDF reader. – Rohan pawar Oct 21 '20 at 05:24
  • Can you share the PDF with me? – silentsudo Oct 21 '20 at 13:27

We can capture the PDF data in a byte array, write that data to a .pdf file, and then use the PDFBox APIs to get the required data from the PDF.
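
A minimal sketch of the capture-and-write part, assuming Java 11+ and its built-in `java.net.http.HttpClient` (the question does not say which client is in use). The class name `PdfDownload`, the helper `looksLikePdf`, and the output file name `response.pdf` are hypothetical; the URL is passed as a command-line argument.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PdfDownload {

    /** Returns true if the bytes start with the "%PDF-" signature. */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Fetches the URL as raw bytes (never as a String) and writes them to target. */
    static Path download(String url, Path target) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        // ofByteArray() keeps the body binary; ofString() would decode it as text.
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        byte[] body = response.body();
        if (!looksLikePdf(body)) {
            throw new IOException("Response does not start with %PDF-; HTTP status " + response.statusCode());
        }
        return Files.write(target, body);
    }

    public static void main(String[] args) throws Exception {
        Path pdf = download(args[0], Path.of("response.pdf"));
        System.out.println("Saved " + Files.size(pdf) + " bytes to " + pdf);
    }
}
```

Once the file is on disk, PDFBox 2.x can open it with `PDDocument.load(file)` and extract the text with `new PDFTextStripper().getText(document)`; in PDFBox 3.x the loading call is `Loader.loadPDF(file)` instead.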