Is it possible to extract text by page for word/pdf files using Apache Tika?

Question

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

topchef · Accepted Answer · 2011-06-17T04:09:26.587

Actually, Tika does handle pages (at least in PDF): it sends the elements <div><p> before a page starts and </p></div> after it ends. You can easily set up a page count in your handler using this (this version counts pages using only <p>):

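import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public abstract class MyContentHandler implements ContentHandler {
    // Element treated as a page marker. Note: this may overcount if the
    // parser also emits <p> for ordinary paragraphs.
    private String pageTag = "p";
    protected int pageNumber = 0;
    ...
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        if (pageTag.equals(qName)) {
            startPage();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (pageTag.equals(qName)) {
            endPage();
        }
    }

    // Called when a page marker opens; subclasses can override to start collecting text.
    protected void startPage() throws SAXException {
        pageNumber++;
    }

    // Called when a page marker closes; a no-op by default.
    protected void endPage() throws SAXException {
    }
    ...
}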

When doing this with PDF, you may run into a problem where the parser doesn't send text lines in the proper order. See "Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)" for how to handle this.
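To go from counting pages to collecting the text of each page, the same handler idea can be extended roughly as follows. This is only a sketch under an assumption: that each page arrives wrapped in <div class="page">, which is what Tika's PDF parser appears to emit in the versions I have seen, so check the XHTML output of your own version first. PageTextHandler and splitPages are hypothetical names, and splitPages only exists so the handler can be tried on an XHTML string with the JDK's SAX parser, without Tika on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text inside each <div class="page"> element as one string per page. */
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    // Nesting depth of <div> elements inside the current page; 0 means "not inside a page".
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        boolean isDiv = "div".equals(qName);
        if (depth > 0) {
            if (isDiv) {
                depth++;               // nested div inside a page
            }
        } else if (isDiv && "page".equals(atts.getValue("class"))) {
            depth = 1;                 // assumed page marker: <div class="page">
            current.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth == 0) {
            return;
        }
        if ("p".equals(qName)) {
            current.append('\n');      // keep paragraphs on separate lines
        } else if ("div".equals(qName) && --depth == 0) {
            pages.add(current.toString().trim());   // the page's div just closed
        }
    }

    List<String> getPages() {
        return pages;
    }

    /** Hypothetical helper: runs the handler over an XHTML string using the JDK SAX parser. */
    static List<String> splitPages(String xhtml) {
        PageTextHandler handler = new PageTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return handler.getPages();
    }
}
```

With Tika itself, an instance of this handler would be passed as the handler argument of parser.parse(...), and getPages() read afterwards. Keying on the div's class attribute rather than on <p> is meant to avoid counting ordinary paragraphs as pages.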

Just counting <p> tags also counts normal paragraphs, not just pages, at least for me. – Philipp Nowak, Jul 07 '16 at 07:45

score 5 · Answer 2 · edited Oct 14 '15 at 16:47

You can get the number of pages in a PDF using the metadata object's xmpTPg:NPages key, as in the following:

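// Assumes fis (an open InputStream for the file) and handler (a ContentHandler) are defined elsewhere.
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
parser.parse(fis, handler, metadata, parseContext);
// The page count comes back as a String; it may be null if the parser did not set this key.
String pageCount = metadata.get("xmpTPg:NPages");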

hd1 · answered Jul 24 '13 at 21:22 · edited by Community

This doesn't answer the actual question. The question is not about how to get the total number of pages but about how to extract text on a page by page basis. – Abraham Milano Apr 16 '20 at 21:24

score 5 · Answer 3 · answered Apr 29 '11 at 01:58

You'll need to work with the underlying libraries - Tika doesn't do anything at the page level.

For PDF files, PDFBox should be able to give you page-level access. For Word, HWPF and XWPF from Apache POI don't really do page-level things: the page breaks aren't stored in the file, but instead need to be calculated on the fly based on the text + fonts + page size...

Gagravarr

So while Tika uses PDFBox under the hood, it doesn't provide the same breadth of functionality that PDFBox does? I'm especially concerned that, from what I see, Tika doesn't allow you to set start and end pages the way PDFBox does, as this SO thread demonstrates: http://stackoverflow.com/questions/6839787/reading-a-particular-page-from-a-pdf-document-using-pdfbox – Don Cheadle Oct 08 '14 at 14:28

Apache Tika provides common functionality across a very wide range of file formats. It'll never expose everything that each library does; instead, it makes life simple and consistent. – Gagravarr Oct 08 '14 at 14:33

So if I want page-by-page processing for PDFs and the like, Tika won't get me there, and I should basically use PDFBox? – Don Cheadle Oct 08 '14 at 14:35
