Python - Apache Tika Single Page parser

Question

Is there any way, using Tika from Python, to parse only the first page of a PDF, or to extract the metadata from the first page only? Right now, when I pass the PDF, it parses every single page. I looked at this link: Is it possible to extract text by page for word/pdf files using Apache Tika? However, that answer is mostly in Java, which I am not familiar with. I was hoping there could be a Python solution for it. Thanks!

from tika import parser
# running: java -jar tika-server1.18.jar before executing code below. 
parsedPDF = parser.from_file('C:\\path\\to\\dir\\sample.pdf')

fulltext = parsedPDF['content']

metadata_dict = parsedPDF['metadata']
title = metadata_dict['title']
author = metadata_dict['Author'] # captures the names from all pages (say, 15 of them); I only want the ones from the first page
pages = metadata_dict['xmpTPg:NPages']

Short answer - nope. Best you can do is fetch the content as XHTML and then grab only the first page's div — Gagravarr, Nov 01 '18 at 12:05

Samuel Verboomen · Answer 1 · 2019-09-23T15:01:26.570

Thanks for this info, really helpful. Here is my code to retrieve the content page by page (a bit dirty, but it works):

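    from tika import parser

    def pages_from_file(file):
        # Ask Tika for XHTML output so the page boundaries are preserved
        raw_xml = parser.from_file(file, xmlContent=True)
        body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
        # Strip paragraph tags and plain divs; the page divs stay as separators
        body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
        text_pages = body_without_tag.split('<div class="page">')[1:]
        num_pages = len(text_pages)
        # Check that it worked: compare with the page count in the metadata
        if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):
            return text_pages
        return None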
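
If the chained replace calls above turn out to be too brittle, the same page split can be sketched with the standard library's html.parser. This is only a sketch, not Tika's own API: PageSplitter and split_pages are hypothetical names, the sample string is made up, and it assumes each page is wrapped in a div with class "page", as in the code above. It counts div nesting so that a div inside a page does not end the page early.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text inside each <div class="page"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pages = []      # finished pages, in document order
        self._chunks = None  # text pieces of the page being read, else None
        self._depth = 0      # div nesting depth inside the current page

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._chunks is None:
            # Only a page div opens a new page
            if dict(attrs).get("class") == "page":
                self._chunks = []
                self._depth = 1
        else:
            # A div nested inside the current page
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._chunks is not None:
            self._depth -= 1
            if self._depth == 0:
                self.pages.append("".join(self._chunks).strip())
                self._chunks = None

    def handle_data(self, data):
        if self._chunks is not None:
            self._chunks.append(data)


def split_pages(xhtml):
    """Return a list with the text of every page div found in xhtml."""
    splitter = PageSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.pages


# Made-up sample standing in for parser.from_file(..., xmlContent=True)['content']
sample = (
    '<html><body>'
    '<div class="page"><p>Page one</p>\n<div class="annotation"><p>note</p></div></div>'
    '<div class="page"><p>Page two</p></div>'
    '</body></html>'
)
pages = split_pages(sample)
print(pages[0])
```

With real Tika output, pages[0] would then be the first page's text, and len(pages) can still be compared against xmpTPg:NPages as a sanity check.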

score 4 · Accepted Answer · answered Nov 07 '18 at 18:38

Following @Gagravarr's comment regarding XHTML, I found that Tika has an xmlContent option when parsing the file. I used it to get the content as XML and then used a regex to extract the part I needed.

This worked out for me:

parsed_data_full = parser.from_file(file_name,xmlContent=True) 
parsed_data_full = parsed_data_full['content']

Each page has a divider with a start and an end: it starts with "<div" and ends at the first occurrence of "</div>". I basically wrote a small piece of code that captures the sub-string between those two sub-strings and stores it in a variable, which covered my specific requirement.
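
A minimal sketch of that sub-string capture, for anyone who wants something to start from. The function name first_page_text and the sample string are hypothetical, and the code assumes the page divider is exactly `<div class="page">`, as Tika emits for PDFs. One deliberate difference from the description above: instead of stopping at the first "</div>", it stops where the next page divider opens (or at "</body>"), so a nested div does not cut the page short.

```python
import re
from html import unescape

PAGE_OPEN = '<div class="page">'  # assumed page divider


def first_page_text(xhtml):
    """Return the plain text of the first page div, or None if there is none."""
    start = xhtml.find(PAGE_OPEN)
    if start == -1:
        return None
    start += len(PAGE_OPEN)
    # The first page ends where the next page div opens, or at </body>
    end = xhtml.find(PAGE_OPEN, start)
    if end == -1:
        end = xhtml.find('</body>', start)
    if end == -1:
        end = len(xhtml)
    fragment = xhtml[start:end]
    text = re.sub(r'<[^>]+>', '', fragment)  # drop the remaining tags
    return unescape(text).strip()            # decode entities such as &amp;


# Made-up sample standing in for parsed_data_full
sample = (
    '<html><body>'
    '<div class="page"><p>First page text</p>\n<p>Author: A &amp; B</p></div>\n'
    '<div class="page"><p>Second page</p></div>'
    '</body></html>'
)
print(first_page_text(sample))
```

In the real case you would pass parsed_data_full (the 'content' string from above) instead of sample. Note that Tika still parses the whole file; this only narrows down what you keep afterwards.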

Thanks for this answer. However, this seems to only work for .pdf files, and not .docx files. The parsed XHTML of a .docx file does not contain `<div>` tags. Do you know of any way to handle this? — GoodDeeds, Apr 26 '19 at 21:46
