How to read PDF/DOCX page by page using the tika library in Python?

Asked Jan 04 '23 at 08:07

Active Jan 04 '23 at 08:24

Viewed 110 times

```python
#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser

# Parse the whole document in one call
parsed = parser.from_file('frank_diary.docx')
print(parsed["metadata"])
print(parsed["content"])
```

With this code I am able to read the whole file, but not page by page.

Ref. I went through this reference, but it's not working. Is there any way to read a PDF/DOCX file page by page using tika?

I expect to be able to read a PDF/DOCX file page by page with tika.

Example: Dict = [{"page_number": 1, "content": "content"}, {"page_number": 2, "content": "content"}, {"page_number": 3, "content": "content"}]
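One possible way to get that shape, offered as a minimal sketch rather than a confirmed answer: `parser.from_file` accepts `xmlContent=True`, which returns XHTML instead of plain text, and for PDFs Tika normally wraps each page in a `<div class="page">` element. The helper names `split_pages` and `read_pages` below are made up for illustration. As far as I know a DOCX file does not store fixed page breaks, so Tika usually emits no page divs for it and this sketch would return an empty list in that case.

```python
from html.parser import HTMLParser


class _PageSplitter(HTMLParser):
    """Collect the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.pages = []    # text of every finished page, in document order
        self._depth = 0    # <div> nesting depth inside the current page (0 = outside)
        self._chunks = []  # text fragments of the page being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A nested <div> inside a page: track it so its closing tag
            # is not mistaken for the end of the page.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.pages.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def split_pages(xhtml):
    """Hypothetical helper: turn Tika XHTML into a list of page dicts."""
    splitter = _PageSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return [
        {"page_number": number, "content": text}
        for number, text in enumerate(splitter.pages, start=1)
    ]


def read_pages(path):
    """Hypothetical wrapper: parse a file with Tika and split it into pages."""
    # Imported here so split_pages can be used without tika-python installed.
    from tika import parser

    parsed = parser.from_file(path, xmlContent=True)
    # "content" can be None when Tika extracts nothing.
    return split_pages(parsed["content"] or "")
```

Usage would then look like `pages = read_pages('some_file.pdf')`, where the file name is only a placeholder. For DOCX input, converting the file to PDF first is one option to get real page boundaries, though that is an assumption about the workflow and not something tika does for you.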

edited Jan 04 '23 at 08:24

dilip kukadiya

0 Answers