PDF text extraction - Get dictionary of Heading as Key and Paragraph as Value

Asked Sep 23 '19 at 08:08

Active Sep 23 '19 at 08:18

Viewed 309 times

The same question has been asked here and here, but I couldn't find a way to extract only the headings from a PDF file. Say a PDF file was generated from a Word document that has structured headings with paragraphs under them. I'd like to extract every heading together with the paragraphs under it, as a dictionary with the heading as key and the paragraph text as value.

Is there a way to achieve this in Python? If so, I would appreciate an initial guide. Thank you.

Ali Asad

Word-exported PDFs are usually tagged. So you should probably look for Python PDF libraries that can extract text together with its tags and structure. (I'm not using Python myself and don't know which Python PDF libraries can do that.) – mkl Sep 23 '19 at 10:19
Awesome man, I was thinking the same. In fact, I did a demo just now. Do you know of any such library that could help me get those tags? Thanks – Ali Asad Sep 23 '19 at 10:31
As mentioned above, I'm not using Python myself. I have an idea how to do that using iText (Java/.Net) or PDFBox (Java) but there surely are many other libraries that support that, too, and I'm just not aware of them. – mkl Sep 23 '19 at 11:09
Anyway, thanks mate. Appreciate it. – Ali Asad Sep 23 '19 at 11:22
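
The tagged-PDF idea from the comments can be sketched in plain Python, independent of any particular PDF library. This is only a minimal illustration under assumptions: it supposes some library has already walked the PDF's structure tree and produced a flat list of (tag, text) pairs in reading order, using the standard heading types H and H1 to H6 and the paragraph type P. The function name `group_by_heading` and the sample data are hypothetical and do not come from any library.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Heading structure types used in tagged PDFs: H, H1 ... H6.
HEADING_TAG = re.compile(r"^H[1-6]?$")


def group_by_heading(elements: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each heading's text to the text of the elements that follow it.

    `elements` is assumed to be (tag, text) pairs in reading order, as some
    PDF library might yield them from a tagged PDF's structure tree.
    """
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for tag, text in elements:
        text = text.strip()
        if not text:
            continue  # skip empty elements
        if HEADING_TAG.match(tag):
            # Start a new section (or reuse one with the same heading text).
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            # Any non-heading element is attached to the latest heading.
            sections[current].append(text)
        # Text that appears before the first heading is ignored.
    return {heading: "\n".join(parts) for heading, parts in sections.items()}


# Hypothetical input, standing in for what a tag-aware extractor would return.
sample = [
    ("H1", "Introduction"),
    ("P", "First paragraph."),
    ("P", "Second paragraph."),
    ("H2", "Scope"),
    ("P", "Scope text."),
]
result = group_by_heading(sample)
```

With this approach, headings that share the same text are merged into one dictionary entry; if that matters, the key could be made unique, for example by including the heading's position.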

0 Answers