Heading and sub-heading extraction from PDF

Asked Oct 29 '18 at 10:16


I am currently working on extracting text from PDF files. My current problem is distinguishing the headings and sub-headings in the extracted text. I am using iTextSharp and rely on the bold-text information to detect headings. The font size cannot always be trusted. I have also tried PDFBox.

1) Is there any method to identify headings and sub-headings in a PDF?

2) Do Adobe or PDF-XChange Editor provide an API for this?

For example:

(A screenshot of a sample PDF page was attached here.)

I need to extract

"Tourism in 2040: Bringing an additional one million visitors per year to paradise" as heading

"Executive Summary" as sub-heading

Even though this can be extracted using bold text info, it failed in a lot of cases. That is why I am looking for an API.
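To make the heuristic concrete, here is a minimal, library-independent sketch of combining the bold flag with the font size *relative to the body text*, rather than relying on either signal alone. Everything in it is an assumption for illustration: the `Span` record, the `classify_headings` function, the thresholds, and the sample font names are hypothetical, and the spans would have to be filled from whatever the extraction library reports (iTextSharp, PDFBox, or another). It detects bold only from the font name, so it will miss documents that simulate bold without a bold font.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical record for one extracted text run."""
    text: str
    font_name: str
    font_size: float


def body_font_size(spans):
    """Return the font size that covers the most characters (assumed body text)."""
    weights = Counter()
    for span in spans:
        weights[round(span.font_size, 1)] += len(span.text)
    return weights.most_common(1)[0][0]


def is_bold(font_name):
    """Guess boldness from the font name; this is a heuristic, not a guarantee."""
    name = font_name.lower()
    return any(tag in name for tag in ("bold", "black", "heavy"))


def classify_headings(spans, size_ratio=1.15, max_heading_chars=120):
    """Return (level, text) pairs; level 1 is the largest heading size found."""
    body = body_font_size(spans)
    candidates = []
    for span in spans:
        text = span.text.strip()
        # Skip empty runs and long runs, which are unlikely to be headings.
        if not text or len(text) > max_heading_chars:
            continue
        larger_than_body = span.font_size >= body * size_ratio
        if larger_than_body or is_bold(span.font_name):
            candidates.append(span)
    # Rank the distinct candidate sizes: the largest size becomes level 1.
    sizes = sorted({round(s.font_size, 1) for s in candidates}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(sizes)}
    return [(level_of[round(s.font_size, 1)], s.text.strip()) for s in candidates]


# Made-up spans modelled on the example above; fonts and sizes are invented.
sample = [
    Span("Tourism in 2040: Bringing an additional one million visitors "
         "per year to paradise", "Arial-BoldMT", 20.0),
    Span("Executive Summary", "Arial-BoldMT", 14.0),
    Span("Body text. " * 20, "ArialMT", 10.0),
]
result = classify_headings(sample)
for level, text in result:
    print(level, text)
```

Using the size relative to the dominant body size, instead of a fixed point size, is one way to cope with documents where absolute sizes vary. It remains a heuristic and needs tuning against the documents that currently fail.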

edited Oct 29 '18 at 15:21

mkl


asked Oct 29 '18 at 10:16

Aneesh Krishnan

*"Even though this can be extracted using bold text info, it failed in a lot of cases."* - then you should start by analyzing those other cases, finding strategies to extract heading and subheadings in all of them and finding ways to determine which strategies to apply to which documents. – mkl Oct 29 '18 at 15:24
