I am currently working on extracting text from PDFs. My current issue is distinguishing headings and sub-headings in the extracted text. I am using iTextSharp and rely on bold-text information to detect headings. Font size cannot always be trusted. I have also tried PDFBox.
1) Is there any method to identify headings and sub-headings in a PDF?
2) Do Adobe or PDF-XChange Editor provide an API for this?
For example:
I need to extract
"Tourism in 2040: Bringing an additional one million visitors per year to paradise" as heading
"Executive Summary" as sub-heading
Even though these can be extracted using bold-text information, that approach failed in many cases, which is why I am looking for an API.
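For reference, here is a minimal sketch of the kind of bold-based heuristic I am describing. It is library-independent Python rather than my actual iTextSharp code, and it assumes each line has already been extracted together with its font name and font size. The `Line` class, `is_bold`, `classify`, the `max_words` cutoff, and the font names, sizes and body text in the sample are all hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_name: str    # assumed to come from the extractor, e.g. "Arial-BoldMT"
    font_size: float  # assumed to come from the extractor


def is_bold(font_name: str) -> bool:
    # Many PDF font names encode the weight, e.g. "Helvetica-Bold".
    # Fonts that fake bold (stroke width, render mode) are missed here,
    # which is one way this approach can fail.
    name = font_name.lower()
    return any(tag in name for tag in ("bold", "black", "heavy"))


def classify(lines, max_words=20):
    """Label each line as 'heading', 'sub-heading' or 'body'."""
    candidates = [
        ln for ln in lines
        if is_bold(ln.font_name)
        and 0 < len(ln.text.split()) <= max_words
        and not ln.text.rstrip().endswith(".")
    ]
    candidate_ids = {id(ln) for ln in candidates}
    # Font size is used only to rank lines that already passed the bold
    # test, since size on its own was not reliable.
    top_size = max((ln.font_size for ln in candidates), default=None)

    labels = []
    for ln in lines:
        if id(ln) not in candidate_ids:
            labels.append((ln.text, "body"))
        elif ln.font_size == top_size:
            labels.append((ln.text, "heading"))
        else:
            labels.append((ln.text, "sub-heading"))
    return labels


if __name__ == "__main__":
    sample = [
        Line("Tourism in 2040: Bringing an additional one million "
             "visitors per year to paradise", "Arial-BoldMT", 20.0),
        Line("Executive Summary", "Arial-BoldMT", 14.0),
        Line("Placeholder body text that follows the summary.", "ArialMT", 11.0),
    ]
    for text, label in classify(sample):
        print(f"{label:12} {text}")
```

This works when the font name carries the weight and headings are short, but it breaks on documents where body text is bold, where bold is simulated, or where all headings share one size.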