Wishing you all a peaceful and happy New Year!

I am very new to reading PDF content that contains images, text and tables. I used iTextSharp (TextWithFontExtractionStategy) to read the content and convert it to HTML, but I only got this working for text. I have searched many sites for suggestions but could not find a solution.

What I want to achieve now is to read content from a PDF that contains text, images and tables, and convert it to HTML. I understand that images and tables are difficult to identify.

For images: I don't want to extract the image from the PDF. I only want to insert a placeholder where the image occurs, so that I can supply alternate text. Is it possible to detect that an image is present while reading the PDF content? At the moment iTextSharp (TextWithFontExtractionStategy) skips the image and moves on to the next item.

For tables: I want to read each table as it appears in the PDF.

All of these conversions should be combined into a single result.

If anyone can help me, it would be greatly appreciated!

Thanks a lot in advance!!

Prabhu

1 Answer

Since you are developing with .NET, you can use the PDFsharp library.

Capturing Images

There is also an excellent SO answer on retrieving table data, written with reference to the PDF specification.

Ozan Gunceler