I want to extract tables from a PDF document programmatically using C# for a college project. I'm quite familiar with iTextSharp.

  1. Is there a way I can extract tables with iTextSharp?

  2. Is there any other free library I can use for this purpose?

  3. Can I convert the PDF to XML/HTML in order to extract <table> tags? If so, is there a free library I can use for PDF-to-HTML conversion?

Otherwise, please suggest a suitable solution.

Community

1 Answer

You could try something like this and extend it to fit your needs. I converted this example from VB.NET to the C# equivalent:

// Requires a reference to the iTextSharp 5.x assembly.
public static string GetTextFromPDF(string PdfFileName)
{
    var pdfReader = new iTextSharp.text.pdf.PdfReader(PdfFileName);
    var sOut = new System.Text.StringBuilder();
    try
    {
        // PDF page numbers are 1-based.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            // Use a fresh strategy for each page so text does not carry over between pages.
            var its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            sOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(pdfReader, i, its));
        }
    }
    finally
    {
        // Release the underlying file handle.
        pdfReader.Close();
    }
    return sOut.ToString();
}
MethodMan
    TextExtractionStrategy is for extracting text from the PDF document. Yes, the code works fine for that purpose, but I need to extract the tables. How can I distinguish the text inside a table from the other paragraph text? – Buddhima Naween Rathnayake Aug 20 '14 at 16:37