Extracting Tables from PDF document

Question

I want to extract tables from a PDF document programmatically using C# for a college project. I'm quite familiar with iTextSharp.

Is there a way I can extract tables with iTextSharp?
Is there any other free library I can use for this purpose?
Can I convert the PDF to XML/HTML in order to extract <table> tags? If so, is there a free library I can use for PDF-to-HTML conversion?

Otherwise, please suggest a suitable solution for this.

Have you looked at the iTextSharp documentation and examples on their site? — MethodMan, Aug 20 '14 at 16:17
Yes. So far I couldn't find a way to do this in iTextSharp, because tables are mostly text data and we can't differentiate table data from other text data in iTextSharp. — Buddhima Naween Rathnayake, Aug 20 '14 at 16:21

score 0 · Answer 1 · answered Aug 20 '14 at 16:29

Can you try something like this and extend it to what you need? I converted this example from VB.Net to the C# equivalent.

public static string GetTextFromPDF(string pdfFileName)
{
    // PdfReader keeps the file open, so it is closed in the finally block.
    var pdfReader = new iTextSharp.text.pdf.PdfReader(pdfFileName);
    var sOut = new System.Text.StringBuilder();
    try
    {
        // iTextSharp page numbers are 1-based.
        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            // A fresh strategy per page avoids carrying text over from earlier pages.
            var its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy();
            sOut.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(pdfReader, i, its));
        }
    }
    finally
    {
        pdfReader.Close();
    }
    return sOut.ToString();
}

A text extraction strategy is for extracting text from the PDF document. Yes, the code works fine for that purpose, but I need to extract the table. How can I identify the text inside the table, as opposed to other paragraph text? — Buddhima Naween Rathnayake, Aug 20 '14 at 16:37
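
For the follow-up question above, one possible direction is to work with the position of each piece of text rather than the flattened string. The following is only a rough sketch of that idea, not an iTextSharp feature: `TextChunk`, `TableHeuristics`, `GroupIntoRows`, `LikelyTableRows`, and the tolerance values are all hypothetical names and assumptions made for illustration. It groups chunks that share a baseline into rows and treats rows with at least two cells as probable table rows, which will misfire on multi-column page layouts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one extracted text chunk and its position on the page.
public sealed class TextChunk
{
    public string Text { get; }
    public float X { get; }   // left edge of the chunk
    public float Y { get; }   // baseline; in PDF coordinates y grows upward

    public TextChunk(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class TableHeuristics
{
    // Groups chunks whose baselines lie within yTolerance of the row's first
    // chunk into one row (top of page first), then orders cells left to right.
    public static List<List<string>> GroupIntoRows(IEnumerable<TextChunk> chunks, float yTolerance = 2f)
    {
        var rows = new List<List<TextChunk>>();
        foreach (var chunk in chunks.OrderByDescending(c => c.Y))
        {
            var last = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (last != null && Math.Abs(last[0].Y - chunk.Y) <= yTolerance)
                last.Add(chunk);
            else
                rows.Add(new List<TextChunk> { chunk });
        }
        return rows
            .Select(r => r.OrderBy(c => c.X).Select(c => c.Text).ToList())
            .ToList();
    }

    // Assumption: a row with at least minColumns separate chunks is table-like,
    // while ordinary paragraph lines usually arrive as a single chunk.
    public static List<List<string>> LikelyTableRows(List<List<string>> rows, int minColumns = 2)
    {
        return rows.Where(r => r.Count >= minColumns).ToList();
    }
}
```

With iTextSharp 5, the chunk positions could presumably be collected in a custom `IRenderListener` from `TextRenderInfo.GetText()` and `TextRenderInfo.GetBaseline().GetStartPoint()`; that wiring is left out here, and the thresholds would need tuning per document.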
