
My goal is to extract data from a PDF, which may be laid out as a table, into an Excel file.

Using LocationTextExtractionStrategy with iTextSharp, we can get the page content as plain text, read from left to right.

How can I proceed so that the text returned by

PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy())

retains its coordinates in the resulting string?

For instance, if the first line in the PDF is right-aligned, the resulting string should contain enough leading spaces to keep that content right-aligned.

Please suggest how I might achieve this.

Vinay

1 Answer


It's very important to understand that PDFs have no support for tables. Anything that looks like a table is really just a bunch of text placed at specific locations over a background of lines. Keep this in mind as you work on this.
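To make that concrete, here is a minimal sketch of how rows might be inferred once text fragments and their coordinates are in hand. It does not use iTextSharp at all; the `PositionedText` and `TableGuess` names and the 2-point tolerance are assumptions made up for illustration, and real documents may need a smarter grouping rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one extracted fragment; X and Y are in PDF points.
public sealed class PositionedText
{
    public string Text { get; }
    public float X { get; }
    public float Y { get; }

    public PositionedText(string text, float x, float y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class TableGuess
{
    // Puts a fragment in the current row when its Y is within yTolerance of the
    // row's first fragment; otherwise starts a new row. The tolerance is a guess.
    public static List<List<PositionedText>> GroupIntoRows(
        IEnumerable<PositionedText> fragments, float yTolerance = 2f)
    {
        var rows = new List<List<PositionedText>>();

        // PDF y coordinates grow upward, so descending Y reads top to bottom.
        foreach (var fragment in fragments.OrderByDescending(p => p.Y))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= yTolerance)
                current.Add(fragment);
            else
                rows.Add(new List<PositionedText> { fragment });
        }

        // Within each row, order the fragments left to right.
        foreach (var row in rows)
            row.Sort((a, b) => a.X.CompareTo(b.X));

        return rows;
    }
}
```

Each inner list is then a candidate table row. Columns could be guessed in a similar way by clustering on `X`, but nothing in the PDF guarantees that the result matches what the eye sees as a table.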

That said, you need to implement ITextExtractionStrategy (most easily by subclassing an existing strategy such as LocationTextExtractionStrategy) and pass that into GetTextFromPage(). See this post for a simple example of that. Then see this post for a more complex example of subclassing. The latter isn't completely relevant to your goal, but it does show some more complex things that you can do.
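As a rough sketch of what such a subclass might look like, assuming iTextSharp 5.x (where `LocationTextExtractionStrategy.RenderText` is overridable): the `TextWithBounds`, `PositionRecordingStrategy`, and `ExtractionDemo` names are invented for this example and are not part of the library.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical container for one rendered fragment and its bounding box.
public sealed class TextWithBounds
{
    public string Text;
    public float Left, Bottom, Right, Top;
}

// Illustrative subclass: keeps the default extraction behaviour and also
// records where each fragment was drawn on the page.
public class PositionRecordingStrategy : LocationTextExtractionStrategy
{
    public readonly List<TextWithBounds> Fragments = new List<TextWithBounds>();

    public override void RenderText(TextRenderInfo renderInfo)
    {
        // Let the base class collect the text as usual.
        base.RenderText(renderInfo);

        // Descent line start is the bottom-left corner; ascent line end is
        // the top-right corner of the rendered fragment.
        Vector bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        Vector topRight = renderInfo.GetAscentLine().GetEndPoint();

        Fragments.Add(new TextWithBounds
        {
            Text = renderInfo.GetText(),
            Left = bottomLeft[Vector.I1],
            Bottom = bottomLeft[Vector.I2],
            Right = topRight[Vector.I1],
            Top = topRight[Vector.I2]
        });
    }
}

public static class ExtractionDemo
{
    // Returns the recorded fragments for one page (page numbers start at 1).
    public static List<TextWithBounds> ReadPage(string pdfPath, int pageNumber)
    {
        var reader = new PdfReader(pdfPath);
        try
        {
            var strategy = new PositionRecordingStrategy();
            // The returned string is ignored here; only the positions are kept.
            PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
            return strategy.Fragments;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Note that a fragment here is whatever the PDF happened to draw in one text operation, which can be a whole line, a word, or a single character, so some merging is usually needed before the positions are useful.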

Chris Haas
  • Thanks @Chris for the solution. I am going to subclass it. – Vinay Sep 23 '11 at 05:18
  • After subclassing it with `TextChunk location = new TextChunk(info.GetText(), bottomleft, topRight, info.GetSingleSpaceWidth()); locationalResult.Add(location);` and calling it as `PdfTextExtractor.GetTextFromPage(reader, i, strategy)`, I am not getting the text in the desired manner. Can you help me find where I am going wrong? – Vinay Sep 23 '11 at 11:54
  • I could finally extract the text with positions from the PDF. Thanks for the help. Since then I have been trying to arrange it in a table structure for an Excel file, but so far I have not found a suitable DLL or solution for placing the content in the Excel file. I am thinking of creating and using an Excel template; at present I have the text data in a DataView / DataTable with text and position information. – Vinay Oct 11 '11 at 06:59
  • EPPlus.dll and NPOI.dll (that's "npoi") are two DLLs that can read/write Excel .xlsx files. NPOI.dll can read/write Excel "BIFF" (.xls) files, too. – user1390375 Aug 08 '20 at 00:55