I'm reading a PDF file with iTextSharp, but the following code does not return TAB characters, only line breaks.

var rect = new System.util.RectangleJ(x, y, width, height);
var filters = new RenderFilter[1];
filters[0] = new RegionTextRenderFilter(rect);
ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);
var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);

Can someone help me?

Thank you.

Marco Araujo

1 Answer

Nobody can answer your question as asked, because it rests on a wrong assumption: the concept of a TAB character does not exist in a PDF content stream.

There is no such thing as a TAB character between two words. TABs are created by defining distances between words. Text is added at absolute positions and if two snippets of text need to be separated by tab space, the coordinates are adapted in accordance with this requirement. There are no TAB characters! Only differences in distances between text snippets.
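To make the idea concrete, here is a minimal sketch of how tabs could be approximated after extraction, assuming you have already collected the horizontal start and end coordinates of each text snippet on a line (for example from the position data iTextSharp exposes). The `TextSnippet` class, the `JoinLine` method, and both threshold parameters are hypothetical names chosen for illustration; they are not part of iTextSharp, and suitable threshold values depend on the font size and layout of your document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical container for one text snippet found on a single line.
public sealed class TextSnippet
{
    public string Text { get; }
    public float StartX { get; }   // x coordinate where the snippet begins
    public float EndX { get; }     // x coordinate where the snippet ends

    public TextSnippet(string text, float startX, float endX)
    {
        Text = text;
        StartX = startX;
        EndX = endX;
    }
}

public static class TabReconstructor
{
    // Joins the snippets of one line from left to right.
    // A gap of at least tabThreshold is emitted as '\t',
    // a gap of at least spaceThreshold as ' ',
    // and anything smaller is concatenated directly.
    // Both thresholds are assumptions the caller has to tune.
    public static string JoinLine(
        IEnumerable<TextSnippet> snippets,
        float spaceThreshold,
        float tabThreshold)
    {
        var sb = new StringBuilder();
        TextSnippet previous = null;

        foreach (var current in snippets.OrderBy(x => x.StartX))
        {
            if (previous != null)
            {
                // Horizontal distance between the end of the previous
                // snippet and the start of the current one.
                float gap = current.StartX - previous.EndX;
                if (gap >= tabThreshold)
                    sb.Append('\t');
                else if (gap >= spaceThreshold)
                    sb.Append(' ');
            }
            sb.Append(current.Text);
            previous = current;
        }
        return sb.ToString();
    }
}
```

For instance, calling `TabReconstructor.JoinLine` with a space threshold of 2 and a tab threshold of 20 on three snippets spanning x = 50 to 80, 200 to 235, and 238 to 275 would put a tab after the first snippet and a single space between the second and third. This is only a heuristic: the PDF itself carries no information about whether a wide gap was meant as a tab.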

iTextSharp can give you detailed information about the position of text snippets that are stored inside a PDF. You can find some code in the accepted answer to this question: PDF Reading highlighed text (highlight annotations) using C#

We've demonstrated the concept of text extraction at our iText Summit in Cologne on June 17, 2014. These are the slides that will help you on your way: http://www.slideshare.net/iTextPDF/itext-summit-2014-talk-unstructured-pdf

Bruno Lowagie
  • Hello Bruno, thanks for the reply, I will analyze it. In this same situation, I cannot distinguish a line break caused by text reaching the column limit from an intentional line break, such as the one after a header. Could you guide me on how to tell these two situations apart? – Marco Araujo Jun 24 '14 at 04:54
  • You have identified the core problem of PDF parsing. I've seen some solutions at the PDF days last week and they all involved human intervention (it can't be done automatically, you need manual, visual operations). – Bruno Lowagie Jun 24 '14 at 06:11