I'm having trouble reading a PDF that has a header and a footer and a two-column body.

I already have the column widths and height of the header, but I need code that reads the pages column by column.

Can anyone provide me with a piece of code that reads a PDF with columns?

Thank you.

Marco Araujo

1 Answer

It's very hard to achieve what you want if you don't know the position of the columns, but I assume that you have their coordinates because you say "I already have the column widths and height". In that case, your question isn't that different from this other question posted on StackOverflow: "iTextSharp read from specific position".

Suppose that rect is a Rectangle corresponding to the position of a column; then you need this code:

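// Keep only the text chunks that fall inside rect (the area of one column).
RenderFilter[] filter = {new RegionTextRenderFilter(rect)};
ITextExtractionStrategy strategy = new FilteredTextRenderListener(
    new LocationTextExtractionStrategy(), filter);
// reader is an open PdfReader; i is the page number (pages are numbered from 1).
String single_column = PdfTextExtractor.GetTextFromPage(reader, i, strategy);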

Now you have the text of a single column. You need to repeat this for every column on your page, using a new strategy each time.
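Repeating the extraction for every column could look roughly like the sketch below. It is only an illustration under assumptions: the class name ColumnReader, the HeaderHeight and FooterHeight values, and the column edges passed in Main are hypothetical placeholders that you would replace with your own measurements. Coordinates are in PDF user space (points, with the origin at the lower left of the page), and the sketch ignores page rotation.

```csharp
using System;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnReader
{
    // Hypothetical layout values in points; replace with your own measurements.
    const float HeaderHeight = 60f;
    const float FooterHeight = 40f;

    // Returns the text of each column on the page, left to right.
    // columnEdges holds the x coordinates of the column boundaries,
    // so n columns need n + 1 edges.
    public static string[] ReadColumns(PdfReader reader, int page, float[] columnEdges)
    {
        Rectangle size = reader.GetPageSize(page);
        float bottom = size.Bottom + FooterHeight; // skip the footer
        float top = size.Top - HeaderHeight;       // skip the header

        var columns = new string[columnEdges.Length - 1];
        for (int c = 0; c < columns.Length; c++)
        {
            var rect = new Rectangle(columnEdges[c], bottom, columnEdges[c + 1], top);
            // Create a fresh strategy for every region: a
            // LocationTextExtractionStrategy keeps the chunks it has seen.
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(),
                new RegionTextRenderFilter(rect));
            columns[c] = PdfTextExtractor.GetTextFromPage(reader, page, strategy);
        }
        return columns;
    }

    public static void Main(string[] args)
    {
        // args[0] is the path to the PDF. The edges below assume two columns
        // on a 612 pt wide page with 36 pt margins (an assumption).
        float[] edges = { 36f, 306f, 576f };
        var reader = new PdfReader(args[0]);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                foreach (string column in ReadColumns(reader, page, edges))
                {
                    Console.WriteLine(column);
                }
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Reading the left column completely before the right one gives the text in reading order for a simple two-column layout. If the column positions differ from page to page, you would need to supply different edges per page.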

Extra comment: While in most cases using the RegionTextRenderFilter will work just fine, a few cases (in which columns are created by simply inserting additional space characters in the lines) might require splitting the text chunks in advance. This can be done, for example, by using the TextRenderInfoSplitter from this answer and wrapping the FilteredTextRenderListener in it. (This comment was provided by mkl.)

Bruno Lowagie
  • While in most cases using the `RegionTextRenderFilter` will work just fine, a few cases (in which columns are created by simply inserting additional space characters in the lines) might require splitting the text chunks in advance. This can be done, for example, by using the `TextRenderInfoSplitter` from [this answer](http://stackoverflow.com/questions/21000256/pdf-reading-highlighed-text-highlight-annotations-using-c-sharp/21023311#21023311) and wrapping the `FilteredTextRenderListener` in it. – mkl Jun 16 '14 at 07:31
  • Good remark, mkl, I'll add your comment to the answer. – Bruno Lowagie Jun 16 '14 at 14:16
  • Thanks Bruno, but this strategy turns each double \n in the text into a single \n. I need the double \n to be kept, as SimpleTextExtractionStrategy() does. Do you know how I can use rectangles without losing the double \n? – Marco Araujo Jun 18 '14 at 15:35
  • Your assumption that there's something like `\n` in a PDF file is wrong. There are lines with a different leading (that is a different space between the baseline of two lines of text). There is *no such thing as* `\n` in a text string in a PDF. This makes your question invalid. (If you think differently, please give me the section of ISO-32000-1 where it's explained.) – Bruno Lowagie Jun 18 '14 at 16:28
  • Thanks Bruno, you're right. But I noticed that tab characters (Tab key) are not recovered from the PDF. Can you tell me how I can recover the tabs through the "PdfTextExtractor.GetTextFromPage(PdfReader, pageNumber, strategy);" command? – Marco Araujo Jun 21 '14 at 19:55