
I am trying to extract the headlines of some PDF files in order to sort them. Unfortunately, the extracted text has a space between every letter; the spaces between words are wider than the ones between letters of the same word. Here's my extraction method:

PdfReader reader = new PdfReader(filename);
Rectangle rect = new Rectangle(0, 0, 1000, 1000);
RenderFilter regionFilter = new RegionTextRenderFilter(rect);
FontRenderFilter fontFilter = new FontRenderFilter();
FilteredTextRenderListener strategy = new FilteredTextRenderListener(
    new LocationTextExtractionStrategy(), regionFilter, fontFilter);
string result = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();

Is there a way to filter out the smaller spaces?

derBasti
  • Do those spaces correspond to actual drawn space glyphs or are they produced from insertion point moves? If you don't know, please supply a sample PDF illustrating your issue. – mkl Mar 22 '15 at 22:38
  • Unfortunately I can't show you a sample pdf because I have no rights to do so. Can you please tell me how I can determine which kind of spaces they are? – derBasti Mar 23 '15 at 10:20
  • *Can you please tell me how I can determine which kind of spaces they are* - Use a PDF browser (e.g. RUPS) and inspect the respective page or xobject streams. Some thorough understanding of the [PDF specification](http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) is required. – mkl Mar 23 '15 at 11:13
  • [This answer](http://stackoverflow.com/a/13645183/1729265) and [this one](http://stackoverflow.com/a/20049810/1729265) deal with a probably related issue: in their case gaps between characters were too small to be recognized as spaces. If your spaces are derived from insertion point moves, those answers may inspire you. – mkl Mar 23 '15 at 11:17
  • As far as I can see, there are actual drawn spaces. – derBasti Mar 24 '15 at 20:16
  • In that case, i.e. if there are actual space characters, you can hardly expect text extraction not to extract the spaces, can you? – mkl Mar 24 '15 at 21:33
  • Yes there are spaces, but in different sizes. "Real" spaces are wider than the spaces I don't want to extract. – derBasti Mar 24 '15 at 21:36
  • I'm afraid this doesn't get us anywhere, the information is too indistinct to work on. Unless you share the PDF or at least relevant excerpts from the content streams and resources showing the operations behind both the spaces you want removed and those you don't want removed, I don't see a way to help. – mkl Mar 25 '15 at 08:12

1 Answer

iText uses the distance between rendered glyphs to decide whether a space is present. The general rule is: if the distance is larger than half the width of a normal space, a space character is recognized. This works quite well in most cases, but it fails completely if the width of a space character cannot be determined for the font in use. In my case the width of a space was reported as 0, so even the smallest distance between glyphs was recognized as a space. I based my solution on another answer from mkl to a question that is very similar to yours.
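To make the rule concrete, here is a minimal, library-free sketch in Java. It is not iText code: the `Chunk` class, the `fallbackSpaceWidth` parameter, and the coordinates are made up for illustration. It only mirrors the "gap larger than half a space width" rule described above and shows what happens when the font reports a space width of 0.

```java
import java.util.List;

class SpaceHeuristicSketch {
    /** Hypothetical stand-in for a rendered glyph run (not an iText type). */
    static final class Chunk {
        final String text;
        final float startX;      // x-coordinate where the glyph starts
        final float endX;        // x-coordinate where the glyph ends
        final float spaceWidth;  // space width the font reports (0 if unknown)

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /** A gap counts as a word boundary if it exceeds half the space width. */
    static boolean isWordBoundary(Chunk previous, Chunk current, float fallbackSpaceWidth) {
        // Assumption: substitute a fallback when the font reports no usable width.
        float spaceWidth = current.spaceWidth > 0 ? current.spaceWidth : fallbackSpaceWidth;
        float gap = current.startX - previous.endX;
        return gap > spaceWidth / 2.0f;
    }

    /** Concatenates the chunks, inserting a space at each detected word boundary. */
    static String join(List<Chunk> chunks, float fallbackSpaceWidth) {
        StringBuilder sb = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : chunks) {
            if (previous != null && isWordBoundary(previous, c, fallbackSpaceWidth)) {
                sb.append(' ');
            }
            sb.append(c.text);
            previous = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Gaps between glyphs: 0.5, 3.0, 0.5; the font reports a space width of 0.
        List<Chunk> chunks = List.of(
            new Chunk("H", 0f, 5f, 0f),
            new Chunk("i", 5.5f, 7f, 0f),
            new Chunk("y", 10f, 15f, 0f),
            new Chunk("o", 15.5f, 20f, 0f));
        System.out.println(join(chunks, 0f));   // prints "H i y o"
        System.out.println(join(chunks, 2.5f)); // prints "Hi yo"
    }
}
```

With a threshold of 0, every positive gap becomes a space, which matches the symptom in the question. A sensible fallback width has to be chosen per document; the 2.5 used here is arbitrary.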

In short: You need to derive from e.g. SimpleTextExtractionStrategy or LocationTextExtractionStrategy and override the methods that convert the distance between glyphs into spaces (renderText or isChunkAtWordBoundary respectively).

You can also refer to the answer I gave here or the original solution by mkl.

tom_imk
  • The OP indicated that there may actually be space characters between the letters. In that case changing the parameters you refer to will not help. (Probably, though, the OP is wrong and really merely needs some word boundary detection fine tuning...) – mkl Dec 22 '15 at 12:59