ITextSharp: Extract text without small spaces

Question

I am trying to extract the headlines of some PDF files in order to sort them. Unfortunately, there is a space between every pair of letters, and the spaces between words are wider than those between letters of the same word. Here is my extraction method:

PdfReader reader = new PdfReader(filename);
Rectangle rect = new Rectangle(0, 0, 1000, 1000);
RenderFilter regionFilter = new RegionTextRenderFilter(rect);
FontRenderFilter fontFilter = new FontRenderFilter();
FilteredTextRenderListener strategy = new FilteredTextRenderListener(
    new LocationTextExtractionStrategy(), regionFilter, fontFilter);
string result = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();

Is there a way to filter out the smaller spaces?

Do those spaces correspond to actual drawn space glyphs or are they produced from insertion point moves? If you don't know, please supply a sample PDF illustrating your issue. — mkl, Mar 22 '15 at 22:38
Unfortunately I can't show you a sample pdf because I have no rights to do so. Can you please tell me how I can determine which kind of spaces they are? — derBasti, Mar 23 '15 at 10:20
*Can you please tell me how I can determine which kind of spaces they are* - Use a PDF browser (e.g. RUPS) and inspect the respective page or xobject streams. Some thorough understanding of the [PDF specification](http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf) is required. — mkl, Mar 23 '15 at 11:13
[This answer](http://stackoverflow.com/a/13645183/1729265) and [this one](http://stackoverflow.com/a/20049810/1729265) deal with a probably related issue: in their case gaps between characters were too small to be recognized as spaces. If your spaces are derived from insertion point moves, those answers may inspire you. — mkl, Mar 23 '15 at 11:17
In that case, i.e. if there are actual space characters, you can hardly expect text extraction not to extract the spaces, can you? — mkl, Mar 24 '15 at 21:33
Yes there are spaces, but in different sizes. "Real" spaces are wider than the spaces I don't want to extract. — derBasti, Mar 24 '15 at 21:36
I'm afraid this doesn't get us anywhere, the information is too indistinct to work on. Unless you share the PDF or at least relevant excerpts from the content streams and resources showing the operations behind both the spaces you want removed and those you don't want removed, I don't see a way to help. — mkl, Mar 25 '15 at 08:12

score 2 · Answer 1 · edited May 23 '17 at 10:28

iText uses the distance between rendered glyphs to decide whether a space is present. The general rule is: if the distance is larger than half the width of a normal space, a space character is recognized. This works quite well in most cases, but it fails completely if the width of a space character cannot be determined for the font used. In my case the width of a space was reported as 0, so even the smallest distance between glyphs was recognized as a space. I based my solution on another answer from mkl to a question that is very similar to yours.
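
The rule described above can be illustrated without iTextSharp at all. The following is only a minimal sketch, not iText's actual code: the `GapRule` class and its `IsSpace` method are hypothetical names, and the gap and width values are made-up numbers in arbitrary units. It shows why a reported space width of 0 turns every tiny gap into a space.

```csharp
using System;

// Hypothetical stand-in for the heuristic described above; not iTextSharp code.
public static class GapRule
{
    // A gap counts as a space when it is wider than half the font's space width.
    public static bool IsSpace(float gap, float spaceWidth)
    {
        return gap > spaceWidth / 2f;
    }

    public static void Main()
    {
        float letterGap = 0.4f; // small gap between letters of one word (made-up value)
        float wordGap = 3.0f;   // larger gap between two words (made-up value)

        // With a sensible space width only the word gap qualifies.
        Console.WriteLine(IsSpace(letterGap, 2.5f)); // False
        Console.WriteLine(IsSpace(wordGap, 2.5f));   // True

        // With a space width of 0 the threshold collapses to 0,
        // so even the small letter gap is reported as a space.
        Console.WriteLine(IsSpace(letterGap, 0f));   // True
    }
}
```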

In short: you need to derive from e.g. SimpleTextExtractionStrategy or LocationTextExtractionStrategy and override the method that converts the distance between glyphs into spaces (renderText or isChunkAtWordBoundary, respectively; iTextSharp follows .NET naming, so expect RenderText and IsChunkAtWordBoundary there).
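
The shape of that override can be sketched with stand-in types. Everything below is hypothetical: `Chunk`, `BaseStrategy`, `FallbackWidthStrategy` and the fallback width of 2.5 are illustrations only and are not iTextSharp classes or values. In real code the same idea would go into a subclass of one of the strategies named above, with member names and signatures checked against the iTextSharp version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-ins; none of these types are iTextSharp classes.
public class Chunk
{
    public string Text;
    public float StartX, EndX;   // horizontal extent of the chunk
    public float SpaceWidth;     // space width reported by the font, may be 0

    public Chunk(string text, float startX, float endX, float spaceWidth)
    {
        Text = text; StartX = startX; EndX = endX; SpaceWidth = spaceWidth;
    }
}

public class BaseStrategy
{
    // Default rule: a gap wider than half a space width is a word boundary.
    protected virtual bool IsWordBoundary(Chunk current, Chunk previous)
    {
        return current.StartX - previous.EndX > current.SpaceWidth / 2f;
    }

    // Concatenates chunks (assumed sorted left to right on one line).
    public string Join(IList<Chunk> chunks)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < chunks.Count; i++)
        {
            if (i > 0 && IsWordBoundary(chunks[i], chunks[i - 1])) sb.Append(' ');
            sb.Append(chunks[i].Text);
        }
        return sb.ToString();
    }
}

public class FallbackWidthStrategy : BaseStrategy
{
    private readonly float fallbackSpaceWidth;

    public FallbackWidthStrategy(float fallbackSpaceWidth)
    {
        this.fallbackSpaceWidth = fallbackSpaceWidth;
    }

    // Override: substitute a fallback width when the font reports none.
    protected override bool IsWordBoundary(Chunk current, Chunk previous)
    {
        float width = current.SpaceWidth > 0f ? current.SpaceWidth : fallbackSpaceWidth;
        return current.StartX - previous.EndX > width / 2f;
    }
}

public static class Demo
{
    public static void Main()
    {
        var chunks = new List<Chunk>
        {
            new Chunk("H", 0f, 5f, 0f),
            new Chunk("i", 5.4f, 7f, 0f),    // gap of about 0.4 inside the word
            new Chunk("y", 10f, 15f, 0f),    // gap of 3.0 between words
            new Chunk("o", 15.4f, 20f, 0f),
        };
        Console.WriteLine(new BaseStrategy().Join(chunks));              // H i y o
        Console.WriteLine(new FallbackWidthStrategy(2.5f).Join(chunks)); // Hi yo
    }
}
```

The fallback width is the tuning knob here: it has to sit between twice the letter gap and twice the word gap for the two kinds of gaps to be told apart.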

You can also refer to the answer I gave here or the original solution by mkl.

The OP indicated that there may actually be space characters between the letters. In that case changing the parameters you refer to will not help. (Probably, though, the OP is wrong and really merely needs some word boundary detection fine tuning...) — mkl, Dec 22 '15 at 12:59
