Strange whitespaces when parsing a PDF

Question

I need to parse a PDF document. I have already implemented the parser using the iText library, and until now it has worked without any problems.

But now I need to parse another document, and the extracted text contains very strange whitespace in the middle of words. For example, I get:

Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er

The split fragments should be joined into whole words (Vorbereitung, Motorradsaison, Motorradfahrer), but somehow the PDF parser inserts whitespace into the words. When I copy and paste the content from the PDF into a text file, I don't get these spaces.

At first I thought the PDF parsing library I'm using was at fault, but I get exactly the same result with another library.

I looked at the singleSpaceWidth of the parsed words and noticed that it changes whenever a whitespace is inserted. I tried to join the fragments manually, but since there is no real pattern to go by, recombining the words is almost impossible.

Has anyone else had a similar issue, or even found a solution to this problem?

As requested, here is some more information:

iText Version 5.2.1
http://prine.ch/whitespacesProblem.pdf (Link to the pdf)

Parsing with SemTextExtractionStrategy:

PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);

SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Set the page number on the strategy. Is used in the Parsing strategies.
    semTextExtractionStrategy.pageNumber = i;

    // Parse text from page
    PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}

Here is the SemTextExtractionStrategy method that actually parses the text. In it I manually append a whitespace after every parsed word, but somehow the words already arrive split:

@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {      

    this.pageNumber = pageNumber;

    String text = renderInfo.getText();

    currTextBlock.getText().append(text + " ");

    ....
}
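A minimal sketch of an alternative to appending a space after every fragment, assuming each call delivers only a fragment of a line together with its horizontal position. The Chunk type, its fields, the half-a-space threshold, and the coordinates are hypothetical stand-ins for illustration; they are not iText types or values measured from the PDF. The idea is to add a space only when the gap to the previous fragment is clearly wider than kerning.

```java
import java.util.Arrays;
import java.util.List;

final class GapAwareJoiner {

    /** Hypothetical stand-in for the geometry of one text fragment (not an iText class). */
    static final class Chunk {
        final String text;
        final float startX;     // x coordinate where the fragment begins
        final float endX;       // x coordinate where the fragment ends
        final float spaceWidth; // width of a space character in the fragment's font

        Chunk(String text, float startX, float endX, float spaceWidth) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
            this.spaceWidth = spaceWidth;
        }
    }

    /** Joins the fragments of one line, adding a space only if the gap exceeds half a space width. */
    static String join(List<Chunk> line) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : line) {
            if (prev != null) {
                float gap = c.startX - prev.endX;
                // Assumed threshold: small or negative gaps are treated as kerning, not as a word break.
                if (gap > prev.spaceWidth / 2f) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented coordinates, chosen only to illustrate the two cases.
        List<Chunk> line = Arrays.asList(
                new Chunk("Motorr", 0f, 30f, 3f),
                new Chunk("adf", 29.7f, 45f, 3f),     // slight overlap: kerning
                new Chunk("ahr", 44.7f, 60f, 3f),
                new Chunk("er", 60f, 70f, 3f),
                new Chunk("freuen", 73f, 100f, 3f));  // gap of one full space width
        System.out.println(join(line));
    }
}
```

With a real library the positions would have to come from whatever geometry the render callback exposes; the threshold is a tuning knob rather than a fixed rule.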

Here is the whole SemTextExtractionStrategy class; all it does is call the method above (parseText):

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Text Extraction Strategies
    public ColumnDetecter columnDetecter = new ColumnDetecter();

    // Image Extraction Strategies
    public ImageRetriever imageRetriever = new ImageRetriever();

    public int pageNumber = -1;

    public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
    public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();

    public SemTextExtractionStrategy() {

        // Add all text parsing strategies which are later on applied on the extracted text
        // textParsingStrategies.add(fontSizeMatcher);
        textParsingStrategies.add(columnDetecter);

        // Add all image parsing strategies which are later on applied on the extracted text
        imageParsingStrategies.add(imageRetriever);
    }

    @Override
    public void beginTextBlock() {

    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        // TEXT PARSING
        for(TextParsingStrategy strategy : textParsingStrategies) {
            strategy.parseText(renderInfo, pageNumber);
        }
    }

    @Override
    public void endTextBlock() {

    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
        for(ImageParsingStrategy strategy : imageParsingStrategies) {
            strategy.parseImage(renderInfo);
        }
    }
}

please tell the version of iText you are using and somehow you need to provide the PDF also, and the code that you do parsing with. — Eugene, Aug 10 '12 at 12:41
@tobaiasjl Thats a looong time ago.. But I kind of have in my back head that the PDF was corrupted and with a newer generated PDF the problem didn't occur.. — Prine, Sep 29 '15 at 14:04
@NinjaOnSafari If I remember correctly it was originally a Word Doc and we recreate it with another word version.. But not 100% sure, that was 3 years ago ;) — Prine, Sep 29 '15 at 14:12
@Prine hmm bummer i cannot do that... do you know what the guy below used to generate the pdf? — NinjaOnSafari, Sep 29 '15 at 14:14
@NinjaOnSafari Well he used the "gs" command in the terminal, but you have to ask him for more details... — Prine, Sep 29 '15 at 14:16

score 6 · Answer 1 · edited Jun 20 '20 at 09:12

Stray whitespace in text extracted from PDFs is a known issue, as described in Roland's answer here and in the first comment on https://issues.apache.org/jira/browse/TIKA-724

The fix that worked for me is the one posted by huuhungus at https://github.com/smalot/pdfparser/issues/72

It is specific to PDFParser: if you know you will hit this problem, change the library code that adds the extra space:

In src/Smalot/PdfParser/Object.php, comment out this line:
   $text .= ' ';
This does not fix it completely, but the result is acceptable.

Other libraries may allow similar workarounds, which could help with this issue in some cases.

answered Jan 05 '17 at 12:52

user3134164


iText 5.2.1 is an ancient version now. Current versions have properties / overridable methods to fine-tune in which situations iText adds a space and in which it does not. Never adding a space is also a bad choice in general: numerous PDFs will then have their text extracted with hardly any spaces at all. – mkl Jan 05 '17 at 16:46

Roland Illig · Accepted Answer · 2015-09-30T17:19:11.063

I have processed the given PDF file with the following Ghostscript command:

gs -o out.pdf -q -sDEVICE=pdfwrite -dOptimize=false -dUseFlateCompression=false -dCompressPages=false -dCompressFonts=false whitespacesProblem.pdf

This command created a file out.pdf in which the streams are not compressed, so it is easier to read. The interesting part is in line 52, which I split into multiple lines for readability:

[
  (&;&)-287.988
  (672744)29.9906
  (+\(%)30.01
  (+!4)29.9876
  (&4)-287.989
  (%4)30.0039
  (&1&8)-287.975
  (3=\)!)-288.021
  (*&4)30.0212
  (&=23)-287.996
  (+1%)-287.99
  (\(=&)-288.011
  (8&1&)-287.974
  (672744)29.9906
  (+\(3+=378$)-250.977
  (#7\)!)
]TJ

Between the parentheses are the text characters. I changed some of them and watched the rendered PDF file to see which character represents which glyph. Then I decoded the text:

[
  (ele)-287.988
  (Motorr)29.9906 ***
  (adf)30.01 ***
  (ahr)29.9876 ***
  (er)-287.989
  (fr)30.0039
  (euen)-287.975
  (sich)-288.021
  ...
]
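The decoding step can be sketched as a simple substitution table. The mapping below is read off by comparing the two arrays in this answer (for example `&;&` against `ele` and `672744` against `Motorr`); it covers only the glyph codes that appear in the decoded excerpt, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class GlyphDecoder {

    // Glyph code -> character, derived from the raw and decoded arrays shown in this answer.
    static final Map<Character, Character> GLYPHS = new HashMap<>();
    static {
        String codes = "&;6724+(%!183=)";
        String chars = "elMotradfhunsic";
        for (int i = 0; i < codes.length(); i++) {
            GLYPHS.put(codes.charAt(i), chars.charAt(i));
        }
    }

    /**
     * Decodes the content of one PDF string (the text between the outer parentheses).
     * Only the backslash escapes seen in this excerpt are handled; unknown codes become '?'.
     */
    static String decode(String pdfString) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pdfString.length(); i++) {
            char c = pdfString.charAt(i);
            if (c == '\\' && i + 1 < pdfString.length()) {
                c = pdfString.charAt(++i); // take the escaped character literally
            }
            Character glyph = GLYPHS.get(c);
            out.append(glyph != null ? glyph : '?');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("672744")); // Motorr
        System.out.println(decode("+\\(%"));  // adf
        System.out.println(decode("3=\\)!")); // sich
    }
}
```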

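So there are indeed horizontal adjustments between the pieces of text. The values around -288 are the real word gaps; the values around 30 (marked *** above) shift the next piece slightly in the opposite direction, which in your case is probably the kerning of the font. The question is now how your PDF library interprets these adjustments, and it seems to me that even this "negative whitespace" is turned into a space in the resulting string.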
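To illustrate how an extractor could tell the two kinds of adjustment apart, here is a minimal sketch that joins the decoded strings and emits a space only for large negative numbers. In a TJ array the number is subtracted from the current text position, in thousandths of a text space unit, so a negative value widens the gap. The threshold of 100 is an assumption picked for this excerpt, and the class is not part of any PDF library.

```java
final class TjJoiner {

    /**
     * Joins the strings of a TJ array. adjustments[i] is the number that follows strings[i].
     * A space is emitted only when the adjustment widens the gap by more than the threshold.
     */
    static String join(String[] strings, double[] adjustments, double wordGapThreshold) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < strings.length; i++) {
            sb.append(strings[i]);
            boolean hasNext = i + 1 < strings.length;
            // Negative adjustment = extra gap; small positive values (kerning) never produce a space.
            if (hasNext && i < adjustments.length && -adjustments[i] > wordGapThreshold) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Strings and numbers taken from the decoded excerpt in this answer.
        String[] strings = {"ele", "Motorr", "adf", "ahr", "er", "fr", "euen", "sich"};
        double[] adjustments = {-287.988, 29.9906, 30.01, 29.9876, -287.989, 30.0039, -287.975, -288.021};
        System.out.println(join(strings, adjustments, 100.0)); // ele Motorradfahrer freuen sich
    }
}
```

A fixed threshold is only workable for a single font and size; a real extractor would more plausibly compare the gap with the width of the font's space glyph.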

It’s Ghostscript; I’ve edited the answer to make that clear. Thanks for the hint. — Roland Illig, Sep 30 '15 at 17:19
There is no way to get rid of the whitespace in the PDF file, since it’s just there. I don’t know whether iText can handle this and how, since I don’t know iText. In this answer I just explained where the additional whitespace came from. — Roland Illig, Sep 30 '15 at 17:24

score 0 · Answer 3 · answered Aug 10 '12 at 13:34

Because your document is laid out in columns, the most likely source of the error is the SemTextExtractionStrategy class. I suspect that the ColumnDetecter class is to blame rather than iText. I can only guess that it works from the size of each column and then retrieves the text based on that.

If you just want the text, a simpler implementation based on the size of the column would do.

Eugene


Thanks for your answer. I will definitely have a look at the ColumnDetecter. But the parseText method is from this class, and there I get the output directly from the iText library, where the words are already split. – Prine Aug 10 '12 at 14:20
