I need to parse a PDF document. I already implemented the parser and used the Library iText and till now it worked without any problems.
But no I need to parse another document which gets very strange whitespaces in the middle of words. As example I get:
Vo rber eitung auf die Motorr adsaison. Viele Motorr adf ahr er
All the bold words should be connected, but somehow the PDF Parser is adding whitespaces into the words. But when I copy and paste the content from the PDF into a Textfile I dont get these spaces.
First I thought it's because of the PDF Parsing library I'm using, but also with another library I get the exact same issue.
I had a look on the singleSpaceWidth
from the parsed words and I noticed that it's varying always then, when it's adding a whitespace. I tried to put them manually together. But since there isn't really a pattern to recombine the words it's almost impossible.
Did anyone else have a similar issue or even a solution to that problem?
As requested, here is some more information:
- iText Version 5.2.1
- http://prine.ch/whitespacesProblem.pdf (Link to the pdf)
Parsing with SemTextExtractionStrategy:
PdfReader reader = new PdfReader("data/SpecialTests/SuedostSchweiz/" + src);
SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
// Set the page number on the strategy. Is used in the Parsing strategies.
semTextExtractionStrategy.pageNumber = i;
// Parse text from page
PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy);
}
Here the SemTextExtractionStrategy method which actually parses the text. There I manually add after every parsed word a whitespace, but somehow it does split the words in the detection:
@Override
public void parseText(TextRenderInfo renderInfo, int pageNumber) {
this.pageNumber = pageNumber;
String text = renderInfo.getText();
currTextBlock.getText().append(text + " ");
....
}
Here is the whole SemTextExtraction Class but in there it does only call the method from above (parseText):
public class SemTextExtractionStrategy implements TextExtractionStrategy {
// Text Extraction Strategies
public ColumnDetecter columnDetecter = new ColumnDetecter();
// Image Extraction Strategies
public ImageRetriever imageRetriever = new ImageRetriever();
public int pageNumber = -1;
public ArrayList<TextParsingStrategy> textParsingStrategies = new ArrayList<TextParsingStrategy>();
public ArrayList<ImageParsingStrategy> imageParsingStrategies = new ArrayList<ImageParsingStrategy>();
public SemTextExtractionStrategy() {
// Add all text parsing strategies which are later on applied on the extracted text
// textParsingStrategies.add(fontSizeMatcher);
textParsingStrategies.add(columnDetecter);
// Add all image parsing strategies which are later on applied on the extracted text
imageParsingStrategies.add(imageRetriever);
}
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo renderInfo) {
// TEXT PARSING
for(TextParsingStrategy strategy : textParsingStrategies) {
strategy.parseText(renderInfo, pageNumber);
}
}
@Override
public void endTextBlock() {
}
@Override
public void renderImage(ImageRenderInfo renderInfo) {
for(ImageParsingStrategy strategy : imageParsingStrategies) {
strategy.parseImage(renderInfo);
}
}
}