iText PDFSweep RegexBasedCleanupStrategy does not work in some cases

Question

I'm trying to use iText PDFSweep's RegexBasedCleanupStrategy to redact some words from a PDF. However, I only want to redact a word when it stands alone, not when it appears inside another word; e.g. I want to redact "al" as a single word, but I don't want to redact the "al" in "mineral". So I added word boundaries ("\b") to the regex passed to RegexBasedCleanupStrategy:

  new RegexBasedCleanupStrategy("\\bal\\b")

However, pdfAutoSweep.cleanUp does not work if the word is at the end of a line.

**A** You claim *however the pdfAutoSweep.cleanUp not work* - what do you mean by that? Does `cleanUp` not redact there at all? Or does it redact something wrong? **B** The problem might be a matter of regular expression interpretation. Thus, I'd propose you add the tag [tag:regex]. — mkl, Oct 06 '18 at 21:02
I mean that when the word is at the end of a line, the cleanup doesn't redact anything. If the word I want to redact is in the middle of the line, the cleanup redacts it properly. — J Zou, Oct 07 '18 at 17:51

mkl · Accepted Answer · 2018-10-08T14:32:58.753

In short

The cause of this issue is that the routine that flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter from one line is immediately followed by the first letter of the next which hides the word boundary. One can fix the behavior by adding an appropriate character to the String in case of a line break.
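
To illustrate the effect in isolation, here is a minimal sketch using only java.util.regex, with no iText involved. The class name WordBoundaryDemo, the helper countMatches, and the sample lines "et al" and "mineral water" are made up for this illustration; they merely mimic what happens when two lines are flattened with and without a line break indicator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    /** Counts how often the regular expression matches in the given text. */
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String regex = "\\bal\\b";
        // Hypothetical lines: the first ends with "al", the second starts with "mineral".
        String joined = "et al" + "mineral water";           // flattened without an indicator
        String separated = "et al" + "\n" + "mineral water"; // flattened with a newline
        System.out.println(countMatches(regex, joined));     // prints 0
        System.out.println(countMatches(regex, separated));  // prints 1
    }
}
```

In the joined string the "al" is directly followed by "m", so there is no word boundary after it and the expression finds nothing; with the newline in between, the boundary is there again.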

The problematic code

The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In case of a merely horizontal gap this routine inserts a space character but in case of a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

A possible fix

One can extend the code above to insert a newline character in case of a line break:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
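} else {
    // line break: append a newline so that a word boundary (\b) can match at the end of the previous line
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}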

This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to properly allow recognition of word boundaries should not break anything but indeed should be considered a fix.

One merely might consider adding a different character for a line break, e.g. a plain space ' ' if one does not want to treat vertical gaps any different than horizontal ones. For a general fix one might, therefore, consider making this character a settable property of the strategy.
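
As a rough sketch of what such a settable separator could look like, consider the following simplified, self-contained model. It is not iText code: the class FlattenSketch, its nested Chunk class, and the lineSeparator parameter are hypothetical names for this illustration, and the horizontal gap handling and the index map of the real mapString method are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class FlattenSketch {
    /** Hypothetical stand-in for a text chunk: its text plus the number of the line it sits on. */
    static final class Chunk {
        final String text;
        final int line;

        Chunk(String text, int line) {
            this.text = text;
            this.line = line;
        }
    }

    /** Joins the chunk texts; lineSeparator is appended whenever the line changes. */
    static String flatten(List<Chunk> chunks, String lineSeparator) {
        StringBuilder sb = new StringBuilder();
        Chunk last = null;
        for (Chunk chunk : chunks) {
            if (last != null && chunk.line != last.line) {
                sb.append(lineSeparator);
            }
            sb.append(chunk.text);
            last = chunk;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("et al", 0));
        chunks.add(new Chunk("mineral water", 1));
        Pattern p = Pattern.compile("\\bal\\b");
        System.out.println(p.matcher(flatten(chunks, "")).find());   // prints false
        System.out.println(p.matcher(flatten(chunks, "\n")).find()); // prints true
        System.out.println(p.matcher(flatten(chunks, " ")).find());  // prints true
    }
}
```

With an empty separator the model behaves like the original code; a newline or a plain space both restore the word boundary, and they only differ for expressions that distinguish between the two characters.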

Versions

I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.

The patch works for this issue. A little further information: it looks like it's also related to the converted pdf. When I use wkhtmltopdf to convert an HTML file to PDF, the issue occurs; however, if I use other software to convert, there is no issue. — J Zou, Oct 15 '18 at 21:31
*"it's also related to the converted pdf"* - that may well be; some generators explicitly draw a space character at the end of a line and some don't. If there is such a space character, the original iText code already matches word boundaries at the ends of lines. — mkl, Oct 16 '18 at 05:54
