I'm trying to use the iText PDFSweep RegexBasedCleanupStrategy to redact some words from a PDF. However, I only want to redact a word where it stands alone, not where it appears inside another word; e.g. I want to redact "al" as a single word, but I don't want to redact the "al" in "mineral". So I added word boundaries ("\b") to the regex passed as parameter to RegexBasedCleanupStrategy:

  new RegexBasedCleanupStrategy("\\bal\\b")

However, pdfAutoSweep.cleanUp does not work if the word is at the end of a line.

mkl
J Zou
  • **A** You claim *however the pdfAutoSweep.cleanUp not work* - what do you mean by that? Does `cleanUp` not redact there at all? Or does it redact something wrong? **B** The problem might be a matter of regular expression interpretation. Thus, I'd propose you add the tag [tag:regex]. – mkl Oct 06 '18 at 21:02
  • I mean that when the word is at the end of a line, the cleanup doesn't redact anything. If the word I want to redact is in the middle of the line, the cleanup redacts it properly. – J Zou Oct 07 '18 at 17:51
  • Ok, I could indeed reproduce the issue easily. – mkl Oct 08 '18 at 13:48

1 Answer

In short

The cause of this issue is that the routine that flattens the extracted text chunks into a single String for applying the regular expression does not insert any indicator for a line break. Thus, in that String the last letter from one line is immediately followed by the first letter of the next which hides the word boundary. One can fix the behavior by adding an appropriate character to the String in case of a line break.
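The effect can be reproduced with plain `java.util.regex`, without any PDF involved. The following is a minimal sketch under simplifying assumptions: the class `LineBreakBoundaryDemo`, its methods, and the sample lines are made up for illustration, and the lines are joined as whole strings rather than chunk by chunk as iText does.

```java
import java.util.List;
import java.util.regex.Pattern;

public class LineBreakBoundaryDemo {
    // Joins the lines with the given separator; an empty separator mimics
    // a flattening step that inserts nothing at a line break.
    static String flatten(List<String> lines, String separator) {
        return String.join(separator, lines);
    }

    // True if "al" occurs as a standalone word in the text.
    static boolean containsWord(String text) {
        return Pattern.compile("\\bal\\b").matcher(text).find();
    }

    public static void main(String[] args) {
        // "al" is the last word of the first line.
        List<String> lines = List.of("the authors are Smith et al", "so the result holds");

        // Without a separator the text reads "... et also the ...",
        // so there is no word boundary after "al".
        System.out.println(containsWord(flatten(lines, "")));   // prints false

        // With a newline between the lines the boundary is visible again.
        System.out.println(containsWord(flatten(lines, "\n"))); // prints true
    }
}
```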

The problematic code

The routine that flattens the extracted text chunks into a single String is CharacterRenderInfo.mapString(List<CharacterRenderInfo>) in the package com.itextpdf.kernel.pdf.canvas.parser.listener. In case of a merely horizontal gap this routine inserts a space character but in case of a vertical offset, i.e. a line break, it adds nothing extra to the StringBuilder in which the String representation is generated:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

A possible fix

One can extend the code above to insert a newline character in case of a line break:

if (chunk.sameLine(lastChunk)) {
    // we only insert a blank space if the trailing character of the previous string wasn't a space, and the leading character of the current string isn't a space
    if (chunk.getLocation().isAtWordBoundary(lastChunk.getLocation()) && !chunk.getText().startsWith(" ") && !chunk.getText().endsWith(" ")) {
        sb.append(' ');
    }
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
} else {
    sb.append('\n');
    indexMap.put(sb.length(), i);
    sb.append(chunk.getText());
}

This CharacterRenderInfo.mapString method is only called from the RegexBasedLocationExtractionStrategy method getResultantLocations() (package com.itextpdf.kernel.pdf.canvas.parser.listener), and only for the task mentioned, i.e. applying the regular expression in question. Thus, enabling it to properly allow recognition of word boundaries should not break anything but indeed should be considered a fix.

Alternatively, one might add a different character for a line break, e.g. a plain space ' ', if one does not want to treat vertical gaps any differently from horizontal ones. For a general fix one might therefore consider making this character a settable property of the strategy.
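To illustrate the idea of a settable line-break character, here is a simplified, self-contained model of the flattening step. It is not the iText code: `FlattenSketch`, `CharChunk`, `chunksOf`, `flatten`, and `firstMatchChunk` are hypothetical names, each chunk is assumed to be a single character tagged with a line number, and the horizontal-gap logic is left out (spaces are assumed to be drawn as characters).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FlattenSketch {
    /** Hypothetical stand-in for a character chunk: one character plus its line number. */
    static final class CharChunk {
        final char ch;
        final int line;
        CharChunk(char ch, int line) { this.ch = ch; this.line = line; }
    }

    /** Turns each character of each line into one chunk. */
    static List<CharChunk> chunksOf(String... lines) {
        List<CharChunk> chunks = new ArrayList<>();
        for (int line = 0; line < lines.length; line++) {
            for (char ch : lines[line].toCharArray()) {
                chunks.add(new CharChunk(ch, line));
            }
        }
        return chunks;
    }

    /** Flattens the chunks and fills indexMap with string offset -> chunk index. */
    static String flatten(List<CharChunk> chunks, String lineBreak, Map<Integer, Integer> indexMap) {
        StringBuilder sb = new StringBuilder();
        CharChunk last = null;
        for (int i = 0; i < chunks.size(); i++) {
            CharChunk chunk = chunks.get(i);
            if (last != null && chunk.line != last.line) {
                // Append the separator before recording the offset, so that the
                // recorded offset still points at the chunk's own character.
                sb.append(lineBreak);
            }
            indexMap.put(sb.length(), i);
            sb.append(chunk.ch);
            last = chunk;
        }
        return sb.toString();
    }

    /** Returns the chunk index at which the first match starts, or -1 if there is none. */
    static int firstMatchChunk(List<CharChunk> chunks, String lineBreak, String regex) {
        Map<Integer, Integer> indexMap = new HashMap<>();
        String text = flatten(chunks, lineBreak, indexMap);
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return -1;
        }
        Integer idx = indexMap.get(m.start());
        return idx == null ? -1 : idx;
    }

    public static void main(String[] args) {
        List<CharChunk> chunks = chunksOf("et al", "so on");
        System.out.println(firstMatchChunk(chunks, "", "\\bal\\b"));   // prints -1
        System.out.println(firstMatchChunk(chunks, "\n", "\\bal\\b")); // prints 3
        System.out.println(firstMatchChunk(chunks, " ", "\\bal\\b"));  // prints 3
    }
}
```

In this model an empty separator reproduces the original behavior, while either a newline or a space restores the word boundary. Because the separator is appended before the offset is recorded, match offsets still map back to the correct chunk.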

Versions

I tested with iText 7.1.4-SNAPSHOT and PDFSweep 2.0.3-SNAPSHOT.

mkl
  • Hi mkl, is this patch in the 7.1.5 snapshot? – J Zou Oct 15 '18 at 18:19
  • The patch works for this issue. A little further information: looks like it's also related to the converted pdf. When I use wkhtmltopdf to convert an HTML file to PDF, the issue occurs; when I use other software for the conversion, it does not. – J Zou Oct 15 '18 at 21:31
  • *"it's also related to the converted pdf"* - that may well be; some generators explicitly draw a space character at the end of a line and some don't. If there is such a space character, even the original iText code matches word boundaries at the ends of lines. – mkl Oct 16 '18 at 05:54