Removing text-based watermarks using iTextSharp

Question

According to this post (Removing Watermark from PDF iTextSharp), @mkl's code works fine for ExtGState-based graphical watermarks. However, I tested that code on some files that have text-based watermarks behind the PDF contents (like this file: http://s000.tinyupload.com/index.php?file_id=05961025831018336372) and it did not remove them. I have tried multiple solutions found on this site without success. Can anyone help me remove this type of watermark by adapting @mkl's solution?

thanks

What you are calling a "watermark" is really just text. True, it looks different than all of the other text on the page but it is still just regular text. Check out [this](http://stackoverflow.com/q/20176614/231316), [this](http://stackoverflow.com/q/12674195/231316) or possibly [this](http://stackoverflow.com/a/17718641/231316). — Chris Haas, May 18 '16 at 13:49
@ChrisHaas But those posts don't solve the problem. The text placed behind the contents isn't a text layer that can be removed by parsing it as a stream. — MKH, May 19 '16 at 03:27

score 2 · Accepted Answer · edited May 23 '17 at 10:33

Just like in the case of the question the OP references (Removing Watermark from PDF iTextSharp), you can remove the watermark from your sample file by building upon the PdfContentStreamEditor class presented in my answer to that question.

In contrast to the solution in that other answer, though, we do not want to hide vector graphics based on some transparency value. Instead we want to remove the writing "Archive of SID" that sits behind the page contents of the sample file.

First we have to select a criterion by which to recognize the background text. Let's use the fact that the writing is by far the largest on the page. Using this criterion makes the task at hand essentially the iTextSharp/C# counterpart to this iText/Java solution.

There is a problem, though: As mentioned in that answer:

The gs().getFontSize() used in the second sample may not be what you expect it to be as sometimes the coordinate system has been stretched by the current transformation matrix and the text matrix. The code can be extended to consider these effects.

Exactly this is happening here: A font size of 1 is used and that small text then is stretched by means of the text matrix:

/NxF0 1 Tf
49.516754 49.477234 -49.477234 49.516754 176.690933 217.316086 Tm
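
To make the stretching concrete, here is a minimal sketch in plain C# (no iTextSharp involved) that works out the effective size for the operands quoted above. The class and method names are illustrative only, and the sketch assumes the current transformation matrix is the identity, which need not hold for every file.

```csharp
using System;
using System.Globalization;

static class EffectiveFontSize
{
    // tm holds the six Tm operands a b c d e f of the PDF matrix
    //   [a b 0]
    //   [c d 0]
    //   [e f 1]
    // Multiplying the row vector (0, fontSize, 0) by this matrix yields
    // (fontSize * c, fontSize * d, 0); the translation (e, f) drops out.
    public static double Compute(double fontSize, double[] tm)
    {
        double x = fontSize * tm[2];
        double y = fontSize * tm[3];
        return Math.Sqrt(x * x + y * y);
    }

    static void Main()
    {
        // Font size 1 from "/NxF0 1 Tf", matrix from the Tm line.
        double[] tm = { 49.516754, 49.477234, -49.477234, 49.516754, 176.690933, 217.316086 };
        double size = Compute(1, tm);
        Console.WriteLine(size.ToString("F2", CultureInfo.InvariantCulture)); // prints 70.00
    }
}
```

So the nominal font size of 1 ends up as text of roughly size 70, rotated by about 45 degrees, which is why a check against the raw font size alone would never flag it.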

So we need to take the text matrix into account. Unfortunately, the text matrix is a private member of PdfContentStreamProcessor, so we will also need some reflection magic.

A possible background remover for that file looks like this:

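using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Drops every text-showing operation whose effective font size exceeds a threshold.
// PdfContentStreamEditor is the helper class from the referenced answer.
class BigTextRemover : PdfContentStreamEditor
{
    protected override void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
    {
        if (TEXT_SHOWING_OPERATORS.Contains(operatorLit.ToString()))
        {
            // Transform a vertical vector of the nominal font size by the text matrix
            // and the current transformation matrix to get the size as rendered.
            Vector fontSizeVector = new Vector(0, Gs().FontSize, 0);
            Matrix textMatrix = (Matrix) textMatrixField.GetValue(this);
            Matrix currentTransformationMatrix = Gs().GetCtm();
            Vector transformedVector = fontSizeVector.Cross(textMatrix).Cross(currentTransformationMatrix);
            float transformedFontSize = transformedVector.Length;
            // Skip the operation instead of writing it to the edited content stream.
            if (transformedFontSize > 40)
                return;
        }
        base.Write(processor, operatorLit, operands);
    }

    // textMatrix is a private field of PdfContentStreamProcessor, hence the reflection.
    FieldInfo textMatrixField = typeof(PdfContentStreamProcessor).GetField("textMatrix", BindingFlags.NonPublic | BindingFlags.Instance);
    List<string> TEXT_SHOWING_OPERATORS = new List<string>{"Tj", "'", "\"", "TJ"};
}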

The threshold of 40 has been chosen with that text matrix in mind: the text matrix alone scales the font size of 1 to roughly 70.

Applying it like this

[Test]
public void testRemoveBigText()
{
    string source = @"sid-1.pdf";
    string dest = @"sid-1-noBigText.pdf";

    using (PdfReader pdfReader = new PdfReader(source))
    using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(dest, FileMode.Create, FileAccess.Write)))
    {
        PdfContentStreamEditor editor = new BigTextRemover();

        for (int i = 1; i <= pdfReader.NumberOfPages; i++)
        {
            editor.EditPage(pdfStamper, i);
        }
    }
}

to your sample file results in a PDF without the "Archive of SID" writing.

I got the iTextSharp v5.5.13.1 package from NuGet, but I can't find PdfContentStreamEditor. Is there a specific version of the package to use, or am I looking at the wrong library? — Thomas Jomphe, Feb 12 '20 at 21:11
Sorry, I skipped this paragraph without realizing. I have tried this code with my PDF, but I don't understand why the code inside the if (TEXT_SHOWING_OPERATORS.Contains(operatorLit.ToString())) block is never executed. — Thomas Jomphe, Feb 12 '20 at 21:47
That indicates that no text is drawn in the page content stream. Probably text is drawn in XObjects or in patterns. Or what you perceive as text actually is drawn as vector or bitmap graphics. You have to know what you want to remove, not merely what it looks like in a viewer but what it is internally. — mkl, Feb 12 '20 at 22:05
