Decode PDF text correctly with iText for simple text replacement

Question

I'm trying to replace some text in a PDF/A document with iText in .NET. (I know that replacing text in PDFs is not a perfect approach.)

This is what PDF Debugger shows me as the page contents. As far as I understand, Tj shows the "text" and Tm sets the position of the text.

Producer: ABBYY Recognition Server

Filter: FlateDecode

      /F_0 11 Tf
      BT
        1.4445 0 0 1 43.7 797.78 Tm
        [ (\000\033\000\010\000\020\000\021\000\032\000\025) 11 (\000\027) ] TJ
        2.3454 0 0 1 96.5 797.78 Tm
        (\000\001) Tj
        1.4637 0 0 1 102.95 797.78 Tm
        (\000\033\000\022\000\023\000\032\000\010\000\016\000\011\0007) Tj
        2.4545 0 0 1 160.55 797.78 Tm
        (\000\001) Tj
        1.0559 0 0 1 167.3 797.78 Tm
        (\000\031\000\027\0004) Tj
        1.7454 0 0 1 182.4 797.78 Tm
        (\000\001) Tj
        1.0403 0 0 1 187.19 797.78 Tm
        [ (\000\035\000\023\000\010\000\007) 9 (\0004) ] TJ
        1.7454 0 0 1 207.85 797.78 Tm
        (\000\001) Tj

This is currently my test code:

// Requires: using System; using iText.Kernel.Pdf; using iText.Pdfa;
var pdfDoc = new PdfADocument(new PdfReader(src), new PdfWriter(dest));
PdfPage page = pdfDoc.GetFirstPage();
PdfDictionary dict = page.GetPdfObject();

// /Contents is either a single stream or an array of streams; normalize it to an array.
PdfObject contents = dict.Get(PdfName.Contents);
PdfArray refs = contents is PdfArray ? (PdfArray)contents : new PdfArray(contents);

for (int i = 0; i < refs.Size(); i++)
{
    try
    {
        PdfStream stream = refs.GetAsStream(i);
        // true = return the bytes with the stream filters (here FlateDecode) already applied
        byte[] data = stream.GetBytes(true);
        Console.WriteLine(ByteArrayToString(data));

        // This is just a test
        string replacedData = ByteArrayToString(data).Replace("the", "abc");
        stream.SetData(StringToByteArray(replacedData));
    }
    catch (Exception e)
    {
        // Report which stream failed and why instead of swallowing the error
        Console.WriteLine("i = " + i + ": " + e.Message);
    }
}

// The modified document is only written to dest when it is closed
pdfDoc.Close();

//String byte converter
private string ByteArrayToString(byte[] arr)
{
    var enc = new System.Text.UTF8Encoding();
    return enc.GetString(arr);
}

private byte[] StringToByteArray(string str)
{
    var enc = new System.Text.UTF8Encoding();
    return enc.GetBytes(str);
}
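A side note on the two helpers above: a content stream is a sequence of bytes, not UTF-8 text, and decoding arbitrary bytes as UTF-8 and encoding them again does not always give the same bytes back (invalid sequences are replaced with U+FFFD). The following is only a sketch of a byte-preserving alternative, assuming ISO-8859-1 is used purely as a transport encoding; the class and method names are made up for illustration, and it does not make the text any more readable.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of iText.
public static class StreamBytes
{
    // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same value,
    // so bytes -> string -> bytes returns the original bytes unchanged.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static string ToText(byte[] data) => Latin1.GetString(data);

    // Note: characters above U+00FF cannot be represented and are replaced with '?'.
    public static byte[] ToBytes(string text) => Latin1.GetBytes(text);

    // True when decoding and re-encoding with the given encoding is lossless for data.
    public static bool RoundTrips(Encoding encoding, byte[] data) =>
        encoding.GetBytes(encoding.GetString(data)).SequenceEqual(data);
}
```

With this, `StreamBytes.RoundTrips(Encoding.UTF8, new byte[] { 0x28, 0x80, 0x29 })` is false because the lone byte 0x80 is not valid UTF-8, while `StreamBytes.ToBytes(StreamBytes.ToText(data))` returns the same bytes for any input.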

With ByteArrayToString(stream.GetBytes(true)) the output looks like this:

    q/F_0 56 Tf BT 126.25 687.12 TD[(\0\u0002)6(\0\u0003)4(\0\u0004)]TJ 1.1499 0 0 1 215.75 687.12 Tm(\0\u0001)Tj 1.0093 0 0 1 231.85 687.12 Tm(\0\u0005\0\u0006\0\a\0\b\0\t\0\b\0\v)Tj/G cs 149.30 0 TD(\0\u0001)Tj 1.0428 0 0 1 33.100 638.42 Tm(\0\u0003\0\f\0\u000e)Tj 1.1499 0 0 1 120.70 638.42 Tm(\0\u0001)Tj 1.0323 0 0 1 136.80 638.42 Tm[(\0\u000f\0\u0010\0\b\0\t\0\u0011\0\u0010)2(\0\u0004\0\u0012\0\u0013\0\b\0\u0012\0\u0013)2(\0\t\0\u0010)]TJ/F_0 22 Tf 1.0193 0 0 1 55.900 572.62 Tm[(\0\u0014\0\u0010\0\b\0\a\0\u0015\0\u0011)5(\0\u0010)]TJ 1.2727 0 0 1 124.30 572.62 Tm(\0\u0001)Tj 1.0338 0 0 1 131.30 572.62 Tm(\0\u0016\0\u0003\0\u0017)Tj 1.0909 0 0 1 160.30 572.62 Tm(\0\u0001)Tj 1.0387 0 0 1 166.30 572.62 ...

In the console window:

    q/F_0 56 Tf BT 126.25 687.12 TD[(?)6(?)4(?)]TJ 1.1499 0 0 1 215.75 687.12 Tm(?)Tj 1.0093 0 0 1 231.85 687.12 Tm(?????   ??)Tj/G cs 149.30 0 TD(?)Tj 1.0428 0 0 1 33.100 638.42 Tm(???)Tj 1.1499 0 0 1 120.70 638.42 Tm(?)Tj 1.0323 0 0 1 13 ...

So I think I need a filter or decoder to "decode" the text, make my replacement, and then encode it back again. Could someone give me a clue about what I'm doing wrong? I have never really worked with PDFs or iText before.

It works with "simple" PDFs where the text is not "encrypted" the way it is in this PDF.
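Judging from the dump above, the strings may not be encrypted at all: every character seems to be written as a two-byte code (for example \000\031 is 0x0019) that only has a meaning relative to the font /F_0. That would explain why Replace("the", "abc") finds nothing. The sketch below only illustrates that idea under two assumptions: that the codes really are big-endian two-byte values, and that a code-to-Unicode table is available (in a real file it would have to come from the font, for example its ToUnicode CMap, if present). The class, method, and parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iText.
public static class GlyphCodes
{
    // Splits the raw bytes of a string operand into big-endian two-byte codes.
    public static List<int> SplitCodes(byte[] operand)
    {
        if (operand.Length % 2 != 0)
            throw new ArgumentException("Expected an even number of bytes.", nameof(operand));

        var codes = new List<int>();
        for (int i = 0; i < operand.Length; i += 2)
            codes.Add((operand[i] << 8) | operand[i + 1]);
        return codes;
    }

    // Looks every code up in the supplied table; codes without an entry become U+FFFD.
    public static string Decode(byte[] operand, IReadOnlyDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (int code in SplitCodes(operand))
            sb.Append(toUnicode.TryGetValue(code, out string text) ? text : "\uFFFD");
        return sb.ToString();
    }
}
```

For the operand (\000\031\000\027\0004) from the dump, the raw bytes are 00 19 00 17 00 34, so SplitCodes yields 0x0019, 0x0017 and 0x0034. Which letters those stand for depends entirely on the font's table, which is why a plain string replace on the stream cannot find readable words.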

I can't share my PDF here, but I will try to find a similar one.

If you read here, there are multiple variations of the PDF/A spec, as it evolved over time. I think you need to be more specific about which version of PDF/A your file is encoded with, so that you can decode it based on the correct specification. https://help.abbyy.com/en-us/finereader/14/user_guide/savepdf_a - Glenn Ferrie, May 25 '20 at 16:30
Your code only works for very simple PDFs. In other cases it will do nothing or even damage the PDF contents. Please read [this answer](https://stackoverflow.com/a/60655298/1729265) to understand why in general trying to edit PDF contents is highly non-trivial. That you mention PDF/A actually adds chances for some more complications... - mkl, May 25 '20 at 16:40
OK, good point, I will check it out. VeraPDF shows me "Validation Profile: PDF/A-1B validation profile. PDF/A compliance: Passed." - theChaosCoder, May 25 '20 at 16:51
