I'm trying to replace some text in a PDF/A document using iText
in .NET. (I know that replacing text in PDFs is not reliable in general.)
This is what PDF Debugger shows me as the page contents. As far as I understand, Tj (and TJ)
should show the "text" and Tm
sets the text matrix, i.e. the position and scaling of the text.
Producer: ABBYY Recognition Server
Filter: FlateDecode
/F_0 11 Tf
BT
1.4445 0 0 1 43.7 797.78 Tm
[ (\000\033\000\010\000\020\000\021\000\032\000\025) 11 (\000\027) ] TJ
2.3454 0 0 1 96.5 797.78 Tm
(\000\001) Tj
1.4637 0 0 1 102.95 797.78 Tm
(\000\033\000\022\000\023\000\032\000\010\000\016\000\011\0007) Tj
2.4545 0 0 1 160.55 797.78 Tm
(\000\001) Tj
1.0559 0 0 1 167.3 797.78 Tm
(\000\031\000\027\0004) Tj
1.7454 0 0 1 182.4 797.78 Tm
(\000\001) Tj
1.0403 0 0 1 187.19 797.78 Tm
[ (\000\035\000\023\000\010\000\007) 9 (\0004) ] TJ
1.7454 0 0 1 207.85 797.78 Tm
(\000\001) Tj
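To make the dump above easier to read, here is a minimal sketch that turns the body of one of these literal strings into numeric character codes. It assumes that /F_0 is a composite font using two bytes per character (the dump alone does not confirm this), and the names `PdfStringCodes`, `DecodeLiteral` and `ToTwoByteCodes` are made up for illustration; they are not part of iText. Line continuations inside strings are not handled.

```csharp
using System;
using System.Collections.Generic;

public static class PdfStringCodes
{
    // Decodes the body of a PDF literal string (without the outer parentheses)
    // into raw bytes, resolving backslash escapes such as \033 or \n.
    public static byte[] DecodeLiteral(string body)
    {
        var bytes = new List<byte>();
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (c != '\\') { bytes.Add((byte)c); continue; }
            if (++i >= body.Length) break; // dangling backslash: ignore it
            c = body[i];
            if (c >= '0' && c <= '7')
            {
                // An octal escape has one to three octal digits.
                int value = 0, digits = 0;
                while (digits < 3 && i < body.Length && body[i] >= '0' && body[i] <= '7')
                {
                    value = value * 8 + (body[i] - '0');
                    i++;
                    digits++;
                }
                i--; // the for loop advances past the last digit
                bytes.Add((byte)(value & 0xFF));
                continue;
            }
            switch (c)
            {
                case 'n': bytes.Add(0x0A); break;
                case 'r': bytes.Add(0x0D); break;
                case 't': bytes.Add(0x09); break;
                case 'b': bytes.Add(0x08); break;
                case 'f': bytes.Add(0x0C); break;
                default: bytes.Add((byte)c); break; // \( \) \\ and unknown escapes
            }
        }
        return bytes.ToArray();
    }

    // Groups the raw bytes into big-endian two-byte codes (assumption: 2 bytes per character).
    public static int[] ToTwoByteCodes(byte[] raw)
    {
        var codes = new int[raw.Length / 2];
        for (int i = 0; i < codes.Length; i++)
            codes[i] = (raw[2 * i] << 8) | raw[2 * i + 1];
        return codes;
    }
}
```

For the string `(\000\031\000\027\0004)` from the dump this gives the codes 0x0019, 0x0017 and 0x0034; the trailing `4` is a literal character, not part of the octal escape. Under the two-byte assumption these numbers are font-specific codes rather than Unicode characters, which would explain why they do not read as text.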
This is currently my test code:
var pdfDoc = new PdfADocument(new PdfReader(src), new PdfWriter(dest));
PdfPage page = pdfDoc.GetFirstPage();
PdfDictionary dict = page.GetPdfObject();
PdfObject obj = dict.Get(PdfName.Contents);
PdfArray refs = null;
// /Contents is either an array of streams or a single stream.
if (obj.IsArray())
{
    refs = dict.GetAsArray(PdfName.Contents);
}
else if (obj.IsIndirect())
{
    refs = new PdfArray(obj);
}
for (int i = 0; i < refs.Count(); i++)
{
    try
    {
        PdfStream stream = (PdfStream)refs.Get(i);
        // true = return the decoded (e.g. inflated) stream bytes
        byte[] data = stream.GetBytes(true);
        //var x = DecodeFromUtf8(ByteArrayToString(data));
        Console.WriteLine(ByteArrayToString(data));
        //This is just a test
        String replacedData = ByteArrayToString(data).Replace("the", "abc");
        stream.SetData(StringToByteArray(replacedData));
    }
    catch (Exception ex)
    {
        // Report which stream failed and why instead of silently swallowing the error.
        Console.WriteLine("i = " + i + ": " + ex.Message);
    }
}
// Closing the document writes the result to dest.
pdfDoc.Close();
// String <-> byte[] converters
private string ByteArrayToString(byte[] arr)
{
    var enc = new System.Text.UTF8Encoding();
    return enc.GetString(arr);
}

private byte[] StringToByteArray(string str)
{
    var enc = new System.Text.UTF8Encoding();
    return enc.GetBytes(str);
}
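One thing worth checking in the converters above: a content stream is binary data, and a UTF-8 round trip does not preserve every byte value, so any byte of 0x80 or above may come back changed. A minimal sketch of a byte-preserving alternative using ISO-8859-1 (Latin-1), which maps each byte to exactly one character; the class and method names here are made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text;

public static class ByteRoundTrip
{
    // ISO-8859-1 maps byte values 0x00..0xFF one-to-one onto U+0000..U+00FF.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    public static string BytesToString(byte[] data)
    {
        return Latin1.GetString(data);
    }

    public static byte[] StringToBytes(string text)
    {
        return Latin1.GetBytes(text);
    }

    // Returns true only if decoding and re-encoding as UTF-8 gives back the same bytes.
    public static bool SurvivesUtf8(byte[] data)
    {
        byte[] roundTripped = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));
        return roundTripped.SequenceEqual(data);
    }
}
```

This only keeps the bytes intact while the stream is edited as a string. It does not by itself make the two-byte character codes searchable as plain text.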
The string returned by ByteArrayToString(stream.GetBytes(true))
looks like this:
q/F_0 56 Tf BT 126.25 687.12 TD[(\0\u0002)6(\0\u0003)4(\0\u0004)]TJ 1.1499 0 0 1 215.75 687.12 Tm(\0\u0001)Tj 1.0093 0 0 1 231.85 687.12 Tm(\0\u0005\0\u0006\0\a\0\b\0\t\0\b\0\v)Tj/G cs 149.30 0 TD(\0\u0001)Tj 1.0428 0 0 1 33.100 638.42 Tm(\0\u0003\0\f\0\u000e)Tj 1.1499 0 0 1 120.70 638.42 Tm(\0\u0001)Tj 1.0323 0 0 1 136.80 638.42 Tm[(\0\u000f\0\u0010\0\b\0\t\0\u0011\0\u0010)2(\0\u0004\0\u0012\0\u0013\0\b\0\u0012\0\u0013)2(\0\t\0\u0010)]TJ/F_0 22 Tf 1.0193 0 0 1 55.900 572.62 Tm[(\0\u0014\0\u0010\0\b\0\a\0\u0015\0\u0011)5(\0\u0010)]TJ 1.2727 0 0 1 124.30 572.62 Tm(\0\u0001)Tj 1.0338 0 0 1 131.30 572.62 Tm(\0\u0016\0\u0003\0\u0017)Tj 1.0909 0 0 1 160.30 572.62 Tm(\0\u0001)Tj 1.0387 0 0 1 166.30 572.62 ...
And this is what it looks like in the console window:
q/F_0 56 Tf BT 126.25 687.12 TD[(?)6(?)4(?)]TJ 1.1499 0 0 1 215.75 687.12 Tm(?)Tj 1.0093 0 0 1 231.85 687.12 Tm(????? ??)Tj/G cs 149.30 0 TD(?)Tj 1.0428 0 0 1 33.100 638.42 Tm(???)Tj 1.1499 0 0 1 120.70 638.42 Tm(?)Tj 1.0323 0 0 1 13 ...
So I think I need a filter or decoder to "decode" the text, make my replacement, and then convert it back again. Could someone give me a clue as to what I'm doing wrong? I have never really worked with PDFs or iText before.
It works with "simple" PDFs where the text is not "encrypted" the way it is in this one.
I can't share my PDF here, but I will try to find a similar one.