I recently discovered iTextSharp.

I was investigating a performance issue with the rendering of PDF documents, and Bruno Lowagie (author of iText) kindly explained the cause: the large number of "Inline Images" in my PDF documents. He also explained the basics of removing those "Inline Images". (My purpose is to possibly show a preview of the document, with a clear notice that it is not the actual document and that the actual one could be very slow to open. I clearly understand that what I am trying to do is far from robust or safe. The problem must be solved at another level, e.g. when generating the documents.)

Unfortunately, I have not managed to implement the clean-up on my own. Here is the code I currently have (inspired by various samples found on Stack Overflow):

PdfReader pdfReader = new PdfReader(filename);
try
{  
    //pdfReader.RemoveUnusedObjects();

    var cleanfilename = filename.Replace(".pdf", ".clean.pdf");
    if (File.Exists(cleanfilename))
        File.Delete(cleanfilename);

    using (var file = new FileStream(cleanfilename, FileMode.Create))
    {
        var pdfstamper = new PdfStamper(pdfReader, file);

        for (var page = 1; page <= pdfReader.NumberOfPages; page++)
        {    
            PdfDictionary pageDict = pdfReader.GetPageN(page);
            PdfObject pageObj = pageDict.GetDirectObject(PdfName.CONTENTS);
            if (pageObj.IsStream())
            {
                CleanStream(pageObj);
            }
            else if (pageObj.IsArray())
            {
                PdfArray pageArray = pageDict.GetAsArray(PdfName.CONTENTS);

                for (int j = 0; j < pageArray.Size; j++)
                {
                    PdfIndirectReference arrayElement = (PdfIndirectReference)pageArray[j];
                    pageObj = pdfReader.GetPdfObject(arrayElement.Number);
                    if (pageObj.IsStream())
                    {
                        CleanStream(pageObj);
                    }
                }
            }
        }

        pdfstamper.Close();
    }
}
catch (Exception ex)
{
    MessageBox.Show("Error: " + ex.Message, "Error");
}
finally
{
    pdfReader.Close();
}

and

Regex regEx = new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled);

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newContent = regEx.Replace(currentContent, "");
    var newData = Encoding.ASCII.GetBytes(newContent);

    stream.SetData(newData);
}

It works fine on PDFs without Inline Images. But text disappears from pages that contain Inline Images.

I thought the problem was with the replacement, but as far as I can tell that is not the case. With the following code (a kind of pass-through), the output document is fine:

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    stream.SetData(data);
}

However, with the following code, which in theory does not change any byte (or does it?), the output document no longer displays correctly (some content seems not to be rendered):

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);

    var currentContent = Encoding.ASCII.GetString(data);    
    var newData = Encoding.ASCII.GetBytes(currentContent);

    stream.SetData(newData);
}

It looks like converting the byte array into a string and back into a byte array is not a "transparent" operation.
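To make that suspicion concrete, here is a minimal sketch, written in Java rather than C# and independent of iTextSharp, of a decode/encode round trip. The class and method names are made up for illustration. It assumes the stream contains bytes above 0x7F, as the binary data of an inline image may. In Java, US-ASCII replaces such bytes with `?` on the way back, and .NET's `Encoding.ASCII` substitutes `?` in a similar way, whereas ISO-8859-1 maps every byte value to exactly one character and back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // Decode the bytes with the given charset, then encode the string back.
    static byte[] roundTrip(byte[] data, Charset charset) {
        return new String(data, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // 'B', 'I', then two bytes above 0x7F (hypothetical binary image data).
        byte[] data = { 0x42, 0x49, (byte) 0x80, (byte) 0xFF };

        byte[] viaAscii = roundTrip(data, StandardCharsets.US_ASCII);
        byte[] viaLatin1 = roundTrip(data, StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(data, viaAscii));  // prints false
        System.out.println(Arrays.equals(data, viaLatin1)); // prints true
    }
}
```

If this is indeed what happens here, the high bytes of the image data are rewritten as `?`, which would explain why even the "unchanged" stream no longer renders correctly.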

I really don't get it. Then again, I know I am a real beginner regarding PDF. What am I missing?

This is not at all critical (I don't really care if I can't succeed in removing those inline images). But I am now really curious about understanding what's happening :D

Here is a sample PDF: https://drive.google.com/file/d/0Byqch0ZyIb5DWDdmSTJ3SDMxMW8/edit?usp=sharing

Valery Letroye
  • I am actually unsure about the encoding of the PDF. Is it really ASCII? I tried UTF8 without much success. Just in case, I am now trying to work on the bytes without converting them, and the results seem promising. I will post my code here as soon as possible. – Valery Letroye Jul 13 '14 at 20:06

2 Answers

As you've found out and as mkl and I pointed out in the comments, it's not a good idea to manipulate a content stream without taking a look at every operator in the stream. You really need to parse the syntax and interpret every single operator and every single operand.

Please take a look at the OCG removing functionality in the extra jar that is provided with iText in the com.itextpdf.text.pdf.ocg/ package.

In the OCGParser class, we define all possible operators:

protected void populateOperators() {
    if (operators != null)
        return;
    operators = new HashMap<String, PdfOperator>();
    operators.put(DEFAULTOPERATOR, new CopyContentOperator());
    PathConstructionOrPaintingOperator opConstructionPainting = new PathConstructionOrPaintingOperator();
    operators.put("m", opConstructionPainting);
    operators.put("l", opConstructionPainting);
    operators.put("c", opConstructionPainting);
    operators.put("v", opConstructionPainting);
    operators.put("y", opConstructionPainting);
    operators.put("h", opConstructionPainting);
    operators.put("re", opConstructionPainting);
    operators.put("S", opConstructionPainting);
    operators.put("s", opConstructionPainting);
    operators.put("f", opConstructionPainting);
    operators.put("F", opConstructionPainting);
    operators.put("f*", opConstructionPainting);
    operators.put("B", opConstructionPainting);
    operators.put("B*", opConstructionPainting);
    operators.put("b", opConstructionPainting);
    operators.put("b*", opConstructionPainting);
    operators.put("n", opConstructionPainting);
    operators.put("W", opConstructionPainting);
    operators.put("W*", opConstructionPainting);
    GraphicsOperator graphics = new GraphicsOperator();
    operators.put("q", graphics);
    operators.put("Q", graphics);
    operators.put("w", graphics);
    operators.put("J", graphics);
    operators.put("j", graphics);
    operators.put("M", graphics);
    operators.put("d", graphics);
    operators.put("ri", graphics);
    operators.put("i", graphics);
    operators.put("gs", graphics);
    operators.put("cm", graphics);
    operators.put("g", graphics);
    operators.put("G", graphics);
    operators.put("rg", graphics);
    operators.put("RG", graphics);
    operators.put("k", graphics);
    operators.put("K", graphics);
    operators.put("cs", graphics);
    operators.put("CS", graphics);
    operators.put("sc", graphics);
    operators.put("SC", graphics);
    operators.put("scn", graphics);
    operators.put("SCN", graphics);
    operators.put("sh", graphics);
    XObjectOperator xObject = new XObjectOperator();
    operators.put("Do", xObject);
    InlineImageOperator inlineImage = new InlineImageOperator();
    operators.put("BI", inlineImage);
    operators.put("EI", inlineImage);
    TextOperator text = new TextOperator();
    operators.put("BT", text);
    operators.put("ID", text);
    operators.put("ET", text);
    operators.put("Tc", text);
    operators.put("Tw", text);
    operators.put("Tz", text);
    operators.put("TL", text);
    operators.put("Tf", text);
    operators.put("Tr", text);
    operators.put("Ts", text);
    operators.put("Td", text);
    operators.put("TD", text);
    operators.put("Tm", text);
    operators.put("T*", text);
    operators.put("Tj", text);
    operators.put("'", text);
    operators.put("\"", text);
    operators.put("TJ", text);
    MarkedContentOperator markedContent = new MarkedContentOperator();
    operators.put("BMC", markedContent);
    operators.put("BDC", markedContent);
    operators.put("EMC", markedContent);
}

The parse() method will look at all the content streams, including the content streams of Form XObjects (which you are overlooking if I understand your code correctly).

In the process() method, we make copies of every operator and all its operands, unless some condition tells us that part of the syntax needs to be removed.

You should adapt this code so that all operators are copied, except those that involve inline images. Your approach was a brute-force approach that was bound to damage more PDFs than it would ever fix.
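As a small illustration of how pattern-based removal can damage a stream, the following sketch applies a regular expression in the spirit of the one from the question to a content stream that contains no inline image at all. It is plain Java and does not use iText. The class name, the sample content, and the `DOTALL` flag (added so that `.` also spans line breaks) are assumptions made for this example.

```java
import java.util.regex.Pattern;

class NaiveInlineImageStripper {
    // Same idea as the pattern in the question; DOTALL is an added assumption.
    static final Pattern BI_EI = Pattern.compile("\\nBI.*?\\nEI", Pattern.DOTALL);

    // Removes everything between a line starting with "BI" and one starting with "EI".
    static String strip(String content) {
        return BI_EI.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // A text block whose string literal happens to contain lines
        // starting with "BI" and "EI" (hypothetical content).
        String content = "BT\n/F1 12 Tf\n(Line one\nBIG news\nEIGHT items) Tj\nET\n";
        System.out.println(strip(content));
    }
}
```

Here the shown string `(Line one ... EIGHT items)` is cut down to `(Line oneGHT items)`, although no operator in the stream is an inline image operator. A parser that tracks strings, operands and operators would not make this mistake.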

Bruno Lowagie
  • OK, I had a look and I think I see how to proceed. Unfortunately, while the code works fine on PDFs without Inline Images, something always goes wrong when I run it on a PDF with such objects. Here is a dedicated thread about my new issues: http://stackoverflow.com/questions/24867577/itext-something-goes-wrong-when-parsing-the-content-pdf-with-inline-images – Valery Letroye Jul 21 '14 at 14:27
Instead of working on strings, I now work directly on the bytes:

private void CleanStream(PdfObject obj)
{
    var stream = (PRStream)obj;
    var data = PdfReader.GetStreamBytes(stream);
    var workingData = new byte[data.Length];

    var BI = Encoding.ASCII.GetBytes("\nBI");
    var EI = Encoding.ASCII.GetBytes("\nEI");

    var len = EI.Length - 1;
    var BIpos = data.Locate(BI);
    var EIpos = data.Locate(EI);
    var pos = BIpos.Length;
    if (pos != EIpos.Length)
        throw new Exception("BI and EI operators not matching ?!");

    var skip = 0;
    var newI = 0;
    for (var i = 0; i < data.Length; i++)
    {
        if (skip >= pos || i < BIpos[skip])
        {
            workingData[newI] = data[i];
            newI++;
        }
        else if (i >= EIpos[skip] + len)
            skip++;
    }

    var newData = new byte[newI];
    Array.Copy(workingData, newData, newI);

    stream.SetData(newData);
}

"Locate" is the extension method suggested here: byte[] array pattern search
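Since the linked answer is not reproduced here, the following is only a sketch of what such a helper presumably does, written in Java with a hypothetical class name: a naive scan that returns the start index of every occurrence of a byte pattern. It is not the original C# extension method.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ByteSearch {
    // Returns the start index of every occurrence of pattern in data (naive scan).
    static int[] locate(byte[] data, byte[] pattern) {
        if (pattern.length == 0 || pattern.length > data.length) {
            return new int[0];
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i <= data.length - pattern.length; i++) {
            int j = 0;
            while (j < pattern.length && data[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                hits.add(i); // full match starting at i
            }
        }
        int[] result = new int[hits.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = hits.get(k);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "q\nBI /W 1 ID x\nEI\nQ\nBI".getBytes(StandardCharsets.US_ASCII);
        byte[] bi = "\nBI".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Arrays.toString(locate(data, bi))); // prints [1, 19]
    }
}
```

Note that the sample data contains two `BI` markers but only one `EI`, which is the kind of mismatch the length check in the code above throws on.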

Any comment on this solution is welcome!

Valery Letroye
  • Are you sure **BI** and **EI** need to be preceded by a line break? Furthermore, you completely ignore the structure of the stream. What about e.g. displayed strings containing a *BI* or *EI*? – mkl Jul 13 '14 at 23:32
  • mkl is right: searching for `BI` and `EI` and removing everything inside is wrong. You need to parse the syntax in a decent way, because not all occurrences of `BI` are actually the Begin Image operator, nor are all occurrences of `EI` instances of the End Image operator. – Bruno Lowagie Jul 14 '14 at 05:57
  • I will try something like the OCG Remover for learning purposes. But I wanted the fastest method, not the safest. With several thousand Inline Images, some PDFs (1% of the library) are too slow to be rendered on a Citrix server by my embedded viewer. The idea was therefore to create, on the fly, a "quick & dirty preview" of those PDFs and to notify the user that the document has to be opened with an external viewer. With the code above and additional tests such as BIpos(i) < EIpos(I) and len > 2500, it is IMO safe enough for my own, specific purpose. My viewer runs fine with < 2500 Inline Images. – Valery Letroye Jul 15 '14 at 19:31
  • FYI: I actually only check how many images there are in the PDF. It takes only a few hundred milliseconds, which is "acceptable". As soon as there are more than 2500 inline images, I display a dummy PDF asking the user to click to open the document in an external viewer. I have not yet implemented the clean-up imagined above. The goal is to find out how to "seriously fix" the problem, either on Citrix (currently 4 to 10 times slower than local execution), in the embedded viewer (a third-party component), or within the PDF (possibly converting the existing PDF into a TIFF in a batch process?). – Valery Letroye Jul 15 '14 at 19:36