
This question is a follow-up to my old question: I want to get all objects except text object as an image from PDF using iTextSharp

Original question

I am developing a program that converts PDF to PPTX using iTextSharp. So far I can extract all text objects and image objects together with their locations. However, I am finding it difficult to extract vector drawings (such as tables) without their text. It would actually be better if I could get them as images. My plan is to merge all objects except text objects into a background image and then place the text objects at their proper locations. I looked for similar questions here but have had no luck so far. If anyone knows how to do this, please answer. Thanks.


I have been reading many related questions and discussions and decided to ask another version here. I have two plans left, described below. I would really appreciate it if iText developers/experts could guide me.

Code snippet I'm using for getting text/image objects

public class MyLocationTextExtractionStrategy: IExtRenderListener, ITextExtractionStrategy,IElementListener
{
    //Text 
    public List<RectAndText> myPoints_txt = new List<RectAndText>();
    public List<RectAndImage> myPoints_img = new List<RectAndImage>();
    public FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    public FieldInfo MarkedContentInfosField = typeof(TextRenderInfo).GetField("markedContentInfos", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    public FieldInfo MarkedContentInfoTagField = typeof(MarkedContentInfo).GetField("tag", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    PdfName EMBEDDED_DOCUMENT = new PdfName("EmbeddedDocument");

    //Image 
    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();

    public bool Add(IElement element)
    {
        // Elements are not needed here; accept and ignore them
        return true;
    }

    public void BeginTextBlock()
    {

    }

    public void ClipPath(int rule)
    {

    }

    public void EndTextBlock()
    {

    }

    public string GetResultantText()
    {
        return "";
    }

    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        // ****************************************
        // I think this point I can get info on Path
        // ****************************************
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.GetImage();
            if (image == null) return;

            ImageNames.Add(string.Format(
              "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
            ));

            // Store the raw image bytes
            Images.Add(image.GetImageAsBytes());

            // I31/I32 hold the translation (position) part of the image matrix
            Matrix matrix = renderInfo.GetImageCTM();

            this.myPoints_img.Add(new RectAndImage(matrix[Matrix.I31], matrix[Matrix.I32], matrix[Matrix.I11], matrix[Matrix.I12], Images));
        }
        catch (Exception e)
        {
            // Note: any exception is silently swallowed here
        }
    }



    public iTextSharp.text.pdf.parser.Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        // ****************************************
        // I think this point I can get info on Path
        // ****************************************
        return null;
    }

    public  void RenderText(TextRenderInfo renderInfo)
    {

        DocumentFont _font = renderInfo.GetFont();

        LineSegment descentLine = renderInfo.GetDescentLine();
        LineSegment ascentLine = renderInfo.GetAscentLine();
        float x0 = descentLine.GetStartPoint()[0];
        float x1 = ascentLine.GetEndPoint()[0];
        float y0 = descentLine.GetStartPoint()[1];
        float y1 = ascentLine.GetEndPoint()[1];

        Rectangle rect = new Rectangle(x0,y0,x1,y1);
        GraphicsState gs = (GraphicsState)GsField.GetValue(renderInfo);
        float fontSize = gs.FontSize;
        String font_color = gs.FillColor.ToString().Substring(14,6);

        IList<MarkedContentInfo> markedContentInfos = (IList<MarkedContentInfo>)MarkedContentInfosField.GetValue(renderInfo);

        if (markedContentInfos != null && markedContentInfos.Count > 0)
        {
            foreach (MarkedContentInfo info in markedContentInfos)
            {
                if (EMBEDDED_DOCUMENT.Equals(MarkedContentInfoTagField.GetValue(info)))
                    return;
            }
        }

        this.myPoints_txt.Add(new RectAndText(rect, renderInfo.GetText(), fontSize,renderInfo.GetFont().PostscriptFontName, font_color));
    } 
}

New question

1) Can I remove all text objects from a PDF and write the result to a new PDF? If so, I can render every page of the output as an image and use those images as backgrounds in the PPTX. Then I can finally add the text (already retrieved with the ITextExtractionStrategy code above).

2) If 1) is not possible, I am going to retrieve all path information from the original PDF (using IExtRenderListener) and draw the paths on a new Bitmap. I can then use that as the background and put the text and images on top of it. In that case, is using ModifyPath and RenderPath to retrieve the path information the right way?

I know this may look like multiple questions, but I think it is better to keep everything in a single thread so it is easier to follow. I would really appreciate any tips or comments on my thoughts.

I believe @mkl, @Amine, @Bruno Lowagie could help me. Thanks in advance.

JM217
  • I see my question got some downvotes. Please explain the reason. I'm looking forward to even a word from the community. Thanks! – JM217 Jan 06 '19 at 07:06
  • It would be awesome if you could share a [mcve] of your progress so far. – mjwills Jan 06 '19 at 07:07
  • https://meta.stackexchange.com/questions/39223/one-post-with-multiple-questions-or-multiple-posts – mjwills Jan 06 '19 at 07:07

1 Answer

In my answer to your old question I explained the meanings of those IExtRenderListener callback methods, so essentially the remaining question here is

1) Can I remove all text objects from a PDF and output it to a new one?

You can, by making use of the generic content stream editor class PdfContentStreamEditor from this answer. Simply derive from it like this

class TextRemover : PdfContentStreamEditor
{
    protected override void Write(PdfContentStreamProcessor processor, PdfLiteral operatorLit, List<PdfObject> operands)
    {
        if (!TEXT_SHOWING_OPERATORS.Contains(operatorLit.ToString()))
        {
            base.Write(processor, operatorLit, operands);
        }
    }
    List<string> TEXT_SHOWING_OPERATORS = new List<string> { "Tj", "'", "\"", "TJ" };
}

and use it like this

using (PdfReader pdfReader = new PdfReader(source))
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(dest, FileMode.Create, FileAccess.Write), (char)0, true))
{
    pdfStamper.RotateContents = false;
    PdfContentStreamEditor editor = new TextRemover();

    for (int i = 1; i <= pdfReader.NumberOfPages; i++)
    {
        editor.EditPage(pdfStamper, i);
    }
}

This removes all text drawing instructions from the immediate page content streams, e.g. for the example PDF I used

original

it creates the following output:

all text removed
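
Note that the filter drops only the four text-showing operators; text state and positioning operators such as BT, Tf, Td and ET are still written, they just no longer draw anything. The following is a toy model of that filtering step and does not use iTextSharp at all: the Instruction type, the TextShowingFilter class and the sample instructions are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one parsed content stream instruction
// (operands first, operator last, as in PDF syntax).
public sealed class Instruction
{
    public string Operator { get; }
    public string[] Operands { get; }

    public Instruction(string op, params string[] operands)
    {
        Operator = op;
        Operands = operands;
    }

    public override string ToString()
    {
        return Operands.Length == 0
            ? Operator
            : string.Join(" ", Operands) + " " + Operator;
    }
}

public static class TextShowingFilter
{
    // Same operator list as in TextRemover above.
    static readonly HashSet<string> TextShowingOperators =
        new HashSet<string> { "Tj", "'", "\"", "TJ" };

    // Keeps every instruction whose operator does not show text.
    public static List<Instruction> Apply(IEnumerable<Instruction> instructions)
    {
        return instructions
            .Where(i => !TextShowingOperators.Contains(i.Operator))
            .ToList();
    }
}

public static class Demo
{
    public static void Main()
    {
        var page = new List<Instruction>
        {
            new Instruction("BT"),
            new Instruction("Tf", "/F1", "12"),
            new Instruction("Td", "72", "720"),
            new Instruction("Tj", "(Hello)"),
            new Instruction("ET"),
            new Instruction("re", "50", "50", "200", "100"),
            new Instruction("S"),
        };

        // Everything except the Tj instruction is written back.
        foreach (Instruction i in TextShowingFilter.Apply(page))
        {
            Console.WriteLine(i);
        }
    }
}
```

In this sample the rectangle (re, S) and the text state instructions survive, and only the Tj line disappears.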

Beware, as said above, only the immediate page content streams are changed. For a full solution one has to apply the TextRemover also to the XObjects and Patterns of the pages, and recursively so.
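
As a starting point for that recursion, here is a minimal sketch (assuming iTextSharp 5.x) that only collects the form XObjects reachable from a page's resources, including nested ones. FormXObjectFinder is a hypothetical helper name, not part of iTextSharp or of PdfContentStreamEditor. The sketch does not edit anything and does not cover Patterns; each stream it finds would still have to be run through the TextRemover and have its content replaced.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper: lists the form XObjects a page uses, recursively.
public static class FormXObjectFinder
{
    public static List<PRStream> Find(PdfReader reader, int pageNumber)
    {
        var found = new List<PRStream>();
        var visited = new HashSet<int>();   // object numbers already seen
        PdfDictionary page = reader.GetPageN(pageNumber);
        Collect(page.GetAsDict(PdfName.RESOURCES), found, visited);
        return found;
    }

    static void Collect(PdfDictionary resources, List<PRStream> found, HashSet<int> visited)
    {
        if (resources == null) return;
        PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
        if (xobjects == null) return;

        foreach (PdfName name in xobjects.Keys)
        {
            PRStream stream = xobjects.GetAsStream(name) as PRStream;
            if (stream == null) continue;

            // Image XObjects carry no content stream instructions; skip them.
            if (!PdfName.FORM.Equals(stream.GetAsName(PdfName.SUBTYPE))) continue;

            // Guard against shared or cyclic references.
            if (!visited.Add(stream.ObjNum)) continue;

            found.Add(stream);

            // A form XObject may have its own resources with further XObjects.
            Collect(stream.GetAsDict(PdfName.RESOURCES), found, visited);
        }
    }
}
```

Calling FormXObjectFinder.Find(pdfReader, i) inside the page loop above would give the list of streams that still contain text drawing instructions after EditPage has processed page i.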

mkl