This is a follow-up to my earlier question: I want to get all objects except text object as an image from PDF using iTextSharp
Original question
I am developing a program that converts PDF to PPTX using iTextSharp. So far I can extract all text objects and image objects together with their locations. However, I am having difficulty extracting the vector drawings (such as table borders) without their text. It would actually be better if I could get them as images. My plan is to merge all objects except the text objects into a background image and then place the text objects at their proper locations. I searched for similar questions here but have had no luck so far. If anyone knows how to do this, please answer. Thanks.
I have since read many related questions and discussions and decided to ask a new version here. I have two plans left, described below. I would really appreciate it if iText developers/experts could guide me.
Code snippet I'm using for getting text/image objects
using System;
using System.Collections.Generic;
using System.IO;
using System.Reflection;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// RectAndText and RectAndImage are my own helper classes (not shown here).
public class MyLocationTextExtractionStrategy : IExtRenderListener, ITextExtractionStrategy, IElementListener
{
    // Text
    public List<RectAndText> myPoints_txt = new List<RectAndText>();
    public List<RectAndImage> myPoints_img = new List<RectAndImage>();

    // Non-public members of TextRenderInfo / MarkedContentInfo, accessed via reflection
    public FieldInfo GsField = typeof(TextRenderInfo).GetField("gs", BindingFlags.NonPublic | BindingFlags.Instance);
    public FieldInfo MarkedContentInfosField = typeof(TextRenderInfo).GetField("markedContentInfos", BindingFlags.NonPublic | BindingFlags.Instance);
    public FieldInfo MarkedContentInfoTagField = typeof(MarkedContentInfo).GetField("tag", BindingFlags.NonPublic | BindingFlags.Instance);
    PdfName EMBEDDED_DOCUMENT = new PdfName("EmbeddedDocument");

    // Image
    public List<byte[]> Images = new List<byte[]>();
    public List<string> ImageNames = new List<string>();

    public bool Add(IElement element)
    {
        // Elements are not used; accept and ignore them.
        return true;
    }

    public void BeginTextBlock()
    {
    }

    public void ClipPath(int rule)
    {
    }

    public void EndTextBlock()
    {
    }

    public string GetResultantText()
    {
        // Text is collected in myPoints_txt instead.
        return "";
    }

    public void ModifyPath(PathConstructionRenderInfo renderInfo)
    {
        // ****************************************
        // I think this point I can get info on Path
        // ****************************************
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.GetImage();
            if (image == null) return;

            ImageNames.Add(string.Format(
                "Image{0}.{1}", renderInfo.GetRef().Number, image.GetFileType()
            ));

            // Store the raw image bytes
            Images.Add(image.GetImageAsBytes());

            // The image CTM holds the position (I31, I32) and the scaling (I11, I12)
            Matrix matrix = renderInfo.GetImageCTM();
            this.myPoints_img.Add(new RectAndImage(matrix[Matrix.I31], matrix[Matrix.I32], matrix[Matrix.I11], matrix[Matrix.I12], Images));
        }
        catch (Exception e)
        {
            // Skip images that cannot be read, but do not hide the error completely.
            System.Diagnostics.Debug.WriteLine(e);
        }
    }

    public iTextSharp.text.pdf.parser.Path RenderPath(PathPaintingRenderInfo renderInfo)
    {
        // ****************************************
        // I think this point I can get info on Path
        // ****************************************
        return null;
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Bounding box: from the start of the descent line to the end of the ascent line
        LineSegment descentLine = renderInfo.GetDescentLine();
        LineSegment ascentLine = renderInfo.GetAscentLine();
        float x0 = descentLine.GetStartPoint()[0];
        float x1 = ascentLine.GetEndPoint()[0];
        float y0 = descentLine.GetStartPoint()[1];
        float y1 = ascentLine.GetEndPoint()[1];
        Rectangle rect = new Rectangle(x0, y0, x1, y1);

        GraphicsState gs = (GraphicsState)GsField.GetValue(renderInfo);
        float fontSize = gs.FontSize;
        // Takes the RRGGBB part of the color's string form; this depends on the ToString() format.
        String font_color = gs.FillColor.ToString().Substring(14, 6);

        // Skip text that belongs to an embedded document
        IList<MarkedContentInfo> markedContentInfos = (IList<MarkedContentInfo>)MarkedContentInfosField.GetValue(renderInfo);
        if (markedContentInfos != null && markedContentInfos.Count > 0)
        {
            foreach (MarkedContentInfo info in markedContentInfos)
            {
                if (EMBEDDED_DOCUMENT.Equals(MarkedContentInfoTagField.GetValue(info)))
                    return;
            }
        }

        this.myPoints_txt.Add(new RectAndText(rect, renderInfo.GetText(), fontSize, renderInfo.GetFont().PostscriptFontName, font_color));
    }
}
New question
1) Can I remove all text objects from a PDF and write the result to a new PDF? If so, I can render every page of the output as an image and use those images as the slide backgrounds of the PPTX. Then I can place the text (already retrieved with the ITextExtractionStrategy code above) on top.
2) If 1) is not possible, I plan to retrieve all path information from the original PDF (using IExtRenderListener) and draw it onto a new Bitmap. I can then use that bitmap as the background and place the text and images on top of it. In that case, are ModifyPath and RenderPath the right place to retrieve the path information?
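For plan 2, the part I am least sure about is mapping path coordinates onto the bitmap. Below is a minimal, untested sketch of what I have in mind. PdfToBitmap, ApplyCtm and ToPixel are hypothetical names of my own, not iTextSharp API. It assumes the page's lower-left corner is at (0, 0), the page is not rotated, and the transformation matrix is given in the PDF order [a b c d e f].

```csharp
using System;

// Hypothetical helper (not part of iTextSharp): maps path coordinates to bitmap pixels.
public static class PdfToBitmap
{
    // Applies a PDF transformation matrix [a b c d e f] to a point:
    // x' = a*x + c*y + e, y' = b*x + d*y + f.
    public static void ApplyCtm(float[] ctm, float x, float y, out float tx, out float ty)
    {
        tx = ctm[0] * x + ctm[2] * y + ctm[4];
        ty = ctm[1] * x + ctm[3] * y + ctm[5];
    }

    // Converts a point in default user space (1 unit = 1/72 inch, y axis pointing up)
    // to bitmap pixels (origin at the top-left corner, y axis pointing down).
    // Assumes the page origin is at (0, 0) and the page is not rotated.
    public static void ToPixel(float xPt, float yPt, float pageHeightPt, float dpi,
                               out float px, out float py)
    {
        float scale = dpi / 72f;
        px = xPt * scale;
        py = (pageHeightPt - yPt) * scale;
    }
}
```

My understanding is that the six values would come from the current transformation matrix as I11, I12, I21, I22, I31, I32, and that each coordinate pair collected in ModifyPath would go through ApplyCtm first and then ToPixel before being drawn. Please correct me if that assumption is wrong.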
I know this post contains more than one question, but I think keeping everything in a single thread makes the context easier to follow. I would really appreciate any tips or comments on these ideas.
I believe @mkl, @Amine, @Bruno Lowagie could help me. Thanks in advance.