I recently discovered iTextSharp.
I was investigating a performance issue with the rendering of PDF documents, and Bruno Lowagie (author of iText) kindly explained the cause: the large number of "inline images" in my PDF documents. He also explained the basics of removing those inline images. (My goal is to possibly show a preview of the document, with a clear notice that it is not the actual document and that the actual one could be very slow to open. I fully understand that what I am trying to do is far from robust or safe; the problem should really be solved at another level, e.g. when the documents are generated.)
Unfortunately, I haven't managed to implement the clean-up on my own :/ Here is the code I currently have (inspired by various samples found on Stack Overflow):
    PdfReader pdfReader = new PdfReader(filename);
    try
    {
        //pdfReader.RemoveUnusedObjects();
        var cleanfilename = filename.Replace(".pdf", ".clean.pdf");
        if (File.Exists(cleanfilename))
            File.Delete(cleanfilename);
        using (var file = new FileStream(cleanfilename, FileMode.Create))
        {
            var pdfstamper = new PdfStamper(pdfReader, file);
            for (var page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                PdfDictionary pageDict = pdfReader.GetPageN(page);
                PdfObject pageObj = pageDict.GetDirectObject(PdfName.CONTENTS);
                if (pageObj.IsStream())
                {
                    CleanStream(pageObj);
                }
                else if (pageObj.IsArray())
                {
                    PdfArray pageArray = pageDict.GetAsArray(PdfName.CONTENTS);
                    for (int j = 0; j < pageArray.Size; j++)
                    {
                        PdfIndirectReference arrayElement = (PdfIndirectReference)pageArray[j];
                        pageObj = pdfReader.GetPdfObject(arrayElement.Number);
                        if (pageObj.IsStream())
                        {
                            CleanStream(pageObj);
                        }
                    }
                }
            }
            pdfstamper.Close();
        }
    }
    catch (Exception ex)
    {
        MessageBox.Show("Error: " + ex.Message, "Error");
    }
    finally
    {
        pdfReader.Close();
    }
and
    Regex regEx = new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled);

    private void CleanStream(PdfObject obj)
    {
        var stream = (PRStream)obj;
        var data = PdfReader.GetStreamBytes(stream);
        var currentContent = Encoding.ASCII.GetString(data);
        var newContent = regEx.Replace(currentContent, "");
        var newData = Encoding.ASCII.GetBytes(newContent);
        stream.SetData(newData);
    }
It works fine on PDFs without inline images. But text disappears from pages that do contain inline images.
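As a side check on the pattern itself, here is a minimal sketch that runs it against a made-up content-stream fragment (the fragment and the class name `InlineImageRegexDemo` are assumptions for illustration, not taken from the sample PDF). In .NET, `.` does not match `\n` unless `RegexOptions.Singleline` is set, so the pattern as written may never match an inline image that spans several lines. Even with `Singleline`, a regex is only a rough approximation, since binary image data could itself contain the bytes `\nEI`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InlineImageRegexDemo
{
    // Same pattern as in the question, without and with RegexOptions.Singleline.
    public static readonly Regex Original =
        new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled);
    public static readonly Regex SingleLine =
        new Regex("\\nBI.*?\\nEI", RegexOptions.Compiled | RegexOptions.Singleline);

    // Hypothetical fragment: one inline image spread over several lines, then some text.
    public const string Sample =
        "q\nBI\n/W 1 /H 1 /BPC 8 /CS /G\nID x\nEI\nQ\nBT (Hello) Tj ET";

    public static void Main()
    {
        // '.' stops at '\n' by default, so the original pattern finds nothing here.
        Console.WriteLine("Original matches:   " + Original.IsMatch(Sample));
        // With Singleline, '.' also matches '\n' and the BI ... EI block is removed.
        Console.WriteLine("SingleLine matches: " + SingleLine.IsMatch(Sample));
        Console.WriteLine(SingleLine.Replace(Sample, ""));
    }
}
```

Whether the inline images in the real document are laid out like this fragment is not something this sketch can confirm.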
I thought the problem was with the replacement, but as far as I can tell it is not. With the following code (a kind of pass-through), the output document is OK:
    private void CleanStream(PdfObject obj)
    {
        var stream = (PRStream)obj;
        var data = PdfReader.GetStreamBytes(stream);
        stream.SetData(data);
    }
However, with the following code, which in theory does not change any byte (or does it?), the output document no longer displays correctly (some content seems not to be rendered):
    private void CleanStream(PdfObject obj)
    {
        var stream = (PRStream)obj;
        var data = PdfReader.GetStreamBytes(stream);
        var currentContent = Encoding.ASCII.GetString(data);
        var newData = Encoding.ASCII.GetBytes(currentContent);
        stream.SetData(newData);
    }
It looks like converting the byte array into a string and back into a byte array is not a "transparent" operation.
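To check that suspicion outside of iTextSharp, here is a small stand-alone sketch (the class name `RoundTripDemo` and the sample bytes are made up for illustration). It round-trips a byte array through `Encoding.ASCII` and, for comparison, through ISO-8859-1, which maps each byte 0x00..0xFF to one character. ASCII only covers 0x00..0x7F, so higher bytes, which the binary data of an inline image would presumably contain, should come back as `?` (0x3F).

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decode the bytes to a string, then encode the string back with the same encoding.
    public static byte[] RoundTrip(byte[] data, Encoding encoding)
    {
        return encoding.GetBytes(encoding.GetString(data));
    }

    public static void Main()
    {
        // Made-up sample: ASCII operators around two "binary" bytes above 0x7F.
        byte[] data = { (byte)'B', (byte)'I', 0x0A, 0x89, 0xFF, 0x0A, (byte)'E', (byte)'I' };

        byte[] viaAscii = RoundTrip(data, Encoding.ASCII);
        byte[] viaLatin1 = RoundTrip(data, Encoding.GetEncoding("ISO-8859-1"));

        // ASCII cannot represent 0x89 or 0xFF, so those bytes are replaced.
        Console.WriteLine("ASCII lossless:   " + data.SequenceEqual(viaAscii));
        // ISO-8859-1 is a one-to-one mapping between bytes and characters.
        Console.WriteLine("Latin-1 lossless: " + data.SequenceEqual(viaLatin1));
        Console.WriteLine("ASCII result:     " + BitConverter.ToString(viaAscii));
    }
}
```

If this is what happens to the content stream, every byte above 0x7F would be silently rewritten, which could explain the damaged output. Whether a single-byte encoding is enough to make the clean-up itself safe is a separate question that this sketch does not answer.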
I really don't get it! On the other hand, I know I am a real beginner with PDF. What am I missing?
This is not at all critical (I don't really mind if I can't remove those inline images), but I am now really curious to understand what's happening :D
Here is a sample PDF: https://drive.google.com/file/d/0Byqch0ZyIb5DWDdmSTJ3SDMxMW8/edit?usp=sharing