I have a C# program that uses iText7 to create a PDF consisting exclusively of images. Here's an example:

public void fromJPEGFileToPDF(String[] pathFiles, string path)
        {
            PdfWriter writer = new PdfWriter(path);
            PdfDocument pdfDoc = new PdfDocument(writer);
            Document doc = new Document(pdfDoc);

            for (int i = 0; i < pathFiles.Length; i++)
            {
                try
                {
                    ImageData imageData = ImageDataFactory.Create(FileToByte(pathFiles[i]));
                    ImagePDF imgTemp = new ImagePDF(imageData);
                    doc.Add(imgTemp);
                }
                catch (Exception e)
                {
                    //DO SOMETHING
                }
            }
            doc.Close();
        }

I also have another function that extracts the images back out of the PDF file, but it doesn't work the way I'd like: I often get images that don't open. To be fair, I usually get the proper image(s) I put in the PDF, plus some small, empty-looking files that don't look like image files at all. And lately I've had some PDFs from which no images could be extracted at all. Can I ask for your help? Here's my code:

public void fromPDFtoJPEGFiles(string pathFolder, string filename )
        {
            PdfDocument pdfDoc = new PdfDocument(new PdfReader(pathFolder+ "\\" + filename + ".pdf"));
            PdfObject obj;
            List<int> streamLengths = new List<int>();
            for (int i = 1; i <= pdfDoc.GetNumberOfPdfObjects(); i++)
            {
                obj = pdfDoc.GetPdfObject(i);
                if (obj != null && obj.IsStream())
                {
                    byte[] b;
                    try
                    {
                        b = ((PdfStream)obj).GetBytes();
                    }
                    catch (PdfException exc)
                    {
                        b = ((PdfStream)obj).GetBytes(false);
                    }
                    MemoryStream fos = new MemoryStream(b);
                    FileStream file = new FileStream(pathFolder+ filename + "(" + (i + 1) + ").jpg", FileMode.Create, System.IO.FileAccess.Write);
                    fos.WriteTo(file);

                    streamLengths.Add(b.Length);
                    fos.Close();
                    file.Close();
                }
            }
            pdfDoc.Close();
        }
shinzant
  • Your image extraction code is incorrect: not every stream in a PDF is an image; of the streams that do contain image data, not all contain a JPEG; and even the streams that contain a JPEG also carry some extra data that is probably required to display the JPEG correctly. Unfortunately, you didn't share the PDF in question, so it's difficult to tell which exact problem you ran into. – mkl Nov 18 '20 at 15:40
  • You might want to try executing the code from [this answer](https://stackoverflow.com/a/59738579/1729265) – mkl Nov 18 '20 at 17:31
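
To illustrate the first two points in the comment above, here is a minimal sketch (not the code from the linked answer) that keeps the question's object loop but only writes streams whose /Subtype is /Image, and lets iText pick the file extension instead of always using .jpg. The class name PdfImageExtractor, the method name ExtractImages, and the "image-N" file naming are hypothetical; the sketch assumes an iText 7 version in which PdfImageXObject offers GetImageBytes() and IdentifyImageFileExtension().

```csharp
using System;
using System.IO;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Xobject;

public static class PdfImageExtractor
{
    // Hypothetical helper (not from the question): writes every image XObject
    // found in the PDF to outputFolder and returns how many files were written.
    public static int ExtractImages(string pdfPath, string outputFolder)
    {
        int written = 0;
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(pdfPath));
        try
        {
            for (int i = 1; i <= pdfDoc.GetNumberOfPdfObjects(); i++)
            {
                PdfObject obj = pdfDoc.GetPdfObject(i);
                if (obj == null || !obj.IsStream())
                {
                    continue;
                }

                PdfStream stream = (PdfStream)obj;
                // Skip everything that is not an image XObject
                // (page content streams, fonts, metadata, ...).
                if (!PdfName.Image.Equals(stream.GetAsName(PdfName.Subtype)))
                {
                    continue;
                }

                try
                {
                    PdfImageXObject image = new PdfImageXObject(stream);
                    byte[] bytes = image.GetImageBytes();
                    // Not every embedded image is a JPEG, so ask iText for the extension.
                    string extension = image.IdentifyImageFileExtension();
                    string target = Path.Combine(outputFolder, "image-" + i + "." + extension);
                    File.WriteAllBytes(target, bytes);
                    written++;
                }
                catch (Exception e)
                {
                    // Assumption: images that iText cannot export are simply skipped.
                    Console.Error.WriteLine("Object " + i + " skipped: " + e.Message);
                }
            }
        }
        finally
        {
            pdfDoc.Close();
        }
        return written;
    }
}
```

A call such as PdfImageExtractor.ExtractImages(pathFolder + "\\" + filename + ".pdf", pathFolder) would replace the body of fromPDFtoJPEGFiles. Note that mask images are stored as image XObjects too, so PDFs built from images with transparency may still yield extra files.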

1 Answer

There are a couple of things in this thread that might help: "How to extract images from pdf using iText7 c#"

The use of -1 in iteration might be the key? Not sure.

I played with parsing PDFs a while back. It is a pain and not for the weary. For images, we used this example to set up the POC. It is not iText7.

Gregory A Beamer