I have a PDF with four pages: two images on the first page, one on the second, and one on the third. When I retrieve the value of the image on the second page or the fourth, I get a negative height. I tried setting it to its absolute value as a quick fix, but the Y position of the image was still slightly off. The height and positioning on page three were fine.

Update: So far, this only seems to be a problem with PDFs created in Google Docs.

My code to extract the PDF images was taken from the thread "Using iText 7, what's the proper way to export a Flate encoded image?".

This is how I access the height

var currentPDFImageInfo = extractedImages[i];
var currentPDFImageMatrix = currentPDFImageInfo.RenderInfo.GetImageCtm();
float pdfImageWidth = currentPDFImageMatrix.Get(iText.Kernel.Geom.Matrix.I11);

How I retrieve the PDF image data

public static List<PDFImageInfo> ExtractImagesFromPDF(string filePath)
{
    Reader = new PdfReader(filePath);
    Document = new PdfDocument(Reader);

    var strategy = new ImageRenderListener();
    PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
    for (int pageNumber = 1; pageNumber <= Document.GetNumberOfPages(); pageNumber++)
    {
        strategy.CurrentPageNumber = pageNumber;
        parser.ProcessPageContent(Document.GetPage(pageNumber));
    }
    return strategy.ImageInfoList;
}

And of course the Strategy class

public class ImageRenderListener : IEventListener
{
    public void EventOccurred(IEventData data, EventType type)
    {
        if (data is ImageRenderInfo imageData)
        {
            try
            {
                if (imageData.GetImage() == null)
                {
                    Console.WriteLine("Image could not be read.");
                }
                else
                {
                    var pdfImageInfo = new PDFImageInfo(CurrentPageNumber, imageData);
                    ImageInfoList.Add(pdfImageInfo);
                }
            }
            catch (Exception ex)
            {
                Console.WriteLine("Image could not be read: {0}.", ex.Message);
            }
        }
    }

    public ICollection<EventType> GetSupportedEvents()
    {
        return null;
    }

    public int CurrentPageNumber { get; set; }
    public List<PDFImageInfo> ImageInfoList { get; set; } = new List<PDFImageInfo>();
}
Tim

1 Answer

This is how I access the height

var currentPDFImageInfo = extractedImages[i];
var currentPDFImageMatrix = currentPDFImageInfo.RenderInfo.GetImageCtm();
float pdfImageWidth = currentPDFImageMatrix.Get(iText.Kernel.Geom.Matrix.I11);

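This value (I11) corresponds to a dimension of the displayed image only under certain circumstances; for an upright image it is the width, and the height is I22.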

Some background: The contents of a PDF page are drawn by a sequence of instructions in a content stream. Some of these instructions manipulate the so-called current transformation matrix (CTM), which represents an affine transformation, i.e. some combination of rotation, translation, scaling, mirroring, and skewing. Everything the other instructions draw is transformed by the CTM value in effect at the time the respective instruction is executed.

When a bitmap image is drawn, it is conceptually first reduced to a 1×1 square which then is transformed by the CTM to the final form of the image on the page.

If the image is displayed upright, with no rotation or anything else involved, then indeed the I11 value is the width of the displayed image and the I22 value is the height. The I12 and I21 values are 0 in that case.

But often bitmaps are displayed rotated by 90° clockwise or counterclockwise (e.g. because someone held the camera at a 90° angle while shooting). In these cases I11 and I22 are 0 while I12 and I21 are the height and width respectively, with one or the other having a negative sign depending on the direction of the rotation.

If the bitmap is rotated by 180°, I11 and I22 again contain width and height, but both with a negative sign. If it's mirrored along the x axis or the y axis, one of them is negative.

And if the transformation is something else, e.g. a rotation by an angle that's not a multiple of 90°, finding the height and width becomes more complicated.

Actually then it is not even clear what height and width of the skewed, rotated, and mirrored form shall mean.

Thus, as a start, please define exactly which values you are after; based on that, you can try to determine them from arbitrary transformation matrices.
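To illustrate two possible definitions, here is a minimal sketch. It is not taken from iText; the class and method names (CtmMath, SideLengths, BoundingBox) are made up for this example. It assumes the six values a, b, c, d, e, f have been read from I11, I12, I21, I22, I31, I32 respectively, and that a point (x, y) of the unit square is mapped to (a·x + c·y + e, b·x + d·y + f). SideLengths returns the lengths of the image's own edges on the page; BoundingBox returns the axis-aligned rectangle that encloses the transformed image.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of iText.
public static class CtmMath
{
    // Lengths of the two edges of the parallelogram the unit square is mapped to.
    // (a, b) is the image of the bottom edge, (c, d) the image of the left edge.
    public static (float Width, float Height) SideLengths(float a, float b, float c, float d)
    {
        float width = (float)Math.Sqrt(a * a + b * b);
        float height = (float)Math.Sqrt(c * c + d * d);
        return (width, height);
    }

    // Axis-aligned bounding box of the transformed unit square on the page.
    public static (float X, float Y, float Width, float Height) BoundingBox(
        float a, float b, float c, float d, float e, float f)
    {
        // Images of the corners (0,0), (1,0), (0,1), (1,1).
        float[] xs = { e, a + e, c + e, a + c + e };
        float[] ys = { f, b + f, d + f, b + d + f };
        float minX = xs.Min(), minY = ys.Min();
        return (minX, minY, xs.Max() - minX, ys.Max() - minY);
    }
}
```

For an upright but vertically mirrored image with I11 = 200, I22 = -100, I31 = 50, I32 = 300, BoundingBox yields a lower left corner of (50, 200) and a size of 200 by 100. In such a case I32 is the top edge of the image on the page, so merely taking the absolute value of the height leaves the Y position off by the height.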


Another possible cause of inexplicable coordinate data for pages after the first one is that your code re-uses the PdfCanvasProcessor for each page without resetting it:

var strategy = new ImageRenderListener();
PdfCanvasProcessor parser = new PdfCanvasProcessor(strategy);
for (int pageNumber = 1; pageNumber <= Document.GetNumberOfPages(); pageNumber++)
{ 
    strategy.CurrentPageNumber = pageNumber;
    parser.ProcessPageContent(Document.GetPage(pageNumber));
}

This causes the graphics state at the end of one page to be used incorrectly as the starting graphics state of the next one. Instead, you should either use a new PdfCanvasProcessor instance for each page or call parser.Reset() at the start of each page.
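A minimal sketch of the Reset() variant could look as follows. It assumes the ImageRenderListener class from the question is in scope; the names PageWiseProcessing and ProcessAllPages are made up for illustration.

```csharp
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;

// Hypothetical wrapper; ImageRenderListener is the class from the question.
public static class PageWiseProcessing
{
    public static void ProcessAllPages(PdfDocument document, ImageRenderListener strategy)
    {
        var parser = new PdfCanvasProcessor(strategy);
        for (int pageNumber = 1; pageNumber <= document.GetNumberOfPages(); pageNumber++)
        {
            // Discard the graphics state left over from the previous page
            // so that every page starts from a fresh state.
            parser.Reset();
            strategy.CurrentPageNumber = pageNumber;
            parser.ProcessPageContent(document.GetPage(pageNumber));
        }
    }
}
```

Creating a new PdfCanvasProcessor inside the loop instead of calling Reset() should have the same effect.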

mkl
  • "I11 value is the height of the displayed image and the I22 value is the width." Are you sure you don't mean the other way around? Also are X and Y always I31 and I32? – Tim Jan 21 '21 at 18:02
  • *"Are you sure you don't mean the other way around?"* - You're right. I just fixed that. *"Also are X and Y always I31 and I32?"* - Yes, those values represent where the lower left corner of the original bitmap is mapped to. It's not necessarily the lower left of the result bitmap on the page, though, as the transformation may have mirrored or rotated it. – mkl Jan 21 '21 at 18:27
  • The failing coordinates I'm getting are of 0 degrees (i12 and i21 as 0). None of the values are negative. I'm looking at page 6 of this PDF https://www.nist.gov/system/files/documents/2019/09/11/nistir_8271_20190911.pdf Is there anything in the document's descriptor that might help? – Tim Jan 21 '21 at 19:27
  • I cannot find the `PDFImageInfo` class in your code which probably tries to extract the coordinates. Can you share it to allow reproducing the issue? – mkl Jan 22 '21 at 16:00
  • I am facing the same problem when analyzing PDF files generated with wkhtmltopdf. Could you please tell me whether you solved it? – Jesús Ángel Mar 02 '22 at 12:01
  • @JesúsÁngel *"I am facing the same problem when analyzing PDF files generated with wkhtmltopdf. Could you please tell me whether you solved it?"* - First of all it's not a problem. As explained in my answer, any of those matrix elements can be positive, zero, or negative. Why do you expect anything different? – mkl Mar 02 '22 at 12:10
  • @mkl I do appreciate your comment. The thing is that I am doing some tests with an HTML file that has just one JPG file placed at two different positions. The resulting PDF generated with wkhtmltopdf has two pages. The image copy placed on the first page has positive width and height values; however, not only does the image on the second page have a negative height, but it also has different absolute values. I expect both images to have the same dimensions. Besides, the only matrix indexes that are not 0 are width (I11), height (I22), x (I31), y (I32) and I33 which is 1. – Jesús Ángel Mar 02 '22 at 12:35
  • Do you re-use the `PdfCanvasProcessor` like the OP here does? In that case call its `Reset()` method for each page, otherwise the graphics state of one page incorrectly is used as starting graphics state of the next one. – mkl Mar 02 '22 at 15:53