
Converting PDF points to pixels works correctly: pixels = points / 72 * 300 (at 300 DPI).

  1. When I get the position (X, Y) of each text chunk in a PDF, the Y is measured
    from bottom to top, not top to bottom as in standard HTML or JavaScript.
  2. Converting that Y to a top-down value does not give an accurate Y position
    in the HTML style or the WinForms style.
  3. How do I get the correct top-down Y? Should I use the page height, the MediaBox
    or CropBox rectangle, or the rectangle from a text margin finder?

  4. The code I used is based on your example:

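    using System.Collections.Generic;
    using System.Linq;
    using iTextSharp.text;
    using iTextSharp.text.pdf.parser;
    
    //RectAndText is a small holder class for (rectangle, text, color); it is not shown here
    public class LocationTextExtractionStrategyClass : LocationTextExtractionStrategy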
    {
        //Hold each coordinate
        public List<RectAndText> myPoints = new List<RectAndText>();
        /*
        //The string that we're searching for
        public String TextToSearchFor { get; set; }
    
        //How to compare strings
        public System.Globalization.CompareOptions CompareOptions { get; set; }
    
        public MyLocationTextExtractionStrategy(String textToSearchFor, System.Globalization.CompareOptions compareOptions = System.Globalization.CompareOptions.None)
        {
            this.TextToSearchFor = textToSearchFor;
            this.CompareOptions = compareOptions;
        }
        */
        //Automatically called for each chunk of text in the PDF
        public override void RenderText(TextRenderInfo renderInfo)
        {
            base.RenderText(renderInfo);
    
            //See if the current chunk contains the text
            var startPosition = 0;// System.Globalization.CultureInfo.CurrentCulture.CompareInfo.IndexOf(renderInfo.GetText(), this.TextToSearchFor, this.CompareOptions);
    
            //If not found bail
            if (startPosition < 0)
            {
                return;
            }
    
            //Grab the individual characters
            var chars = renderInfo.GetCharacterRenderInfos().ToList();//.Skip(startPosition).Take(this.TextToSearchFor.Length)
            var charsText = renderInfo.GetText();
    
            //Skip empty chunks; First() and Last() throw on an empty list
            if (chars.Count == 0)
            {
                return;
            }
    
            //Grab the first and last character
            var firstChar = chars.First();
            var lastChar  = chars.Last();
    
            //Get the bounding box for the chunk of text
            var bottomLeft = firstChar.GetDescentLine().GetStartPoint();
            var topRight   = lastChar.GetAscentLine().GetEndPoint();
    
            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                                                    bottomLeft[Vector.I1],
                                                    bottomLeft[Vector.I2],
                                                    topRight[Vector.I1],
                                                    topRight[Vector.I2]
                                                    );
    
            BaseColor curColor = new BaseColor(0f, 0f, 0f);
            if (renderInfo.GetFillColor() != null)
                curColor = renderInfo.GetFillColor();
    
            //Add this to our main collection
            myPoints.Add(new RectAndText(rect, charsText, curColor));//this.TextToSearchFor));
        }
    }//end-of-txtLocation-class//
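
The conversion being asked about can be sketched without iText at all. This is a minimal illustration under stated assumptions, not a definitive implementation: `PdfToPixel` and `ToTopLeftPixels` are hypothetical names, `pageLeft` and `pageTop` stand for the left and top edges of the page box you choose (for example the CropBox), and the chunk rectangle in `Main` is made up. The key point is that the top is measured from the page's upper edge down to the rectangle's upper edge (ury), not its lower edge (lly).

```csharp
using System;

public static class PdfToPixel
{
    // Hypothetical helper: converts a PDF-space rectangle (origin at the bottom-left,
    // Y pointing up) to a top-left-origin rectangle in pixels at the given DPI.
    public static (double Left, double Top, double Width, double Height) ToTopLeftPixels(
        double llx, double lly, double urx, double ury,
        double pageLeft, double pageTop, double dpi)
    {
        double scale = dpi / 72.0;               // 72 points per inch
        double left = (llx - pageLeft) * scale;  // distance from the left page edge
        double top = (pageTop - ury) * scale;    // distance from the top page edge
        return (left, top, (urx - llx) * scale, (ury - lly) * scale);
    }

    public static void Main()
    {
        // Assumed page box [0 0 612 791.9717]; the chunk rectangle is illustrative only.
        var px = ToTopLeftPixels(72, 700, 172, 712, 0, 791.9717, 300);
        Console.WriteLine($"left={px.Left:F1} top={px.Top:F1} width={px.Width:F1} height={px.Height:F1}");
    }
}
```

With a `Rectangle` built as in the code above, the four inputs would come from its lower-left and upper-right corners; subtracting lly instead of ury shifts the result by the height of the text chunk.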
    
mkl
ell
  • Please show some code that causes the problem. – Fildor Jan 16 '18 at 09:11
  • What is so "standard", let alone "scientific", about **y** coordinate values increasing downwards with 0 at the top? The PDF standard is ISO 32000-2, and the standard uses **y** coordinates increasing upwards with 0 depending on the choice of the PDF creator. And coordinate systems in mathematics (which is science in its purest form) also usually have **y** coordinates increasing upwards. – mkl Jan 16 '18 at 09:16
  • That being said, you appear to have multiple problems converting regular PDF coordinates to your preferred coordinates. Not all of these problems appear to be a matter of coordinate system alone. I assume you also want the coordinates of a different point of the text chunk, not the baseline which is the point used in PDFs. Furthermore, you also appear to have problems with box dimensions. But all of this is pretty unclear in your question. Thus, please define clearly what type of coordinates you want. Also clarify your other problems (if my impression is correct that there are some). – mkl Jan 16 '18 at 09:24
  • When I was an engineering student, there was a huge rivalry between the University of Ghent and the University of Leuven. That rivalry also extended to the way we draw coordinate systems. In Ghent, we drew our coordinate system with the Y-axis pointing up; in Leuven, they drew the coordinate system with the Y-axis pointing down. Or was it the other way around? I don't remember, and it doesn't matter for a true developer. A true developer uses whatever convention is agreed upon in the spec. – Bruno Lowagie Jan 16 '18 at 09:36
  • As for the other questions: the question is really too broad. I've tried answering them all by pointing out that all of this has been discussed before on Stack Overflow, but it's hard to give an accurate answer because the OP doesn't explain what he wants to do and why his comments are relevant. The question reads as a rant more than it reads as a question. – Bruno Lowagie Jan 16 '18 at 09:38
  • @mkl, I have responded to all the comments and added code. pls note the the first coordinate (from 4) gives : (0,791.9717) the X is correct the Y = 791.97. I have to convert the Y of every other text chunk to get a correct (left, top) position, meaning top = (pageHeight - Y). – ell Jan 17 '18 at 13:52
  • Without the PDF in question I cannot say whether `791.9717` or `791.97` is correct (or whether both are false...); I'd also need the PDF in question to analyze the situation in iText if `791.9717` was wrong. Generally speaking iText is not known to return wrong entries here. On the other hand iText often uses single precision `float` variables for coordinates so in some contexts there may be rounding errors. – mkl Jan 17 '18 at 16:02
  • how I add the PDF to this site ? the Y position mismatch is more than rounding, it's about 9-11 points. I am not saying the page height of 791.9717 is wrong. What I cannot work out is the Y position of the text chunks measured from the top, which, if I understand correctly, is relative to the page height (with or without some offset or margin). – ell Jan 17 '18 at 17:44
  • mkl, Bruno: is my question understood? Do I have to add more code? In general, I am asking how to convert coordinates from the PDF system to, let's say, the HTML/JavaScript system in pixels, by looping over each text chunk. – ell Jan 18 '18 at 09:34
  • *"how I add the PDF to this site"* - stack overflow does not allow file uploads other than images. For PDFs, therefore, one usually provides a public share on a file sharing provider (e.g. public shares on google drive or dropbox; please no services that spam ads all over the screen or even try to make downloaders load adwarez or worse) and posts the download URL here. – mkl Jan 19 '18 at 11:01
  • @mkl I found a question similar to mine (from 2015, with Bruno in the discussion) that tries to do some calculation on Y to get a top-down Y value: https://stackoverflow.com/questions/27719060/how-to-change-the-coordiantes-of-a-text-in-a-pdf-page-from-lower-left-to-upper-l . I suspect that the Y coordinates in iText are little bias and not calculated well by CTS relative. the topic from 2015 does not give correct solution – ell Jan 21 '18 at 09:28
  • *"I suspect that the Y coordinates in iText are little bias and not calculated well by CTS relative. the topic from 2015 does not give correct solution"* - on the contrary, iText usually is correct in its calculations given a certain tolerance due to the single floats. – mkl Jan 21 '18 at 20:24
  • @mkl, it's much more than a "certain tolerance" due to single floats; the difference in my simple PDF files is about 9-11 PDF points. As for the topic from 2015 that I found, where you and Bruno were involved: can I implement any answer, from this discussion ? – ell Jan 22 '18 at 03:36
  • *"can I implement any answer, from this discussion ?"* - I don't know if you can. Try it. That being said, though, I'm still not sure what goes wrong in your case. I assume that your expectations are not aligned to your code. Unfortunately there is still no PDF file to check your code against, and your code does not really match the rest of your question. Thus, please clarify and provide the required information. – mkl Jan 22 '18 at 21:13
  • @mkl the attached pdf : https://drive.google.com/file/d/1fx_XammZGhPFgFP9k0n7A9vChRcJb1fm/view?usp=sharing – ell Jan 26 '18 at 11:24
  • First of all... *"pls note the the first coordinate (from 4) gives : (0,791.9717) the X is correct the Y = 791.97"* - The page dictionary of your PDF clearly contains `/CropBox[0 0 612.0 791.9717]/MediaBox[0 0 612.0 791.9717]`. Thus, the value `791.9717` iText returned was exactly correct while whoever made you believe in `791.97` was only approximately correct. – mkl Jan 26 '18 at 13:13
  • That being clarified, though, you mention you have more serious differences... *"the Y position mismatch is more than rounding, it's about 9-11 points"*. Thus, please describe a text chunk of which you believe there to be such a difference and tell which coordinate values you expect for it (and how you came to expect them) and which you get using iText. – mkl Jan 26 '18 at 13:18
  • @mkl following your last remark, I checked the page height again (791.9717 - uly (or ury)) and then converted to pixels, which gave me the correct Y-from-top in pixels. The mistake I made all along was that I misunderstood the four rectangle points (obtained from (llx, lly), (urx, ury)...) and did not use the correct Y of the PDF rectangle. To be sure, I checked the values in an image rendered from the PDF and got the same values from the image as from the PDF calculation. I was wrong, but all these checking caused me understand the PDF mechanizm. – ell Feb 16 '18 at 09:06
  • *"but all these checking caused me understand the PDF mechanizm"* - :) Yes, sometimes it takes some time to wrap one's brain around a concept... – mkl Feb 16 '18 at 12:12

1 Answer

You are asking many different questions in one post.

First let's start with the coordinate system in the PDF standard. Observe that I am talking about a standard, more specifically about ISO 32000. The coordinate system on a PDF page is explained in my answer to the Stack Overflow question How should I interpret the coordinates of a rectangle in PDF?

[Image: a rectangle on a PDF page, with lower-left corner (llx, lly) and upper-right corner (urx, ury)]

As you can see, a rectangle drawn in a PDF using a coordinate (llx, lly) for the lower-left corner and a coordinate (urx, ury) for the upper-right corner, assumes that the X-axis points to the right, and the Y-axis points upwards.

As for the width and the height of a page, that's explained in my answer to the Stack Overflow question How to Get PDF page width and Height?

For instance: you could have a /MediaBox that is defined as [0 0 595 842], and therefore measures 595 x 842 points (an A4 page), but that has a /CropBox that is defined as [5 5 590 837], which means that the visible area is only 585 x 832 points.
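
The arithmetic behind these numbers can be sketched as follows. This is a minimal illustration rather than iText API: the `PageBoxes` class and its `Size` method are hypothetical, and a box is assumed to be given as the four numbers [llx lly urx ury] read from the page dictionary.

```csharp
using System;

public static class PageBoxes
{
    // A PDF box is [llx lly urx ury]; its width and height are differences
    // between the corners, not the raw urx and ury values.
    public static (double Width, double Height) Size(double[] box)
    {
        return (box[2] - box[0], box[3] - box[1]);
    }

    public static void Main()
    {
        var media = Size(new double[] { 0, 0, 595, 842 });
        var crop = Size(new double[] { 5, 5, 590, 837 });
        Console.WriteLine($"MediaBox: {media.Width} x {media.Height}"); // MediaBox: 595 x 842
        Console.WriteLine($"CropBox: {crop.Width} x {crop.Height}");    // CropBox: 585 x 832
    }
}
```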

You also shouldn't assume that the lower-left corner of a page coincides with the (0, 0) coordinate. See Where is the Origin (x,y) of a PDF page?

When you create a document from scratch, a default margin of half an inch is used if you don't define a margin yourself. If you want to change the default, see Fit content on pdf size with iTextSharp?

Now for the height of a Chunk or, if you're using iText 7 (which you should, but —for some reason unknown to me— don't) the height of a Text object, this depends on the font size. The font size is an average size of the different glyphs in a font. If you look at the letter g, and you compare it with the letter h, you see that g takes more space under the baseline of the text than h, whereas h takes more space above the baseline than g.

If you want to calculate the exact space that is taken, read my answer to the question How to calculate the height of an element?

If the text snippet is used in the context of lines in a paragraph, you also have to take the leading into account: Changing text line spacing (Maybe that's not relevant in the context of your question, but it's good to know.)

If you have Chunk objects in iText 5, and you want to do specific things with these Chunks, you might benefit from using page events. See How to draw a line every 25 words?

If you want to add a colored background to a Chunk, it's even easier: How to set the paragraph of itext pdf file as rectangle with background color in Java

Update 1: All of the above may be irrelevant if you are looking to convert HTML to PDF. In that case, it's easy: use iText 7 + pdfHTML as described in Converting HTML to PDF using iText, and all the math is done by the pdfHTML add-on.

Update 2: There seems to be some confusion regarding the measurement units. The differences between user units, points and pixels are explained in the FAQ page How do the measurement systems in HTML relate to the measurement system in PDF?

Summarized:

1 in. = 25.4 mm = 72 user units by default (but it can be changed).
1 in. = 25.4 mm = 72 pt.
1 in. = 25.4 mm = 96 px.
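
These ratios translate directly into conversion helpers. The sketch below assumes the default of 72 user units per inch (so user units and points coincide); the `Units` class and its method names are hypothetical and not part of iText.

```csharp
using System;

public static class Units
{
    public const double PointsPerInch = 72.0;
    public const double CssPixelsPerInch = 96.0;
    public const double MillimetresPerInch = 25.4;

    // Points to CSS pixels (96 px per inch).
    public static double PointsToCssPixels(double pt) => pt * CssPixelsPerInch / PointsPerInch;

    // Points to millimetres (25.4 mm per inch).
    public static double PointsToMillimetres(double pt) => pt * MillimetresPerInch / PointsPerInch;

    // Points to pixels of a raster image rendered at an arbitrary DPI.
    public static double PointsToDevicePixels(double pt, double dpi) => pt * dpi / PointsPerInch;

    public static void Main()
    {
        // One inch (72 pt) expressed in each target unit.
        Console.WriteLine(PointsToCssPixels(72));
        Console.WriteLine(PointsToMillimetres(72));
        Console.WriteLine(PointsToDevicePixels(72, 300));
    }
}
```

The 96 px figure applies to CSS pixels; for an image rendered from the PDF at a chosen resolution, such as 300 DPI, the DPI takes the place of 96.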
Bruno Lowagie
  • Thanks for the detailed answer. Actually I need to get the (left,top) position in pixels. I have difficulty calculating the top in pixels: what I get from the PDF is ury (Y from the bottom). How do I convert ury or lly to a (left, top) position? I get a wrong top coordinate if I subtract lly from the PDF page height; maybe I have to add some offset or margin. That's all. – ell Jan 16 '18 at 20:25
  • About HTML to PDF: I'll check the iText version, but on my last try the PDF I got was not accurate, whereas with some other WebKit-based tools it was accurate. – ell Jan 16 '18 at 20:28
  • Please be aware that there's a difference between *user units* (PDF), *points* (typography), and *pixels* (images). The relation between these *different* measurement units is explained [here](https://developers.itextpdf.com/content/itext-7-converting-html-pdf-pdfhtml/chapter-7-frequently-asked-questions-about-pdfhtml/how-do-measurement-systems-html-relate-measurement-system-pdf). – Bruno Lowagie Jan 17 '18 at 08:00
  • I understand the difference. The X axis, the width, and the height (3-6 pixels larger) give correct values; only the Y in the PDF is reported as measured from the bottom? Do you have a property to get the correct Y, as you give the correct X, so that I do not need to make any calculation (you would do it internally)? – ell Jan 17 '18 at 08:08
  • There is also another problem with Y positioning: if the font is bigger, it "moves" the text chunk down (larger Y). I have to check how the chunk's rectangle influences the Y positioning. – ell Jan 17 '18 at 08:13
  • I don't have sufficient information to answer the question. The Y coordinate of the bottom of the page can be found in the `/CropBox`. So is the Y coordinate of the top of the page. The top and border coordinates of the content on that page are harder to define. It can help if there's an `/ArtBox`, `/TrimBox` or `/BleedBox`, but if not, you'll have to parse the content. – Bruno Lowagie Jan 17 '18 at 08:13
  • Is iText 7 the 5.5.7 version? Can I get it from NuGet? – ell Jan 17 '18 at 08:14
  • As for your remark "if the font is bigger, it moves down", you are probably talking about the *leading*, but your questions are very unclear. – Bruno Lowagie Jan 17 '18 at 08:14
  • iText 5 is being phased out. When I say iText 7, I mean [iText 7](https://developers.itextpdf.com/itext7/download-and-install-information/NET). You can install iText 7 Core by typing the following command in the NuGet Package Manager: `Install-Package itext7` – Bruno Lowagie Jan 17 '18 at 08:16
  • My questions are only about reading an existing PDF; in the test I made, the MediaBox and CropBox are the same. I am just asking whether you can add a Y position from the top, as you give the X position, so that I don't need to do any calculation to get the (left, top) position. – ell Jan 17 '18 at 08:18
  • I don't understand your question. You'll have to ask someone who understands. – Bruno Lowagie Jan 17 '18 at 08:18
  • @ell *"I need to get the (left,top) position"* - ah, so my impression was correct. Unfortunately you chose not to react to my comment to your question and clarify it. Additionally please add the code to your question which makes the problem including your expectations reproducible. – mkl Jan 17 '18 at 11:40
  • @mkl, I replied again under your remark a little above, have you seen it ? – ell Jan 19 '18 at 08:34
  • @ell *"have you seen it"* - I had not seen it until this comment of yours. Please always try to use the @... form to trigger notifications. – mkl Jan 19 '18 at 11:03