
Problem: I can parse the PDF positioning operators in a content stream. If the coordinates start from the bottom left, my calculations come out correct and I am able to tag the content properly.

Q1) If the starting coordinates are changed (i.e. top left, top right, or bottom right), the parsed coordinates no longer match the content to be tagged. How exactly should the calculation be done in this case?

Q2) If the starting point is changed, how does the content stream represent it?

For example: "0 7.98 -7.98 0 90.8898 715.4183 Tm".
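As far as I understand from the specification, the six operands of **Tm** are the matrix entries a b c d e f, where e and f give the translation and a, b, c, d carry rotation and scale. Below is a minimal Python sketch of reading the operation above that way. The helper names `parse_tm` and `describe_matrix` are hypothetical (they come from no library), and the decomposition assumes the matrix has no skew.

```python
import math

def parse_tm(operation: str):
    """Split an 'a b c d e f Tm' operation into its six numeric operands."""
    tokens = operation.split()
    if len(tokens) != 7 or tokens[-1] != "Tm":
        raise ValueError("expected six numbers followed by the Tm operator")
    return tuple(float(token) for token in tokens[:6])

def describe_matrix(a, b, c, d, e, f):
    """Decompose a PDF matrix into rotation, scale and translation.

    Assumes a pure rotation plus scaling (no skew); this is a
    simplification for illustration.
    """
    return {
        "rotation_deg": math.degrees(math.atan2(b, a)),  # counterclockwise
        "scale_x": math.hypot(a, b),
        "scale_y": math.hypot(c, d),
        "origin": (e, f),
    }

info = describe_matrix(*parse_tm("0 7.98 -7.98 0 90.8898 715.4183 Tm"))
print(info)
```

Read this way, the example places text rotated 90 degrees counterclockwise, scaled by 7.98, with its origin at (90.8898, 715.4183) in the coordinate system that is current at that point.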

To explain further, I am sharing two PDFs that should make this easier to understand.

PDF File One

In this file the page coordinates, i.e. (0,0), start from the bottom left, and we are able to tag all the data in it.

PDF File two

In this file the page coordinates (0,0) start from the top left. Similarly, there may be files where the coordinates start from the top right or the bottom right. The question is how to tag these kinds of files.
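To illustrate what I mean by a changed origin, here is a small Python sketch of one way a top-left origin could come about: a **cm** operation that flips the y axis. I have not confirmed that File two does this. The matrix `1 0 0 -1 0 792` and the page height of 792 are assumptions for illustration, and `apply_matrix` and `concat` are hypothetical helper names. The arithmetic follows the PDF convention of a row vector [x y 1] multiplied by the matrix [a b c d e f].

```python
def apply_matrix(matrix, x, y):
    """Transform the point (x, y) by a PDF matrix (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)

def concat(m1, m2):
    """Return m1 x m2, i.e. m1 is applied first and m2 second."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

# Assumed CTM set by "1 0 0 -1 0 792 cm": y now grows downward from the top edge.
flip_ctm = (1, 0, 0, -1, 0, 792)

# A point 50 units below the top edge, expressed in the flipped system.
point_in_default_space = apply_matrix(flip_ctm, 100, 50)

# A text matrix from "1 0 0 1 100 50 Tm", combined with the CTM.
text_matrix = (1, 0, 0, 1, 100, 50)
combined = concat(text_matrix, flip_ctm)

print(point_in_default_space)  # (100, 742)
print(combined)                # (1, 0, 0, -1, 100, 742)
```

If this is how it works, mapping every parsed position through the matrices in effect would bring it back to the bottom-left system in which my calculations are already correct.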

Thanks, Tejas

  • Actually I don't know where to start explaining. How good is your content stream scanning code really? On the one hand you say you *can parse PDF position operators in the content stream*; on the other hand you appear not to know how to handle the **Tm** operator. And clearly, the position and orientation of text in PDFs is determined by the **cm**, **Tm**, **Td**, **TD**, and a few more operators. Thus, please specify more clearly what your code currently does and in what manner it fails in your current tests. – mkl May 18 '22 at 09:04
  • Basically, I haven't started the coding part yet. I am trying to figure out how exactly we can tag the content on the pages in the different cases I have shown above with the example PDFs. I am very new to this and am trying to study more about PDF to come up with a solution. If you can give me a road map to a solution, that would be very helpful. Regarding "How good is your content stream scanning code really?": I just checked the coordinates of the page content in Adobe and reported them to you. – tejas Reddy p May 19 '22 at 03:55
  • Ok, so you're in the learning phase. In that case be sure to read section 8 "Graphics" and section 9 "Text" of the PDF specification ISO 32000, either part 1 or part 2 (the main difference is that in the current part 2 text objects may contain instructions for saving and restoring the graphics state). You may want to start with [this answer](https://stackoverflow.com/a/16483429/1729265), though, which points out some important details. – mkl May 19 '22 at 06:20
  • (If you don't have a copy of the specification yet, you can download a copy of part 1 with ISO headers removed at http://www.adobe.com/go/pdfreference/ ). – mkl May 19 '22 at 06:21
  • I am very thankful for the explanation. I will look into it and get back to you. Thanks. – tejas Reddy p May 20 '22 at 04:56
  • Hi mkl, I hope you are doing great. I have created a new question here; please let me know if you have any ideas on it: https://stackoverflow.com/questions/72401301/how-to-find-the-pdf-page-origin-for-an-existing-page – tejas Reddy p May 27 '22 at 06:29
