I am using the pdf2json library to parse a PDF.
It returns the parsed data as JSON, and I've attached some sample data below.
The main variables to keep in mind are:
Height - The height of the pdf in PAGE_UNITS
Width - The width of the pdf in PAGE_UNITS
sw - The space width of the font, as defined in the README.md of the pdf2json library
TS at index 1 - font size in pt
w - This is where my confusion lies. w is supposed to represent the width of the line of text. However, in the sample below my line of text has a width (96.648) greater than the width of the page (38.25), which doesn't make sense.
I need to get the length of the text. I've tried computing (number of chars in text * sw) / page width to get the ratio of the line's width relative to the width of the PDF. To test it, I then used that ratio in my frontend to draw over the specific line on an image of the same PDF.
But this doesn't seem to give me the correct length of the line. Usually it is too short.
If anyone could help, that would be greatly appreciated. I've been going through the pdf2json issues searching for something similar, but there have been no answers, and the library doesn't appear to be well supported.
"Pages": [
{
"Height": 49.5,
"HLines": [],
"VLines": [],
"Fills": [
{
"x": 0,
"y": 0,
"w": 0,
"h": 0,
"clr": 1
},
{
"x": 9.001,
"y": 19.271,
"w": 5.372,
"h": 0.038,
"clr": 35
}
],
"Texts": [
{
"x": 4.252,
"y": 45.981,
"w": 96.648,
"sw": 0.32553125,
"clr": 0,
"A": "left",
"R": [
{
"T": "Hello%20World%20",
"S": -1,
"TS": [
0,
15,
0,
0
]
}
]
},
"Width": 38.25
...
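
To make the numbers concrete, here is a minimal sketch of the calculation described above, applied to the sample values. It is written in Python only for illustration; the function names `estimated_ratio` and `reported_ratio` are hypothetical and not part of pdf2json, and it assumes that "T" is URI-encoded (as the `%20` in the sample suggests) and that "w", "sw" and "Width" can be compared directly.

```python
from urllib.parse import unquote

# Values copied from the sample pdf2json output above.
page_width = 38.25  # "Width" of the page
text_block = {
    "w": 96.648,
    "sw": 0.32553125,
    "R": [{"T": "Hello%20World%20", "TS": [0, 15, 0, 0]}],
}


def estimated_ratio(block, page_width):
    """Estimate from the question: (character count * sw) / page width."""
    # "T" is URI-encoded in the sample, so decode it before counting characters.
    text = "".join(unquote(run["T"]) for run in block["R"])
    return len(text) * block["sw"] / page_width


def reported_ratio(block, page_width):
    """Ratio obtained if "w" is assumed to be in the same units as "Width"."""
    return block["w"] / page_width


print(round(estimated_ratio(text_block, page_width), 4))  # 0.1021
print(round(reported_ratio(text_block, page_width), 4))   # 2.5267
```

The first value is my estimate (about 10% of the page width for "Hello World "), and the second shows the inconsistency: taken at face value, w is about 2.5 times the page width. My guess, and it is only a guess, is that the estimate comes out too short because sw is the width of a space, which in a proportional font is usually narrower than most other characters.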