I am looking for a way to write a function that forms words from the LTChar elements of a PDF. After printing out the objects in the PDF, I noticed there are no LTTextBox elements or anything like that. I want to get the coordinates of text, but the only option is LTChar, which only gives me the coordinates of each letter. The letters line up nicely, so there must be a way to build a dictionary to hold these items and then search it to find coordinates. Sorry, I’m trying to explain as best I can.

For example: I got this code from another Stack Overflow question and had to use LTChar instead of what’s there: How to extract text and text coordinates from a PDF file?

An example output would look something like this:

18, 26 F
20, 26 u
22, 26 n
30, 50 
23, 64 h
25, 64 e
28, 64 l
30, 64 l
32, 64 o

Etc.

Now, as you can see, the letters of the word ‘Fun’ share the same y-coordinate (26) but have different x-coordinates. What I'm looking to do is get a dictionary that looks something like this:

myDict = {'minXcoord': 18, 'maxXcoord': 22, 'Ycoord': 26, 'text': 'Fun'}

*This would also be done in a loop, because there are multiple instances of this case.

Is it possible to incorporate this?

Thanks in advance!

wpnewbie

1 Answer

Instead of printing the line like the other SO post, you could create a dictionary based on the LTTextBoxHorizontal object like this:

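from pdfminer.layout import LTTextBoxHorizontal

# obj is a layout object taken from the page layout, as in the linked post
if isinstance(obj, LTTextBoxHorizontal):
    my_dict = {
        'min_x': obj.x0,
        'max_x': obj.x1,
        'min_y': obj.y0,
        'max_y': obj.y1,
        'text': obj.get_text(),
    }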
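
If the layout really contains only LTChar objects, as described in the question, the characters could be grouped into words by hand. Below is a minimal sketch of one way to do that, not a pdfminer feature. It assumes each character has already been reduced to an (x, y, text) tuple like the example output in the question. The function name group_chars_into_words and the max_gap threshold are hypothetical and would need tuning for a real PDF, where comparing y values with a small tolerance and measuring the gap from the previous character's x1 to the next character's x0 would probably work better.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (x, y, character), as in the question's output.
Char = Tuple[float, float, str]


def group_chars_into_words(chars: List[Char], max_gap: float = 3.0) -> List[Dict]:
    """Group characters that share a y coordinate and sit close together in x."""
    words: List[Dict] = []
    current = None
    # Sort by line (y) first, then left to right (x).
    for x, y, text in sorted(chars, key=lambda c: (c[1], c[0])):
        starts_new_word = (
            current is None
            or text.isspace()
            or y != current['y']
            or x - current['max_x'] > max_gap
        )
        if starts_new_word:
            # Close the word in progress, if any.
            if current is not None:
                words.append(current)
            # Whitespace only separates words; it never starts one.
            if text.isspace():
                current = None
            else:
                current = {'min_x': x, 'max_x': x, 'y': y, 'text': text}
        else:
            # Same line and close enough: extend the current word.
            current['max_x'] = x
            current['text'] += text
    if current is not None:
        words.append(current)
    return words


# Sample data copied from the example output in the question.
sample = [
    (18, 26, 'F'), (20, 26, 'u'), (22, 26, 'n'),
    (30, 50, ' '),
    (23, 64, 'h'), (25, 64, 'e'), (28, 64, 'l'), (30, 64, 'l'), (32, 64, 'o'),
]
words = group_chars_into_words(sample)
# Look up the coordinates of a given word.
matches = [w for w in words if w['text'] == 'hello']
```

The result is a list with one dictionary per word, so searching for a word's coordinates becomes a simple filter over that list, as in the last line.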
Pieter