I am trying to extract bold text from PDFs using PyMuPDF 1.18.14. I expected the following to work, since I understood from the docs that flags=4 targets bold font:

page = doc[1]
text = page.get_text(flags=4)
print(text)

But it prints out all text on the page and not just bold text.

When using TextPage.extractDICT() (or Page.get_text("dict")) like this:

page.get_text("dict", flags=11)["blocks"]

The flag has an effect, but I am having trouble understanding what it does. Perhaps it switches between image and text blocks.
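
Decoding the value may help here. As far as I can tell from the PyMuPDF docs, the flags argument of get_text() is a bit field of extraction options, not font styles: 1 preserves ligatures, 2 preserves whitespace, 4 preserves images and 8 inhibits inserted spaces. The sketch below redefines those values locally so it runs without PyMuPDF; the names mirror the documented TEXT_* constants, and decode_extraction_flags is a hypothetical helper, not part of the library.

```python
# Extraction-flag bit values as documented for PyMuPDF's TEXT_* constants.
# They are redefined here so the sketch runs without PyMuPDF installed.
TEXT_FLAGS = {
    "TEXT_PRESERVE_LIGATURES": 1,
    "TEXT_PRESERVE_WHITESPACE": 2,
    "TEXT_PRESERVE_IMAGES": 4,
    "TEXT_INHIBIT_SPACES": 8,
}


def decode_extraction_flags(flags):
    """Return the names of the extraction options set in `flags` (hypothetical helper)."""
    return [name for name, bit in TEXT_FLAGS.items() if flags & bit]


print(decode_extraction_flags(11))  # 11 = 1 + 2 + 8, image bit not set
print(decode_extraction_flags(4))   # 4 = image bit only
```

If that reading is right, flags=11 leaves the image bit unset, which would explain why only text blocks come back, and flags=4 only asks for images to be kept, which plain-text output ignores anyway.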

Span

So it seems you have to get to the span to be able to access flags.

<page>
    <text block>
        <line>
            <span>
                <char>
    <image block>
        <img>

So you can then do something like this. I compared each span's flags value against 20 to pick out the bold font:

page = doc[1]
blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            if s["flags"] == 20:  # 20 targets bold
                print(s)
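
A possibly more robust variant, assuming the span flags are a bit field as the docs describe (1 superscript, 2 italic, 4 serifed, 8 monospaced, 16 bold): 20 would then be bold plus serifed, so an equality test would miss bold sans-serif or bold italic spans. The sketch below tests only the bold bit. bold_spans is a hypothetical helper, and the sample dictionary is a hand-made stand-in shaped like the output of page.get_text("dict"), not real extraction output.

```python
BOLD = 16  # 2**4: the bold bit of a span's "flags" value, per the PyMuPDF docs


def bold_spans(page_dict):
    """Yield spans whose bold bit is set (hypothetical helper).

    `page_dict` is assumed to have the shape returned by page.get_text("dict").
    """
    for block in page_dict["blocks"]:
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                if span["flags"] & BOLD:
                    yield span


# Hand-made stand-in for a real page dictionary, for illustration only.
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [
            {"text": "Bold serif", "flags": 20},
            {"text": "Regular serif", "flags": 4},
            {"text": "Bold sans", "flags": 16},
            {"text": "Bold italic", "flags": 18},
        ]}]},
        {"type": 1},  # image block without lines
    ]
}

print([s["text"] for s in bold_spans(sample)])
```

With a real document, the same helper could be called as bold_spans(page.get_text("dict", flags=11)).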

But this seems like a long way around.

So my question is: is this the best way of finding bold elements, or is there something I am missing?

It would be great to be able to search for bold elements using page.search_for().

Cam
    I am looking to do something similar. Your Q was surprisingly useful in helping me navigate the documentation. So thanks. From what I can see there is no simpler way. – RodP Feb 12 '22 at 10:49