PDFminer: extract text with its font information

Question

I found this question, but it uses the command line, and I do not want to call a Python script from the command line via subprocess and then parse HTML files to get the font information.

I want to use PDFMiner as a library. I also found this question, but its answers only cover extracting plain text, without other information such as font name, font size, and so on.

Very interesting question, did you ever figure this out? – LBes, Nov 09 '18 at 14:30

score 8 · Answer 1 · answered Sep 16 '17 at 17:51

#!/usr/bin/env python
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer


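def createPDFDoc(fpath):
    # The file has to stay open while the document is being read,
    # so it is deliberately not closed here.
    fp = open(fpath, 'rb')
    parser = PDFParser(fp)
    document = PDFDocument(parser, password='')
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        # Raising a plain string is itself an error; raise an exception.
        raise ValueError("Not extractable")
    return document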


def createDeviceInterpreter():
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    return device, interpreter


def parse_obj(objs):
    for obj in objs:
        if isinstance(obj, pdfminer.layout.LTTextBox):
            for o in obj._objs:
                if isinstance(o, pdfminer.layout.LTTextLine):
                    text = o.get_text()
                    if text.strip():
                        for c in o._objs:
                            if isinstance(c, pdfminer.layout.LTChar):
                                # prints one line per character
                                print("fontname %s" % c.fontname)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)


document = createPDFDoc("/tmp/simple.pdf")
device, interpreter = createDeviceInterpreter()
pages = PDFPage.create_pages(document)
# next(pages) works on Python 2 and 3; pages.next() is Python 2 only
interpreter.process_page(next(pages))
layout = device.get_result()


parse_obj(layout._objs)

This works for getting the font name, but not font size or other attributes (italics, bold, etc.) — Agargara, Nov 01 '18 at 00:54
@Agargara did you find a way to get `font size` and perhaps other properties from pdf? — Pramesh Bajracharya, Jan 07 '19 at 05:39
@PrameshBajracharya I ended up editing the pdfminer source to get the font size. See: https://github.com/pdfminer/pdfminer.six/issues/202 However, note that this value still might not be the actual font size because of superscripts, etc. — Agargara, Jan 08 '19 at 00:24
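One way to soften the superscript caveat mentioned above is to report the most common character size on a line instead of the size of any single character. The following is only a sketch: `nominal_size` is a hypothetical helper, not part of pdfminer, and it takes plain numbers so it can be fed from whatever produces the per-character sizes (for example the `size` attribute of each LTChar, as used in other answers here).

```python
from collections import Counter


def nominal_size(sizes, ndigits=1):
    """Return the most common size after rounding, or None if empty.

    Hypothetical helper: `sizes` is any iterable of numbers, e.g. the
    per-character sizes collected from one text line.
    """
    rounded = [round(s, ndigits) for s in sizes]
    if not rounded:
        return None
    # most_common(1) returns [(value, count)] for the top value
    return Counter(rounded).most_common(1)[0][0]


# Ten body characters at about 12pt and two superscripts at about 8pt
print(nominal_size([12.0] * 10 + [7.98, 8.01]))
```

Rounding before counting is a design choice: sizes that differ only by floating-point noise then count as the same size.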

score 6 · Answer 2 · answered Nov 14 '21 at 11:17

Full disclosure: I am one of the maintainers of pdfminer.six, a community-maintained version of pdfminer for Python 3.

Nowadays, pdfminer.six has multiple APIs to extract text and information from a PDF. For programmatically extracting information I would advise using extract_pages(). This allows you to inspect all of the elements on a page, ordered in a meaningful hierarchy created by the layout algorithm.

The following example is a Pythonic way of showing all the elements in the hierarchy. It uses the simple1.pdf from the samples directory of pdfminer.six.

from pathlib import Path
from typing import Iterable, Any

from pdfminer.high_level import extract_pages


def show_ltitem_hierarchy(o: Any, depth=0):
    """Show location and text of LTItem and all its descendants"""
    if depth == 0:
        print('element                        fontname             text')
        print('------------------------------ -------------------- -----')

    print(
        f'{get_indented_name(o, depth):<30.30s} '
        f'{get_optional_fontinfo(o):<20.20s} '
        f'{get_optional_text(o)}'
    )

    if isinstance(o, Iterable):
        for i in o:
            show_ltitem_hierarchy(i, depth=depth + 1)


def get_indented_name(o: Any, depth: int) -> str:
    """Indented name of class"""
    return '  ' * depth + o.__class__.__name__


def get_optional_fontinfo(o: Any) -> str:
    """Font info of LTChar if available, otherwise empty string"""
    if hasattr(o, 'fontname') and hasattr(o, 'size'):
        return f'{o.fontname} {round(o.size)}pt'
    return ''


def get_optional_text(o: Any) -> str:
    """Text of LTItem if available, otherwise empty string"""
    if hasattr(o, 'get_text'):
        return o.get_text().strip()
    return ''


path = Path('~/Downloads/simple1.pdf').expanduser()
pages = extract_pages(path)
show_ltitem_hierarchy(pages)

The output shows the different elements in the hierarchy, the font name and size if available and the text that this element contains.

element                        fontname             text
------------------------------ -------------------- -----
generator                                           
  LTPage                                            
    LTTextBoxHorizontal                             Hello
      LTTextLineHorizontal                          Hello
        LTChar                 Helvetica 24pt       H
        LTChar                 Helvetica 24pt       e
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             World
      LTTextLineHorizontal                          World
        LTChar                 Helvetica 24pt       W
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       r
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
    LTTextBoxHorizontal                             Hello
      LTTextLineHorizontal                          Hello
        LTChar                 Helvetica 24pt       H
        LTChar                 Helvetica 24pt       e
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             World
      LTTextLineHorizontal                          World
        LTChar                 Helvetica 24pt       W
        LTChar                 Helvetica 24pt       o
        LTChar                 Helvetica 24pt       r
        LTChar                 Helvetica 24pt       l
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
    LTTextBoxHorizontal                             H e l l o
      LTTextLineHorizontal                          H e l l o
        LTChar                 Helvetica 24pt       H
        LTAnno                                      
        LTChar                 Helvetica 24pt       e
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             W o r l d
      LTTextLineHorizontal                          W o r l d
        LTChar                 Helvetica 24pt       W
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       r
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       d
        LTAnno                                      
    LTTextBoxHorizontal                             H e l l o
      LTTextLineHorizontal                          H e l l o
        LTChar                 Helvetica 24pt       H
        LTAnno                                      
        LTChar                 Helvetica 24pt       e
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       
        LTAnno                                      
    LTTextBoxHorizontal                             W o r l d
      LTTextLineHorizontal                          W o r l d
        LTChar                 Helvetica 24pt       W
        LTAnno                                      
        LTChar                 Helvetica 24pt       o
        LTAnno                                      
        LTChar                 Helvetica 24pt       r
        LTAnno                                      
        LTChar                 Helvetica 24pt       l
        LTAnno                                      
        LTChar                 Helvetica 24pt       d
        LTAnno
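
Because the hierarchy above carries font info on every single character, it is often handier to merge neighbouring characters that share a font into runs. A minimal sketch, assuming the characters have already been flattened into (text, fontname, size) tuples; `group_font_runs` is an illustrative name, not a pdfminer.six function.

```python
from itertools import groupby


def group_font_runs(chars):
    """Merge consecutive (text, fontname, size) tuples sharing a font.

    Returns a list of (fontname, rounded_size, text) runs. Illustrative
    helper, not part of pdfminer.six.
    """
    def font_key(item):
        _text, fontname, size = item
        return fontname, round(size)

    return [
        (fontname, size, ''.join(text for text, _, _ in group))
        for (fontname, size), group in groupby(chars, key=font_key)
    ]


# Tuples like these could be built from LTChar objects via
# (c.get_text(), c.fontname, c.size).
sample = [
    ('H', 'Helvetica', 24.0),
    ('i', 'Helvetica', 24.0),
    ('!', 'Helvetica-Bold', 24.0),
]
print(group_font_runs(sample))
```

Only consecutive characters are merged, so a font that appears, disappears and comes back yields separate runs, which keeps the reading order intact.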

(Similar answers here, here and here; I'll try to keep them in sync.)

What about "boldness" or "italicness", is information available about that? Sometimes you can get some info through the font name, but when fonts are embedded in a PDF, they usually have coded names. So, how can we get information about "boldness" or "italicness" in a generic way from a PDF (other than by using computer vision)? – Visionscaper, Feb 19 '23 at 13:59
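
For the bold/italic question, one common fallback is to look at the font name after removing the subset tag (six uppercase letters followed by a plus sign). This is only a heuristic sketch: `guess_style` is a hypothetical helper, the keyword lists are assumptions, and it cannot help when an embedded font has an opaque name.

```python
import re

# Subset-embedded fonts are named with a tag of six uppercase letters
# and a plus sign, e.g. "QTBAAA+Arial-BoldMT".
_SUBSET_TAG = re.compile(r'^[A-Z]{6}\+')


def guess_style(fontname):
    """Guess bold/italic from a font name; a heuristic, not a guarantee."""
    name = _SUBSET_TAG.sub('', fontname)
    lowered = name.lower()
    return {
        'name': name,
        # keyword lists are assumptions; extend them for your documents
        'bold': 'bold' in lowered,
        'italic': 'italic' in lowered or 'oblique' in lowered,
    }


print(guess_style('QTBAAA+Arial-BoldMT'))
print(guess_style('Helvetica-BoldOblique'))
```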

score 3 · Answer 3 · edited May 23 '17 at 12:10

This approach does not use PDFMiner but does the trick.

First, convert the PDF document to docx. Using python-docx you can then retrieve the font information. Here's an example that prints all the bold text:

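from docx import Document

document = Document('/path/to/file.docx')

for para in document.paragraphs:
    for run in para.runs:
        # run.bold is True only when bold is set directly on the run;
        # None means the value is inherited from the style.
        if run.bold:
            print(run.text)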

If you really want to use PDFMiner you can try this. Passing '-t html' to pdf2txt.py converts the PDF into HTML with all the font information.

answered Mar 26 '17 at 08:55 by Samkit Jain; edited May 23 '17 at 12:10 by Community

But when converting PDF to docx, might some of that information be lost? – Gunjan naik, Sep 27 '18 at 08:19
Is there any reliable library for converting PDF to Docx? – Sandip Kumar, Sep 12 '19 at 11:55

score 3 · Answer 4 · answered May 18 '20 at 08:02

I hope this could help you :)

Get the font-family:

if isinstance(c, pdfminer.layout.LTChar):
    print (c.fontname)

Get the font-size:

if isinstance(c, pdfminer.layout.LTChar):
    print (c.size)

Get the character position (bounding box):

if isinstance(c, pdfminer.layout.LTChar):
    print (c.bbox)

Get the info of image:

if isinstance(obj, pdfminer.layout.LTImage):
    outputImg = "<Image>\n"
    outputImg += ("name: %s, " % obj.name)
    # bbox is (x0, y0, x1, y1); x0/y0 is the lower-left corner
    outputImg += ("x: %f, " % obj.bbox[0])
    outputImg += ("y: %f\n" % obj.bbox[1])
    # size of the image as placed on the page
    outputImg += ("width1: %f, " % obj.width)
    outputImg += ("height1: %f, " % obj.height)
    # size recorded in the image stream itself
    outputImg += ("width2: %f, " % obj.stream.attrs['Width'])
    outputImg += ("height2: %f\n" % obj.stream.attrs['Height'])
    print(outputImg)
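
The snippets above assume that `c` and `obj` come from walking a page layout. A possible full loop is sketched below, assuming pdfminer.six and its extract_pages() function. Like the maintainer's example in another answer, it recognises characters by their fontname and size attributes rather than by class; `iter_chars` and `print_char_info` are names made up for this sketch.

```python
from collections.abc import Iterable


def iter_chars(item):
    """Yield every descendant of `item` that has fontname and size."""
    if hasattr(item, 'fontname') and hasattr(item, 'size'):
        yield item  # this is the `c` used in the snippets above
    elif isinstance(item, Iterable) and not isinstance(item, str):
        for child in item:
            yield from iter_chars(child)


def print_char_info(pdf_path):
    # Imported here so that iter_chars() stays usable on its own;
    # requires pdfminer.six to be installed.
    from pdfminer.high_level import extract_pages

    for c in iter_chars(extract_pages(pdf_path)):
        print(c.get_text(), c.fontname, c.size, c.bbox)
```

Calling `print_char_info('/tmp/simple.pdf')` would then print one line per character. The same walk could be extended with an `isinstance(obj, LTImage)` branch for the image case.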

Interesting, can you please provide full code snippet? The variable `c` is not understood. — Mohith7548, Jun 28 '21 at 06:17

score 2 · Answer 5 · answered May 11 '20 at 11:56

To get the font size or font name from a PDF file with the PDFMiner library, you have to interpret the whole PDF page. First decide which word or phrase you want the font size and font name for, since a page can contain words with different font sizes.

The structure PDFMiner produces for a page is: PDFPageInterpreter -> LTTextBox -> LTChar

Once you have found the word, read the size attribute for the font size (which is actually the height) and the fontname attribute for the font. The code should look like this: you pass the PDF file path, the word you want the font size for, and the number of the page the word is on:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTChar, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def get_fontsize_and_fontname_for_word(pdf_path, word, page_number):
    resource_manager = PDFResourceManager()
    layout_params = LAParams()
    device = PDFPageAggregator(resource_manager, laparams=layout_params)
    pdf_page_interpreter = PDFPageInterpreter(resource_manager, device)
    # both stay None if the word is not found on the requested page
    actual_font_size_pt = actual_font_name = None

    # open() replaces the Python 2-only file() builtin
    with open(pdf_path, 'rb') as pdf_file:
        for current_page_number, page in enumerate(PDFPage.get_pages(pdf_file)):
            if current_page_number == int(page_number) - 1:
                pdf_page_interpreter.process_page(page)
                layout = device.get_result()
                for textbox_element in layout:
                    if isinstance(textbox_element, LTTextBox):
                        for line in textbox_element:
                            word_from_textbox = line.get_text().strip()
                            if word in word_from_textbox:
                                for char in line:
                                    if isinstance(char, LTChar):
                                        # convert pixels to points; this assumes
                                        # char.size is in 96-dpi pixels, which is
                                        # worth verifying against a known PDF
                                        actual_font_size_pt = int(char.size) * 72 / 96
                                        # remove a subset prefix such as QTBAAA+
                                        actual_font_name = char.fontname.split('+', 1)[-1]
    device.close()
    return actual_font_size_pt, actual_font_name

You can check which other properties the LTChar class supports.

Can you please help me understand how you arrived at the formula for calculating actual_font_size_pt? – Potatojaisiladki, Apr 27 '21 at 19:30
I've converted pixels to points: points = pixels * 72 / 96. See https://stackoverflow.com/questions/139655/convert-pixels-to-points – adambogdan1993, Sep 23 '21 at 09:38
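
The formula from the comment above, written out as a small sketch. It relies on the usual definition of 72 points per inch and, by assumption, 96 pixels per inch; `px_to_pt` is an illustrative helper. Note that PDF coordinates are by default already expressed in units of 1/72 inch, so it is worth checking against a document with a known font size whether the value reported by pdfminer needs this conversion at all.

```python
def px_to_pt(pixels, dpi=96):
    """Convert a length in pixels to points (1 pt = 1/72 inch)."""
    return pixels * 72 / dpi


print(px_to_pt(16))          # 16 px at the assumed 96 dpi
print(px_to_pt(24, dpi=72))  # at 72 dpi a pixel and a point coincide
```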

score 1 · Answer 6 · answered Jan 05 '16 at 08:10

Have a look at PDFlib. It can extract the font info you require and has a Python library that you can import and use in your scripts.

answered Jan 05 '16 at 08:10 by gplayer

score 0 · Answer 7 · answered Sep 23 '19 at 12:46

Some of this information lives at a lower level, in the LTChar class. That seems logical, because font size, italic, bold, etc. can be applied to a single character.

More info here: https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L222

But I'm still confused about why font color is not in this class.
