How can I use Python to find the page numbers where a certain font is used in a PDF?

I tried the PyPDF2 library, but it did not produce the expected output. For example, I want to print the numbers of the pages where the Arial font is used.

Here is the MWE:

import PyPDF2

pdf_file_path = "input.pdf"
target_font = "Arial"

pdf = PyPDF2.PdfReader(open(pdf_file_path, "rb"))

# Iterate through the pages of the PDF
for page_number in range(len(pdf.pages)):
    page = pdf.pages[page_number]
    fonts = page['/Resources']['/Font']

    # Check if the target font is used on the page
    if any(target_font.lower() in font.lower() for font in fonts.keys()):
        print("Font", target_font, "is used on page", page_number + 1)
TeX_learner
  • Maybe that helps: https://stackoverflow.com/questions/34606382/pdfminer-extract-text-with-its-font-information – tfv May 27 '23 at 09:19
  • @tfv, I don't want to extract the text; I only need to extract the font information – TeX_learner May 27 '23 at 09:28
  • For what it's worth, PDF files don't necessarily preserve information like what fonts were used. For example, a PDF file produced by Microsoft Print to PDF that used the Arial and Times New Roman fonts refers to the fonts as CIDFont+F1 and CIDFont+F2. – GordonAitchJay May 27 '23 at 10:09
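
To see which names a particular PDF actually stores, it may help to list the /BaseFont entries rather than the keys of the /Font dictionary, since those keys are usually resource labels such as /F1. The following is a minimal sketch, not a definitive implementation: the helper names base_font_names and pages_using_font are hypothetical, it assumes pypdf (the successor of PyPDF2) is installed, and it ignores fonts that are referenced only from nested form XObjects.

```python
def base_font_names(font_resources):
    """Return the /BaseFont value of every font in a page's /Font dictionary.

    font_resources is any mapping from resource labels (e.g. "/F1") to font
    dictionaries; a pypdf DictionaryObject can be used the same way.
    """
    names = []
    for label in font_resources:
        font = font_resources[label]  # indexing resolves indirect references in pypdf
        if "/BaseFont" in font:       # Type 3 fonts may have no /BaseFont entry
            names.append(str(font["/BaseFont"]).lstrip("/"))
    return names


def pages_using_font(pdf_path, target_font):
    """Return the 1-based numbers of pages whose font resources mention target_font."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    hits = []
    for page_number, page in enumerate(PdfReader(pdf_path).pages, start=1):
        if "/Resources" not in page or "/Font" not in page["/Resources"]:
            continue  # page declares no fonts at the top level
        names = base_font_names(page["/Resources"]["/Font"])
        if any(target_font.lower() in name.lower() for name in names):
            hits.append(page_number)
    return hits

# Example call (needs a real file): pages_using_font("input.pdf", "Arial")
```

If the PDF was produced the way the comment above describes, the listed names will be things like CIDFont+F1, and no name-based search can recover "Arial" from them.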

2 Answers

A solution using PyMuPDF:

import fitz  # PyMuPDF

target_font = "arial"

doc = fitz.open("input.pdf")
for page in doc:
    fontlist = page.get_fonts()
    for xref, ext, ftype, fontname, _, _ in fontlist:
        if target_font in fontname.lower():  # match may not be exact with subset fonts
            print(f"Page {page.number + 1} uses {target_font}")  # page.number is 0-based
            break
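
Regarding the subset-font caveat in the comment above: embedded subsets are normally named with a tag of six uppercase letters and a plus sign in front of the real name (for example PMGGAE+ArialMT). A small sketch of how the comparison could ignore that tag follows; strip_subset_tag and font_matches are hypothetical helper names, not part of PyMuPDF.

```python
import re

# A subset tag is six uppercase letters followed by "+", e.g. "PMGGAE+ArialMT".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname):
    """Remove a leading subset tag, if there is one."""
    return _SUBSET_TAG.sub("", fontname)


def font_matches(fontname, target_font):
    """Case-insensitive substring match that ignores the subset tag."""
    return target_font.lower() in strip_subset_tag(fontname).lower()
```

Inside the loop, the test could then read `if font_matches(fontname, target_font):`. Stripping the tag first avoids accidental hits on the random tag letters themselves.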
Jorj McKie

If you have the Poppler utilities installed (they are not part of Python itself), you can call pdffonts from Python via the shell, or more simply use the shell directly. Here, in Windows, the command below shows that this file uses Arial on pages 3 and 6. You can redirect the output to a file, or process it by whatever other means you need.

Here is a Windows command line:

for /L %L in (1 1 6) do @pdffonts -f %L -l %L my1.pdf|find /i "arial"&if not errorlevel 1 echo Arial found on Page %L

Result for my1.pdf

PMGGAE+ArialMT                       CID TrueType      Identity-H       yes yes yes    306  0
Arial found on Page 3
PMGGAE+ArialMT                       CID TrueType      Identity-H       yes yes yes    306  0
Arial found on Page 6

You can run this from a batch file to loop over several filenames, but Python is good at that itself. One limitation of the single-line form is that you need to know the number of pages in the file beforehand (the 6 in the loop above). That would require a variable, which is easy to set on a separate line or by querying the page count from Python.
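
As a rough sketch of doing the same loop from Python without hard-coding the page count: this assumes Poppler's pdfinfo and pdffonts are on the PATH, and the function names are made up for illustration. The page count is read from the "Pages:" line of pdfinfo, and the font test mirrors the `find /i` filter by searching each whole output line.

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the number from the 'Pages:' line printed by pdfinfo."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def font_lines(pdffonts_output, target_font):
    """Return the pdffonts table rows that mention target_font (case-insensitive)."""
    rows = pdffonts_output.splitlines()[2:]  # skip the column header and the dashes
    return [row for row in rows if target_font.lower() in row.lower()]


def run(args):
    """Run a command and return its standard output as text."""
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout


def pages_with_font(pdf_path, target_font):
    """Return the 1-based page numbers on which pdffonts lists target_font."""
    page_count = parse_page_count(run(["pdfinfo", pdf_path]))
    hits = []
    for page in range(1, page_count + 1):
        output = run(["pdffonts", "-f", str(page), "-l", str(page), pdf_path])
        if font_lines(output, target_font):
            hits.append(page)
    return hits

# Example call (needs Poppler and a real file): pages_with_font("my1.pdf", "arial")
```

Running pdffonts once per page is slow for large files, but it keeps the logic identical to the one-line command above.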

>type found.txt
Arial found on Page 3 of my1.pdf
Arial found on Page 6 of my1.pdf
K J