I want to extract specific text from PDF files, together with its position on the page.
I know that xpdf and MuPDF can parse PDF files, so I think they could help me with this task.
But how do I use these two libraries to get the text position?
If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):
import json  # needed to parse the JSON output below
import fitz  # the PyMuPDF module
doc = fitz.open("input.pdf")  # PDF input file
n = 0  # index of the page to read (0-based); adjust as needed
page = doc[n]
# A list of all words on the page, each with its position info
# (a rectangle containing the word):
wordlist = page.getTextWords()
# Or, if you are only interested in blocks of lines that belong together:
blocklist = page.getTextBlocks()
# If you need more detail, use the JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.getText("json"))
We are on GitHub if you are interested.
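To get from the word list to the position of one specific word, you can filter the list yourself. The following is only a minimal sketch: it assumes each entry has the layout `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, which is what `getTextWords()` returns. The helper name `find_word_positions` and the sample data are made up for illustration, so the snippet runs without a PDF.

```python
def find_word_positions(words, target):
    """Return the bounding boxes (x0, y0, x1, y1) of every word equal to target.

    `words` is assumed to be a list of tuples laid out as
    (x0, y0, x1, y1, word, block_no, line_no, word_no).
    The comparison ignores case; punctuation attached to a word is not stripped.
    """
    target = target.lower()
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        if text.lower() == target:
            hits.append((x0, y0, x1, y1))
    return hits


# Hand-made sample standing in for the result of page.getTextWords()
sample_words = [
    (72.0, 72.0, 100.5, 84.0, "Hello", 0, 0, 0),
    (104.0, 72.0, 140.0, 84.0, "world", 0, 0, 1),
    (72.0, 90.0, 100.5, 102.0, "hello", 0, 1, 0),
]

positions = find_word_positions(sample_words, "hello")
print(positions)
```

With a real document, you would pass `wordlist` from the snippet above instead of `sample_words`. Matching a phrase that spans several words would need extra logic, for example comparing consecutive entries that share the same block and line number.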
MuPDF comes with a couple of tools, one being pdfdraw.
If you use pdfdraw with the -tt option, it will generate XML output containing all characters and their exact positioning information.
From there you should be able to find what you need.