How to get the position of specified text using xpdf or MuPDF?

Question

I want to extract some specified text from PDF files, along with the position of that text.

I know xpdf and MuPDF can parse PDF files, so I think they may help me accomplish this task.

But how do I use these two libraries to get the text position?

@DanD. By text position I mean the position of the text's first character on the page. – PDF1001, Sep 23 '11 at 01:29

score 3 · Answer 1 · answered Jan 15 '18 at 22:04

If you don't mind using a Python binding for MuPDF, here is a Python solution using PyMuPDF (I am one of its developers):

import json                     # needed for json.loads() below
import fitz                     # the PyMuPDF module

n = 0                           # example page index (assumption: first page)
doc = fitz.open("input.pdf")    # PDF input file
page = doc[n]                   # page number n (0-based)
wordlist = page.getTextWords()  # gives you a list of all words on the
# page, together with their position info (a rectangle containing the word)

# or, if you are only interested in blocks of lines belonging together:
blocklist = page.getTextBlocks()

# If you need yet more details, use a JSON-based output, which also gives
# images and their positions, as well as font information for the text.
tdict = json.loads(page.getText("json"))
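
Since the question asks for the position of some specified text, here is a minimal sketch of how the word list could be filtered for one target word. It assumes each entry has the layout PyMuPDF documents for word lists, `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The names `find_word_positions` and `word_positions_in_pdf` are hypothetical helpers made up for this illustration, not part of PyMuPDF, and the wrapper falls back to the older `getTextWords()` spelling when `get_text` is not available.

```python
def find_word_positions(words, target, ignore_case=True):
    """Return the (x0, y0, x1, y1) rectangle of every word equal to `target`.

    `words` is assumed to hold tuples shaped like PyMuPDF word entries:
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    def norm(s):
        return s.lower() if ignore_case else s

    wanted = norm(target)
    return [tuple(w[:4]) for w in words if norm(w[4]) == wanted]


def word_positions_in_pdf(path, page_number, target):
    """Hypothetical wrapper: open a PDF and look up `target` on one page."""
    import fitz  # imported lazily so the pure helper above works without PyMuPDF

    doc = fitz.open(path)
    try:
        page = doc[page_number]
        # Newer releases spell this page.get_text("words");
        # older ones (as in the answer above) use page.getTextWords().
        if hasattr(page, "get_text"):
            words = page.get_text("words")
        else:
            words = page.getTextWords()
        return find_word_positions(words, target)
    finally:
        doc.close()
```

The top-left corner `(x0, y0)` of a returned rectangle can serve as an approximation of where the first character starts. Note that this matches whole words only, and that punctuation attached to a word (for example `"word,"`) is part of its text, so a real lookup may need to strip it first.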

We are on GitHub if you are interested.

Jorj McKie

If you don't need all the words, would it be faster to use the `search` method, which also provides positions, than to use `extractWORDS()`? – edvard_munch Jan 18 '21 at 15:39

@edvard_munch - not really, because both methods create the same intermediate object ``TextPage`` and then perform their specific tasks using properties of that object. The textpage does all the work of extracting (text **and** images as required). – Jorj McKie Jan 19 '21 at 18:14
Regarding my original answer, please also note that there is a new option, ``page.getText("dict")``, which directly produces a Python dictionary instead of going through intermediate JSON. So the respective statement would just be ``tdict = page.getText("dict")``, which needs no JSON and is several times faster ... – Jorj McKie Jan 19 '21 at 18:18
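
To illustrate the dictionary output mentioned in the comment above, here is a rough sketch of walking its blocks, lines and spans to collect text together with bounding boxes. It assumes the nesting PyMuPDF documents for the "dict" format (`blocks` → `lines` → `spans`, with `type` 0 marking a text block and each span carrying `text`, `bbox` and `font`). The helpers `iter_spans` and `spans_containing` are hypothetical names introduced only for this example.

```python
def iter_spans(tdict):
    """Yield (text, bbox, font) for every text span in a "dict"-style result."""
    for block in tdict.get("blocks", []):
        if block.get("type") != 0:  # assumed: 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield span["text"], tuple(span["bbox"]), span.get("font")


def spans_containing(tdict, needle):
    """Return (text, bbox) for every span whose text contains `needle`."""
    return [(text, bbox) for text, bbox, _ in iter_spans(tdict) if needle in text]
```

With a real page this might be called as `spans_containing(page.getText("dict"), "invoice")`. The returned box covers the whole span, not just the matched substring, so it only narrows down where the text sits.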

score 1 · Accepted Answer · edited Dec 02 '11 at 09:46

MuPDF comes with a couple of tools, one of them being pdfdraw.

If you use pdfdraw with the -tt option, it will generate XML output containing all characters and their exact positioning information.
From there you should be able to find what you need.

Chandra Sekhar

answered Dec 02 '11 at 09:29

Robert

In newer versions it is called mudraw.c and the trail leads to structured-text.h and stext-output.c, very helpful, thanks. – Andrew Apr 30 '14 at 13:04
