You have basically two options to get to the text:
- Direct text extraction from each page as-is.
- Split each page into two along the column gutter and extract the text from each half separately.
For the first option I'll suggest you first try pdftotext, but with the parameter -layout. (There are other tools, such as TET, the Text Extraction Toolkit from the PDFlib folks, which you can try if pdftotext isn't good enough.)
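The second option can also be tried with pdftotext alone. The following is only a minimal sketch, not a tested recipe for this particular file: it assumes a pdftotext build that supports the crop options -x, -y, -W and -H (coordinates at the default 72 dpi, i.e. in points), that every page has the size pdfinfo reports for the first page, and that the column gutter sits exactly at mid-page. The function names page_size, crop_args and extract_columns are hypothetical helpers made up for this illustration.

```shell
#!/bin/sh
# Sketch only: assumes pdftotext supports the crop options -x, -y, -W, -H
# and that the column gutter sits exactly at mid-page.

# Hypothetical helper: read `pdfinfo` output on stdin and print the page
# width and height truncated to whole points, e.g. "595 841".
page_size() {
  awk '/^Page size:/ { print int($3), int($5); exit }'
}

# Hypothetical helper: print the pdftotext crop options for one half of a
# page. Usage: crop_args WIDTH HEIGHT left|right
crop_args() {
  half=$(( $1 / 2 ))
  case $3 in
    left)  echo "-x 0 -y 0 -W $half -H $2" ;;
    right) echo "-x $half -y 0 -W $(( $1 - half )) -H $2" ;;
    *)     echo "usage: crop_args WIDTH HEIGHT left|right" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the left column of one page, then the right.
# Usage: extract_columns FILE.pdf PAGE
extract_columns() {
  size=$(pdfinfo "$1" | page_size)
  [ -n "$size" ] || return 1
  for side in left right; do
    # $size and the crop options are left unquoted on purpose so that
    # they split into separate words.
    pdftotext -f "$2" -l "$2" -layout $(crop_args $size "$side") "$1" -
  done
}
```

Calling extract_columns with the PDF file name and 8 as arguments would then print the left column of page 8 followed by the right column. If the gutter is not centered, the split position in crop_args has to be adjusted by hand.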
To follow the second option using Ghostscript and other tools, you may want to check out my answers to the following questions:
pdftotext -layout
You can try it with the command line tool pdftotext. You'll have to decide if it is "good enough" for your purpose.
The following command extracts the text from page 8 only (the first page with a dual-column layout) and prints it to <stdout>:
$ pdftotext -f 8 -l 8 -layout \
Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - \
| head -n 30
results in:
Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM
A
A /e/ noun a human blood type of the ABO abdominal distension /bdɒmn(ə)l ds
A abdominal distension
system, containing the A antigen (NOTE: Some- tenʃ(ə)n/ noun a condition in which the abdo-
one with type A can donate to people of the men is stretched because of gas or fluid
same group or of the AB group, and can receive abdominal pain /b dɒmn(ə)l pen/ noun
abdominal pain
blood from people with type A or type O.) pain in the abdomen caused by indigestion or
AA
AA abbr Alcoholics Anonymous more serious disorders
A & E /e ənd i
/, A & E department /e ənd abdominal viscera /bdɒmn(ə)l vsərə/
A & E abdominal viscera
i
d pɑ
tmənt/ noun same as accident and
plural noun the organs which are contained in
emergency department the abdomen, e.g. the stomach, liver and intes-
A & E medicine /e ənd i
med(ə)sn/
A & E medicine
tines
abdominal wall /b dɒmn(ə)l wɔ
l/ noun
abdominal wall
noun the medical procedures used in A & E de-
partments muscular tissue which surrounds the abdomen
abdomino- /bdɒmnəυ/ prefix referring to
abdomino-
Note the use of -layout! Without it, the extracted text would look like this:
Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM
A
A /e/ noun a human blood type of the ABO
system, containing the A antigen (NOTE: SomeA
one with type A can donate to people of the
same group or of the AB group, and can receive
blood from people with type A or type O.)
AA abbr Alcoholics Anonymous
A & E /e ənd i
/, A & E department /e ənd
i
d pɑ
tmənt/ noun same as accident and
emergency department
A & E medicine /e ənd i
med(ə)sn/
noun the medical procedures used in A & E deAA
A & E
A & E medicine
partments
AB /e bi
/ noun a human blood type of the
ABO system, containing the A and B antigens
AB
I noticed that page 8 of the file uses, but does not embed, the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.
This does not pose a problem for text extraction, since all these fonts use /WinAnsiEncoding.
However, there are other fonts, which are embedded as a subset. These fonts use a /Custom encoding, but not all of them provide a /ToUnicode table. This table is required for reliable text extraction (mapping the character codes used on the page back to Unicode characters).
What I said can be seen in this table:
$ pdffonts -f 8 -l 8 Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf
name                           type        encoding      emb sub uni object ID
------------------------------ ----------- ------------- --- --- --- ---------
Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
Courier                        Type 1      WinAnsi       no  no  no    1507  0
Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
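Picking out the risky fonts from such a listing can be automated. The following is a minimal sketch, assuming pdffonts prints the columns shown above, that every font has a numeric object ID (two trailing fields), and that font names contain no spaces. The function name fonts_without_tounicode is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: read `pdffonts` output on stdin and print the names
# of fonts that are embedded as a subset but have no /ToUnicode table.
# Columns are counted from the right because the "type" column can contain
# spaces (e.g. "Type 1C"):
#   $(NF-3) = sub, $(NF-2) = uni, $(NF-1) and $NF = object ID
fonts_without_tounicode() {
  awk 'NR > 2 && $(NF-3) == "yes" && $(NF-2) == "no" { print $1 }'
}

# Usage (file name is a placeholder):
#   pdffonts -f 8 -l 8 file.pdf | fonts_without_tounicode
```

Fed with the listing above, this prints only IGFBAL+EuropeanPi-Three, the one subset font whose uni column says no.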
It so happened that I recently hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for each font that is embedded as a subset. They can be found here, along with a README that explains them in more detail.