5

In Unix or Windows, I want to convert this dictionary to a Python dictionary. I copied the contents of the PDF dictionary and put them in a .rtf file, intending to read them with Python. However, it gives something like:

A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA
abdominal distension /bdɒmn(ə)l ds tenʃ(ə)n/ noun a condition in which the abdo-
men is stretched because of gas or fluid
A
abdominal distension
AA abbr Alcoholics Anonymous

It has essentially squashed the columns from the PDF into a strange mishmash. How do I convert a PDF to text so that the columns are respected? In other words, the desired output is:

A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: Some- one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.)
AA abbr Alcoholics Anonymous

...and so on

Kurt Pfeifle
Sam Weisenthal
  • I can probably help, but I'm unable to download your linked PDF. Download aborts after 3-5 MByte each time (the complete file seems to be around 14 MByte). Can you provide a smaller sample PDF that is only 1 page, please? – Kurt Pfeifle Mar 29 '15 at 23:47

4 Answers

7

You have basically two options to get to the text:

  1. Direct text extraction from each page as-is.
  2. Split each page in two along the gap between the columns and extract the text from each half separately.

For the first option I'll suggest you first try pdftotext, but with the parameter -layout. (There are other tools, such as TET, the Text Extraction Toolkit from the PDFlib folks, which you can try if pdftotext isn't good enough.)

To follow the second option using Ghostscript and other tools, you may want to check out my answers to the following questions:
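
Separately from those answers, the second option can also be sketched without Ghostscript, using the crop flags of pdftotext itself (-x, -y, -W, -H). The Python sketch below is only an illustration under assumptions, not the method from the answers mentioned above: the function names are made up, the page size of 612 x 792 points is a guess that should be checked with pdfinfo, and the gap between the columns is assumed to sit at the horizontal middle of the page. As far as I know, the crop values are interpreted at the default resolution of 72 DPI, so they coincide with PDF points.

```python
import subprocess


def crop_command(pdf_path, page, x, y, width, height):
    """Build a pdftotext call that extracts one rectangular region of one page."""
    return [
        "pdftotext",
        "-f", str(page), "-l", str(page),     # restrict to a single page
        "-x", str(x), "-y", str(y),           # top-left corner of the crop area
        "-W", str(width), "-H", str(height),  # size of the crop area
        "-layout",
        pdf_path,
        "-",                                  # write the text to stdout
    ]


def extract_columns(pdf_path, page, page_width=612, page_height=792):
    """Return (left_text, right_text) for a two-column page.

    page_width and page_height are assumed values (US Letter, in points).
    The split at page_width // 2 is also an assumption about the layout.
    """
    half = page_width // 2
    texts = []
    for x in (0, half):
        cmd = crop_command(pdf_path, page, x, 0, half, page_height)
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        texts.append(done.stdout)
    return tuple(texts)
```

Calling extract_columns("some.pdf", 8) would run pdftotext twice, once per half page. Running headers that span both columns would have to be cropped away by starting the region lower on the page.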


pdftotext -layout

You can try it with the command line tool pdftotext. You'll have to decide if it is "good enough" for your purpose.

The following command extracts the text from page 8 only (first page with dual column layout) and prints it to <stdout>:

$ pdftotext -f 8 -l 8 -layout                                         \
           Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - \
 | head -n 30

results in:

Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM

                                                          A
 A /e/ noun a human blood type of the ABO                abdominal distension /bdɒmn(ə)l ds
 A                                                        abdominal distension
 system, containing the A antigen (NOTE: Some-              tenʃ(ə)n/ noun a condition in which the abdo-
 one with type A can donate to people of the              men is stretched because of gas or fluid
 same group or of the AB group, and can receive           abdominal pain /b dɒmn(ə)l pen/ noun
                                                          abdominal pain
 blood from people with type A or type O.)                pain in the abdomen caused by indigestion or
 AA
 AA abbr Alcoholics Anonymous                             more serious disorders
 A & E /e ənd  i
                     /, A & E department /e ənd           abdominal viscera /bdɒmn(ə)l    vsərə/
 A & E                                                    abdominal viscera
    i
      d pɑ
           tmənt/ noun same as accident and
                                                          plural noun the organs which are contained in
 emergency department                                     the abdomen, e.g. the stomach, liver and intes-
 A & E medicine /e ənd     i
                              med(ə)sn/
 A & E medicine
                                                          tines
                                                          abdominal wall /b dɒmn(ə)l wɔ
                                                                                        l/ noun
                                                          abdominal wall
 noun the medical procedures used in A & E de-                                                            
 partments                                                muscular tissue which surrounds the abdomen
                                                          abdomino- /bdɒmnəυ/ prefix referring to
                                                          abdomino-

Note the use of -layout! Without it, the extracted text would look like this:

Medicine.fm Page 1 Thursday, November 20, 2003 4:26 PM A A /e/ noun a human blood type of the ABO system, containing the A antigen (NOTE: SomeA

one with type A can donate to people of the same group or of the AB group, and can receive blood from people with type A or type O.) AA abbr Alcoholics Anonymous A & E /e ənd i /, A & E department /e ənd i d pɑ tmənt/ noun same as accident and emergency department A & E medicine /e ənd i med(ə)sn/ noun the medical procedures used in A & E deAA

A & E A & E medicine partments AB /e bi / noun a human blood type of the ABO system, containing the A and B antigens AB
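
Because -layout keeps the two columns side by side, the text still has to be separated afterwards. The following Python sketch shows one possible way, under the assumption that the right column starts at the same character offset on most lines. The names guess_gutter and split_columns are hypothetical helpers, not part of any tool mentioned here.

```python
import re
from collections import Counter


def guess_gutter(lines):
    """Guess the character offset at which the right column starts.

    Heuristic: take the most common offset at which text resumes after a
    run of two or more spaces. Returns None if no such run exists.
    """
    starts = Counter(
        match.end()
        for line in lines
        for match in re.finditer(r" {2,}(?=\S)", line)
    )
    return starts.most_common(1)[0][0] if starts else None


def split_columns(layout_text, gutter):
    """Cut every line at `gutter` and return (left_text, right_text)."""
    left, right = [], []
    for line in layout_text.splitlines():
        left_part = line[:gutter].strip()
        right_part = line[gutter:].strip()
        if left_part:
            left.append(left_part)
        if right_part:
            right.append(right_part)
    return "\n".join(left), "\n".join(right)
```

This is a rough heuristic: lines where the left column runs past the guessed offset (as some of the phonetic transcriptions in the output above do) will be cut in the wrong place and need manual cleanup.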

I noticed that page 8 of the file uses, but does not embed, the fonts Courier, Helvetica, Helvetica-Bold, Times-Roman and Times-Italic.

This does not pose a problem for text extraction, since all these fonts use /WinAnsiEncoding.

However, there are other fonts, which are embedded as subsets. These fonts use a /Custom encoding, and not all of them provide a /ToUnicode table (see the uni column in the pdffonts output that follows). This table is required for reliable text extraction, because it maps the font's character codes back to Unicode characters.

The pdffonts output for page 8 shows this:

$ pdffonts -f 8 -l 8 Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf 
 name                           type        encoding      emb sub uni object ID
 ------------------------------ ----------- ------------- --- --- --- ---------
 Helvetica-Bold                 Type 1      WinAnsi       no  no  no    1505  0
 Courier                        Type 1      WinAnsi       no  no  no    1507  0
 Helvetica                      Type 1      WinAnsi       no  no  no    1497  0
 MOEKLA+Times-PhoneticIPA       Type 1C     Custom        yes yes yes   1509  0
 Times-Roman                    Type 1      WinAnsi       no  no  no    1506  0
 Times-Italic                   Type 1      WinAnsi       no  no  no    1499  0
 IGFBAL+EuropeanPi-Three        Type 1C     Custom        yes yes no    1502  0
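
To check many pages or files for this problem, the pdffonts output can be parsed. The Python sketch below is a minimal illustration, assuming the column order shown above (name, type, encoding, emb, sub, uni, object ID) and font names without spaces. The function names are hypothetical.

```python
def parse_pdffonts(output):
    """Parse the text printed by pdffonts into a list of dicts."""
    fonts = []
    in_table = False
    for line in output.splitlines():
        if not in_table:
            # Everything up to and including the dashed rule is header.
            in_table = line.strip().startswith("---")
            continue
        # Split from the right: encoding, emb, sub, uni, object number, generation.
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        head, encoding, emb, sub, uni, _obj, _gen = parts
        name_and_type = head.split(None, 1)
        if len(name_and_type) != 2:
            continue
        fonts.append({
            "name": name_and_type[0],
            "type": name_and_type[1],
            "encoding": encoding,
            "embedded": emb == "yes",
            "subset": sub == "yes",
            "uni": uni == "yes",
        })
    return fonts


def fonts_missing_tounicode(fonts):
    """Names of fonts with a Custom encoding but no /ToUnicode table."""
    return [f["name"] for f in fonts if f["encoding"] == "Custom" and not f["uni"]]
```

For the table above, this would flag only IGFBAL+EuropeanPi-Three, the one Custom-encoded font whose uni column says "no".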

It so happens that I recently hand-coded 5 different PDF files, with commented source code, for a new GitHub project. These 5 files demonstrate the importance of a correct /ToUnicode table for each font that is embedded as a subset. They can be found here, along with a README that explains some more detail.

Kurt Pfeifle
  • Running the following command in my terminal gives me an error saying that the pdftotext command doesn't exist: $ pdftotext -f 8 -l 8 -layout Dictionary+of+Medical+Terms+4th+Ed.-+\(Malestrom\).pdf - | head -n 30. This is odd, as I have installed pdftotext, and running it from within a Python script works fine. I also have PDF files with two columns (such as scientific articles) and want to read them all as one column. Could you help me with this, please? Thank you in advance. – HR123r Feb 26 '21 at 00:00
  • @HR123r: You cannot read 2-col PDF documents "all as one column" with *pdftotext*. You'll get the text in two cols as well, and you'll have to massage it with other text tools to separate the columns again (for example *cut*). – Kurt Pfeifle Feb 28 '21 at 17:11
2

You can use pdfminer to extract text from a PDF: http://www.unixuser.org/~euske/python/pdfminer/
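
As a rough illustration of what that could look like, here is a Python sketch. It assumes the maintained pdfminer.six fork (installed with pip install pdfminer.six) rather than the original package linked above, and the helper names are made up. pdfminer runs its own layout analysis, which in many cases groups text into boxes and may therefore keep columns apart; I have not verified how well that works on this particular PDF.

```python
def zero_based_pages(first, last):
    """Convert a 1-based inclusive page range to zero-based page numbers."""
    return list(range(first - 1, last))


def extract_pages(pdf_path, first, last):
    """Extract the text of pages first..last (1-based, inclusive)."""
    # Imported here so that the helper above works without pdfminer.six.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(
        pdf_path,
        page_numbers=zero_based_pages(first, last),  # pdfminer counts pages from 0
        laparams=LAParams(),                         # default layout analysis
    )
```

For example, extract_pages("some.pdf", 8, 8) would return the text of page 8 as a single string.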

Scott Hunter
1

PDF documents have very little notion of document structure. A PDF content stream includes instructions for placing glyphs on a page, but the order of placement does not have to correspond to the document structure.

You do not state what platform you are using. If you are using OS X, you may be able to use PDFKit to achieve what you want.

0
I have solved this issue with R. It may have small bugs, which you can correct to suit your needs.

library(pdftools)  # provides pdf_text()

# Lengths of all runs of spaces that sit between two non-space characters
# (-1 if the line contains no such run).
countWhiteSpaces <- function(x) {
  attr(gregexpr("(?<=[^ ])[ ]+(?=[^ ])", x, perl = TRUE)[[1]], "match.length")
}

# Guess the column count as 1 + the most common number of wide gaps
# (two or more spaces) per line, and return it together with the lines.
getColumnCount <- function(path) {
  pages <- pdf_text(path)                # one string per page
  res <- unlist(strsplit(pages, "\n"))   # one string per text line

  gaps <- integer(length(res))
  for (i in seq_along(res)) {
    gaps[i] <- sum(countWhiteSpaces(res[i]) > 1)
  }
  mostCommon <- as.integer(names(sort(table(gaps), decreasing = TRUE)[1]))
  list(colsInPdf = 1 + mostCommon, lines = res)
}

result <- getColumnCount("pathToPdfFile.pdf")
lines <- result$lines
colsInPdf <- result$colsInPdf

# Keep only the lines that split into exactly colsInPdf cells;
# lines with text in fewer columns are dropped.
rows <- list()
for (line in lines) {
  cells <- strsplit(line, "\\s{2,}")[[1]]
  # Drop the empty cell produced by leading whitespace
  if (length(cells) > 0 && cells[1] == "") {
    cells <- cells[-1]
  }
  if (length(cells) == colsInPdf) {
    rows[[length(rows) + 1]] <- cells
  }
}

# One row per kept line, one column per PDF column. Reading the matrix
# column by column yields the whole first column, then the next one.
df <- do.call(rbind, rows)
text <- paste(as.vector(df), collapse = " ")

Clean_String <- function(string) {
  # Lowercase
  temp <- tolower(string)
  # Replace everything that is not a letter or whitespace with a space
  # (you may want to keep more in your actual analyses)
  temp <- stringr::str_replace_all(temp, "[^a-zA-Z\\s]", " ")
  # Collapse runs of whitespace into a single space
  temp <- stringr::str_replace_all(temp, "[\\s]+", " ")
  # Split into words
  temp <- stringr::str_split(temp, " ")[[1]]
  # Drop empty strings left by leading or trailing spaces
  temp[temp != ""]
}

toString(Clean_String(text))
pyBug