
I am using R to extract text. The code below works well for extracting the non-bold text from a PDF, but it ignores the bold parts. Is there a way to extract both bold and non-bold text?

library(pdftools)
library(tesseract)
library(tiff)

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'
info <- pdf_info(news)
numberOfPageInPdf <- as.numeric(info[2])
numberOfPageInPdf

for (i in 1:numberOfPageInPdf) {
  bitmap <- pdf_render_page(news, page = i, dpi = 300, numeric = TRUE)
  file_name <- paste0("page", i, ".tiff")
  file_tiff <- tiff::writeTIFF(bitmap, file_name)
  out <- ocr(file_name)
  file_txt <- paste0("text", i, ".txt")
  writeLines(out, file_txt)
}
emeryville
  • Try this updated answer: https://stackoverflow.com/questions/53398611/how-to-extract-bold-text-from-a-pdf-using-r/67963468#67963468 – venrey Jun 14 '21 at 00:13

2 Answers


I like using the tabulizer library for this. Here's a small example:

devtools::install_github("ropensci/tabulizer")
library(tabulizer)

news <-'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# note that you need to specify UTF-8 as the encoding, otherwise your special characters
# won't come in correctly

page1 <- extract_tables(news, guess=TRUE, page = 1, encoding='UTF-8')

page1[[1]]

      [,1] [,2]                    [,3]       [,4]                [,5]    [,6]                [,7]      
 [1,] ""   "Division: 1"           ""         ""                  ""      ""                  "Série: A"
 [2,] ""   "514"                   ""         "Fontaine 1 KBSK 1" ""      ""                  "303"     
 [3,] "1"  "62529 WIRIG ANTHONY"   ""         "2501 1⁄2-1⁄2"      "51560" "CZEBE ATTILLA"     "2439"    
 [4,] "2"  "62359 BRUNNER NICOLAS" ""         "2443 0-1"          "51861" "PICEU TOM"         "2401"    
 [5,] "3"  "75655 CEKRO EKREM"     ""         "2393 0-1"          "10391" "GEIRNAERT STEVEN"  "2400"    
 [6,] "4"  "50211 MARECHAL ANDY"   ""         "2355 0-1"          "35181" "LEENHOUTS KOEN"    "2388"    
 [7,] "5"  "73059 CLAESEN PIETER"  ""         "2327 1⁄2-1⁄2"      "25615" "DECOSTER FREDERIC" "2373"    
 [8,] "6"  "63614 HOURIEZ CLEMENT" ""         "2304 1⁄2-1⁄2"      "44954" "MAENHOUT THIBAUT"  "2372"    
 [9,] "7"  "60369 CAPONE NICOLA"   ""         "2283 1⁄2-1⁄2"      "10430" "VERLINDE TIEME"    "2271"    
[10,] "8"  "70653 LE QUANG KIM"    ""         "2282 0-1"          "44636" "GRYSON WOUTER"     "2269"    
[11,] ""   ""                      "< 2361 >" "12 - 20"           ""      "< 2364 >"          ""      

You can also use the locate_areas function to specify a specific region if you only care about some of the tables. Note that for locate_areas to work, I had to download the file locally first; using the URL returned an error.
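A minimal sketch of that download-then-locate workflow (the temp-file path is just an example, and `locate_areas` opens an interactive widget, so this only works in an interactive session):

```r
library(tabulizer)

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# locate_areas() needs a local file, so download it first
local_pdf <- file.path(tempdir(), "ind01.pdf")
download.file(news, local_pdf, mode = "wb")

# Drag a rectangle around the table you want; the selection is
# returned as c(top, left, bottom, right)
area <- locate_areas(local_pdf, pages = 1)

# Feed that area back into extract_tables()
tables <- extract_tables(local_pdf, pages = 1, area = area,
                         guess = FALSE, encoding = 'UTF-8')
```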

You'll note that each table is its own element in the returned list.

Here's an example using a custom region to just select the first table on each page:

customArea <- extract_tables(news, guess = FALSE, page = 1,
                             area = list(c(84, 27, 232, 569)), encoding = 'UTF-8')

This is also a more direct method than using the OCR (Optical Character Recognition) library tesseract, because you're not relying on OCR to translate a pixel arrangement back into text. In a digital PDF, each text element has an x and y position, and the tabulizer library uses that information to detect table heuristics and extract sensibly formatted data. You'll see you still have some cleanup to do, but it's pretty manageable.
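If you want to see those x/y positions directly, the pdftools package (version 2.0 or later, if I remember correctly) exposes them via `pdf_data()` — a quick sketch:

```r
library(pdftools)

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# One data frame per page; each row is a word with its position on the page
words <- pdf_data(news)[[1]]
head(words)
# columns: width, height, x, y, space, text
```

Sorting or grouping on `x` and `y` is essentially what tabulizer automates when it reconstructs the table layout.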

Edit: just for fun, here's a little example of starting the cleanup with data.table

library(data.table)

cleanUp <- setDT(as.data.frame(page1[[1]]))

cleanUp[ ,  `:=` (Division = as.numeric(gsub("^.*(\\d{1,2}).*", "\\1", grep('Division', cleanUp$V2, value=TRUE))),
  Series = as.character(gsub(".*:\\s(\\w).*","\\1", grep('Série:', cleanUp$V7, value=TRUE))))
  ][,ID := tstrsplit(V2," ", fixed=TRUE, keep = 1)
  ][, c("V1", "V3") := NULL
  ][-grep('Division', V2, fixed=TRUE)]

Here we've moved Division, Series, and ID into their own columns, and removed the Division header row. This is just the general idea, and would need a little refinement to apply to all 27 pages.

                       V2                V4    V5                V6   V7 Division Series    ID
 1:                   514 Fontaine 1 KBSK 1                          303        1      A   514
 2:   62529 WIRIG ANTHONY      2501 1/2-1/2 51560     CZEBE ATTILLA 2439        1      A 62529
 3: 62359 BRUNNER NICOLAS          2443 0-1 51861         PICEU TOM 2401        1      A 62359
 4:     75655 CEKRO EKREM          2393 0-1 10391  GEIRNAERT STEVEN 2400        1      A 75655
 5:   50211 MARECHAL ANDY          2355 0-1 35181    LEENHOUTS KOEN 2388        1      A 50211
 6:  73059 CLAESEN PIETER      2327 1/2-1/2 25615 DECOSTER FREDERIC 2373        1      A 73059
 7: 63614 HOURIEZ CLEMENT      2304 1/2-1/2 44954  MAENHOUT THIBAUT 2372        1      A 63614
 8:   60369 CAPONE NICOLA      2283 1/2-1/2 10430    VERLINDE TIEME 2271        1      A 60369
 9:    70653 LE QUANG KIM          2282 0-1 44636     GRYSON WOUTER 2269        1      A 70653
10:                                 12 - 20                < 2364 >             1      A    NA
Mako212
  • I had trouble installing the package tabulizer. Here is the solution: `library(devtools)` `devtools::install_github("ropensci/tabulizer")` – emeryville Mar 28 '18 at 20:59
  • @emeryville thanks, I forgot that it's not available on CRAN. Updated my answer to reflect that – Mako212 Mar 28 '18 at 21:21
  • I had some hope but unfortunately my solution to install tabulizer didn't work, this one [either](https://stackoverflow.com/questions/39132202/trouble-installing-tabulizer-package?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa) – emeryville Mar 28 '18 at 21:51
  • @emeryville you might also need to install Java, see the Installation section here: https://github.com/ropensci/tabulizer – Mako212 Mar 28 '18 at 22:29

There is no need to go through the PDF -> TIFF -> OCR loop, since pdftools::pdf_text() can read this file directly:

stringi::stri_split(pdf_text(news), regex = "\n")
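For the bold-text part of the original question: if I recall correctly, newer pdftools versions (2.3 and later) can report per-word font metadata through the `font_info` argument of `pdf_data()`, which lets you filter on the font name — a hedged sketch:

```r
library(pdftools)

news <- 'http://www.frbe-kbsb.be/sites/manager/ICN/14-15/ind01.pdf'

# font_info = TRUE adds font_name and font_size columns (pdftools >= 2.3)
words <- pdf_data(news, font_info = TRUE)[[1]]

# Bold text is usually set in a font whose name contains "Bold"
bold_words <- words$text[grepl("Bold", words$font_name)]
```

Whether this works depends on the PDF embedding meaningful font names, so inspect `unique(words$font_name)` first.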
Ralf Stubner
  • I still can't find an elegant way to extract bold text from PDF files. I don't think we need to complicate it with tabulizer & locate_areas etc. – Lazarus Thurston Nov 15 '18 at 14:54
  • @SanjayMehrotra I don't understand the context of your comment. Feel free to post a new question if the solutions provided here do not fit your problem. – Ralf Stubner Nov 15 '18 at 15:02
  • Question: How to identify bold text in a PDF file using R, but not using the tabulizer package. Reason: I feel tabulizer is overkill if there are no tables in the file. For just plain text paragraphs we need not identify the area of the text. Is there any other package, or is there a straightforward way in tabulizer to extract bold characters? Thanks. – Lazarus Thurston Nov 20 '18 at 04:49
  • @SanjayMehrotra From my experience `tabulizer` is great for extracting tabular material, even without specifying locate areas. However, if you are only interested in plain text paragraphs, `pdftools::pdf_text()` as used in my answer is probably sufficient. It does not differentiate font weight (bold, normal, light, ...) in any way, though. – Ralf Stubner Nov 20 '18 at 07:14
  • I needed bold text identification. Will try tabulizer if there's no other way. Last time I remember wasting a lot of time setting up the dependencies of tabulizer. Is it a smooth install on a Mac? – Lazarus Thurston Nov 20 '18 at 10:01
  • @SanjayMehrotra `tabulizer` requires Java, which seems to be a challenge on Mac OS. I have not experienced any such problems on Linux. However, I very much doubt that `tabulizer` will be able to distinguish different font weights within a plain text paragraph. – Ralf Stubner Nov 20 '18 at 10:06
  • And then this means I will raise a fresh question on SO asking for a non-tabulizer solution to this bold detection. Are you OK with that @Ralf? – Lazarus Thurston Nov 20 '18 at 11:02
  • @SanjayMehrotra Asking a new question was my initial suggestion. But please keep in mind that questions asking for a tool / library are considered off-topic on SO. – Ralf Stubner Nov 20 '18 at 12:35