I want to highlight some text in a PDF document using R: search the document for the text and highlight every place where it is found. I looked for packages that could do this.

pdftools and pdfsearch are packages that help with handling PDF documents. They mainly convert a PDF to text and then search or manipulate that text; I did not find a highlighting function in either.
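For reference, the search half of the task can be sketched in base R once the text has been extracted. This is a minimal sketch, assuming the page text comes from `pdftools::pdf_text()`, which returns one string per page; a hard-coded vector stands in for it here so the snippet runs without a PDF. `find_keyword_pages` is a hypothetical helper, not part of either package.

```r
# Stand-in for pdftools::pdf_text("file.pdf"): one string per page.
pages <- c("The cat sat on the mat.",
           "Dogs are domestic animals too.",
           "A quick cat is a happy cat.")

# Hypothetical helper: for each keyword, return the page numbers on which
# it occurs (literal, case-insensitive match).
find_keyword_pages <- function(pages, keywords) {
  lower_pages <- tolower(pages)
  hits <- lapply(keywords, function(k) {
    which(grepl(tolower(k), lower_pages, fixed = TRUE))
  })
  names(hits) <- keywords
  hits
}

hits <- find_keyword_pages(pages, c("cat", "domestic", "quick"))
print(hits)
```

This only tells me where the text is; it does not change the PDF itself, which is the part I am missing.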

Is there a way to highlight text in a PDF document using R?

SBista
  • I don't think so. – lukeA Mar 08 '17 at 10:17
  • Maybe this helps: https://stackoverflow.com/questions/40288400/highlight-text-in-a-pdf-with-python – Martin Valgur Mar 08 '17 at 10:29
  • Thanks @MartinValgur. I'll see if there are better alternatives. Otherwise, calling Python code from R might solve my problem. – SBista Mar 08 '17 at 10:33
  • I ended up writing my own package to do this. If anyone needs this functionality, it's available on my [github page](https://github.com/Swechhya/pdfUtils). – SBista Oct 15 '17 at 14:54

1 Answer

I was able to highlight some keywords in a PDF with the following code. It drives Microsoft Word through COM, so it needs Windows with Word installed. There are four steps:

  1. Save a Wikipedia page as a PDF;

  2. Convert the PDF to a Word document using Microsoft Word (it has built-in OCR);

  3. Highlight the keywords in the Word document;

  4. Save the Word document as a PDF.

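library(RDCOMClient) # COM automation of Word (Windows only)
library(DescTools)   # provides the wdConst list of Word constants
library(pagedown)    # provides chrome_print()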

#############################################
#### Step 1 : Save wikipedia page as PDF ####
#############################################
# Define the paths once and reuse them below
path_PDF <- "C:\\Text_PDF_Cat.pdf"
path_Word <- "C:\\Text_PDF_Cat.docx"

chrome_print(input = "https://en.wikipedia.org/wiki/Cat", 
             output = path_PDF)

################################################################
#### Step 2 : Convert PDF to word document with OCR of Word ####
################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

doc$SaveAs2(path_Word)
doc_Selection <- wordApp$Selection()

######################################################
#### Step 3 : Highlight keywords in word document ####
######################################################
move_To_Beginning_Doc <- function(doc_Selection)
{
  doc_Selection$HomeKey(Unit = wdConst$wdStory) # Need DescTools for wdConst$wdStory
}

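# Highlight the occurrences of each keyword (plain-text, case-insensitive match),
# up to nb_Max_Word occurrences per keyword.
highlight_Text_Regex_Word <- function(doc,
                                      doc_Selection,
                                      words_To_Highlight, 
                                      colorIndex = 7, # 7 = wdYellow
                                      nb_Max_Word = 100)
{
  for(i in words_To_Highlight)
  {
    move_To_Beginning_Doc(doc_Selection)
    
    for(j in 1 : nb_Max_Word)
    {
      # Wrap = 0 (wdFindStop) stops the search at the end of the document
      found <- doc_Selection$Find()$Execute(FindText = i, MatchCase = FALSE, Wrap = 0)
      
      # Without this check, a failed search would highlight whatever is currently selected
      if(!isTRUE(found)) break
      
      doc_Selection_Range <- doc_Selection$Range()
      doc_Selection_Range[["HighlightColorIndex"]] <- colorIndex
    }
  }
}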

highlight_Text_Regex_Word(doc, doc_Selection,
                          words_To_Highlight = c("cat", "domestic", "quick"), 
                          colorIndex = 7, nb_Max_Word = 100)
  
###############################################
#### Step 4 : Convert word document to pdf ####
###############################################
path_PDF_Highlighted <- "C:\\Text_PDF_Cat_Highlighted.pdf"
wordApp[["ActiveDocument"]]$SaveAs(path_PDF_Highlighted, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp
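The `nb_Max_Word` argument caps how many matches per keyword get highlighted, and 100 is only a guess. One way to pick a value is to count the matches in the extracted text first. The following is a minimal sketch, assuming the document text is available as a character vector (for example from `pdftools::pdf_text()`); `count_occurrences` is a hypothetical helper, and Word's own search may not count exactly the same matches as the extracted text.

```r
# Hypothetical helper: count literal, case-insensitive occurrences of each
# keyword in a character vector of text.
count_occurrences <- function(text, keywords) {
  text <- tolower(paste(text, collapse = " "))
  vapply(keywords, function(k) {
    m <- gregexpr(tolower(k), text, fixed = TRUE)[[1]]
    if (m[1] == -1L) 0L else length(m) # gregexpr returns -1 when nothing matches
  }, integer(1))
}

sample_text <- "The cat chased another cat. Domestic cats nap."
counts <- count_occurrences(sample_text, c("cat", "domestic", "quick"))
print(counts)

# A cap that covers the most frequent keyword
nb_Max_Word <- max(counts)
```

The resulting `nb_Max_Word` can then be passed to `highlight_Text_Regex_Word()` instead of the fixed 100.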

Emmanuel Hamel