Extract the paragraphs from a PDF that contain a keyword using R

Question

I need to extract the paragraphs that contain a keyword from a PDF file. I have tried various pieces of code, but none of them returned anything useful. I found this code by @Tyler Rinker (Extract before and after lines based on keyword in Pdf using R programming), but it only extracts the line containing the keyword plus the lines immediately before and after it.

library(textreadr)
library(tidyverse)

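loc <- function(var, regex, n = 1, ignore.case = TRUE){
    # indices of the lines that match the regex
    locs <- grep(regex, var, ignore.case = ignore.case)
    # add the n lines before and after each match
    out <- sort(unique(as.vector(outer(locs, -n:n, "+"))))
    # drop indices that fall outside the document
    out <- out[out > 0]
    out[out <= length(var)]
}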

doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

However, I need to get the paragraphs and store each one in a row in my database. Could you help me?

I think the problem is that in the document, the paragraphs are not delimited by anything in particular. For what you want to do to work, you would have to be able to split the text on each page into paragraphs. This would work if, for example, every paragraph ended with a new-line tag `\n` and that tag was only used at the end of paragraphs. However, that's not the case here. — DaveArmstrong, Sep 16 '20 at 10:31
Yes, each sentence will end with a new line tag `\n`, but I don't know how to get the entire paragraph. Would you know how @DaveArmstrong? — David Perea, Sep 16 '20 at 11:19
I think that’s the problem. There is nothing consistent that separates one paragraph from another, so without some manual intervention I don’t think it is possible. Perhaps someone else will have a suggestion. — DaveArmstrong, Sep 16 '20 at 12:02

score 0 · Answer 1 · answered Sep 16 '20 at 13:53

Every line inside a paragraph is quite long, except the paragraph's final line. We can count the characters in each line and plot a histogram to show this:

library(textreadr)

doc <- read_pdf('https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf')

hist(nchar(doc$text), 20)
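
As a toy illustration of what the histogram is meant to reveal, here is a minimal sketch on a made-up `lines` vector (not taken from the PDF). The 75%-of-maximum cutoff is only an assumed rule of thumb for this example; for the real document the cutoff is read off the histogram by eye.

```r
# Hypothetical lines standing in for doc$text; not real PDF content
lines <- c(
  "Heading",
  "This is a full-width line of body text that runs close to the margin,",
  "and so is this one, which continues the same paragraph to the far edge",
  "but this one ends early."
)

# Character count per line
len <- nchar(lines)

# Assumed rule of thumb: lines well below the longest length are
# candidates for paragraph ends or headings
cutoff <- 0.75 * max(len)
short <- len < cutoff

data.frame(len, short)
```

Here the heading and the last line are flagged as short, while the two full-width lines are not.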

So any line shorter than about 75 characters is either not part of a paragraph or the last line of one. We can therefore append a line break to the short lines, paste all the lines together, and then split on the line breaks:


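# Mark every short line (under 75 characters) as a paragraph end
doc$text[nchar(doc$text) < 75] <- paste0(doc$text[nchar(doc$text) < 75], "\n")
# Join all lines into one string, then split it at the markers
txt <- paste(doc$text, collapse = " ")
txt <- strsplit(txt, "\n")[[1]]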
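
The same three steps (mark short lines, join, split) can be tried on a small made-up vector without downloading the PDF. This is only a sketch: `lines_to_paragraphs` and `min_len` are hypothetical names, not part of textreadr, and the extra `trimws()` step is an addition that removes the leading space the join leaves behind.

```r
# Hypothetical helper: rebuild paragraphs from a character vector of PDF lines.
# `min_len` is the cutoff below which a line is treated as ending a paragraph.
lines_to_paragraphs <- function(lines, min_len = 75) {
  short <- nchar(lines) < min_len
  # Mark each short line as a paragraph end
  lines[short] <- paste0(lines[short], "\n")
  # Join everything into one string, then split at the markers
  joined <- paste(lines, collapse = " ")
  paras <- strsplit(joined, "\n", fixed = TRUE)[[1]]
  # Trim the space left over from the join and drop empty pieces
  paras <- trimws(paras)
  paras[nzchar(paras)]
}

# Made-up lines; with min_len = 20 the second and fourth lines count as short
toy <- c(
  "First paragraph line one is long",
  "ends here.",
  "Second paragraph starts out long",
  "ends too."
)
lines_to_paragraphs(toy, min_len = 20)
```

The cutoff has to be chosen per document, since it depends on the page width and font of the PDF.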

Now we can run the regex to find the paragraphs that contain the keyword:

grep("cancer", txt, value = TRUE)
#> [1] " Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but stresses that, in order for them to work, they should be voluntary, and the government should exempt all life-saving drugs from import duties and other taxes such as excise duty and VAT. He is, however, critical about a proposal for mandatory price negotiation of newly patented drugs. He feels this will erode India's credibility in implementing the Patent Act in © 2006 KPMG International. KPMG International is a Swiss cooperative that serves as a coordinating entity for a network of independent firms operating under the KPMG name. KPMG International provides no services to clients. Each member firm of KPMG International is a legally distinct and separate entity and each describes itself as such. All rights reserved. Collaboration for Growth 24"                                                                                                   
#> [2] " a fair and transparent manner. To deal with diabetes, medicines are not the only answer; awareness about the need for lifestyle changes needs to be increased, he adds. While industry leaders have long called for the development of PPPs for the provision of health care in India, particularly in rural areas, such initiatives are currently totally unexplored. However, the government's 2006 draft National Pharmaceuticals Policy proposes the introduction of PPPs with drug manufacturers and hospitals as a way of vastly increasing the availability of medicines to treat life-threatening diseases. It notes, for example, that while an average estimate of the value of drugs to treat the country's cancer patients is $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the near non-accessibility of the medicines to a vast majority of the affected population, mainly because of the high cost of these medicines,” says the Policy, which also calls for tax and excise exemptions for anti-cancer drugs."
#> [3] " 50.1 percent of Aventis Pharma is held by European drug major Sanofi-Aventis and, in early April 2006, it was reported that UB Holdings had sold its 10 percent holding in the firm to Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective, anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1 million, with domestic sales up 9.1 percent at $129.8 million and exports increasing 12 percent to $51.2 million. Sales were led by 83 percent annual growth for the diabetes treatment Lantus (insulin glargine), followed by the rabies vaccine Rabipur (+22 percent), the diabetes drug Amaryl (glimepiride) and epilepsy treatment Frisium (clobazam), both up 18 percent, the angiotensin-coverting enzyme inhibitor Cardace (ramipril +15 percent), Clexane (enoxaparin), an anticoagulant, growing 14 percent and Targocid (teicoplanin), an antibiotic, whose sales advanced 8 percent."
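
To get one paragraph per row, as the question asks, the matches can be put in a data frame. The following is a minimal sketch on made-up paragraphs; `paragraphs_with_keyword` is a hypothetical helper, and it matches case-insensitively by default, whereas a plain `grep("cancer", ...)` would miss "Cancer" at the start of a sentence.

```r
# Hypothetical helper: keep the paragraphs that mention a keyword, one per row
paragraphs_with_keyword <- function(paras, keyword, ignore.case = TRUE) {
  # TRUE for every paragraph that matches the keyword (treated as a regex)
  hit <- grepl(keyword, paras, ignore.case = ignore.case)
  data.frame(
    paragraph_id = which(hit),   # position of the paragraph in the input
    text = paras[hit],
    stringsAsFactors = FALSE
  )
}

# Made-up paragraphs standing in for the `txt` vector
paras <- c(
  "Cancer drugs are costly.",
  "Exports rose 12 percent.",
  "Anti-cancer tax exemptions."
)
res <- paragraphs_with_keyword(paras, "cancer")
res
```

A data frame in this shape could then be written to a database table, for example with `DBI::dbWriteTable()`, assuming a DBI connection is available.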

Created on 2020-09-16 by the reprex package (v0.3.0)
