I want to extract information related to the keyword "cancer" from a list of PDFs using R.

Specifically, I want to extract each line or paragraph containing the word "cancer" in a text file, together with the lines before and after it.

abstracts <- lapply(mytxtfiles, function(i) {
    j <- paste0(scan(i, what = character()), collapse = " ")
    regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl=TRUE))
})

The regex above is not working.

  • `[cancer]` != `cancer`! The first is a character class, the latter a literal. – Jan Apr 14 '17 at 15:27
  • If you use `\R`, you must use `perl=TRUE`. – Wiktor Stribiżew Apr 14 '17 at 15:29
  • Replace all `[^\r\n]*` with `.*` and `[cancer][^\\r\\n]*` with `.*cancer.*`. See [`(?m)(^.*\R+){4}.*cancer.*(\R+.*){4}`](https://regex101.com/r/Hbr9ep/1). If there are not enough lines, replace `{4}` with `{0,4}`. – Wiktor Stribiżew Apr 14 '17 at 15:36
  • Thanks for the suggestion. But I am using this regex in R. I am using the code below, as suggested: `abstracts <- lapply(mytxtfiles, function(i) { j <- paste0(scan(i, what = character()), collapse = " "); regmatches(j, gregexpr("(?m)(^.*\\R+){2}.*cancer.*(\\R+.*){2}", j, perl=TRUE)) })`. But it's not giving me the desired output. – Santosh Kadge Apr 14 '17 at 16:45
  • Can you please show us some text from the PDF as input? It is very difficult to see what text you are trying to match. Make something reproducible: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example. Thanks – PKumar Apr 14 '17 at 17:38
  • Here is the link to the PDF I am trying to text mine: https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf – Santosh Kadge Apr 14 '17 at 19:23
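
The points raised in these comments can be combined in a small base-R sketch. It is only an illustration under stated assumptions: `lines` is a made-up character vector standing in for the lines of one text file, and `n` is a hypothetical parameter for the number of context lines. The lines are joined with `"\n"` rather than `" "`, since collapsing with a space removes the line boundaries the regex relies on, and `cancer` is matched as a literal rather than as a character class.

```r
# Hypothetical sample standing in for readLines("some_file.txt")
lines <- c("line one", "line two", "a note on cancer drugs",
           "line four", "line five")

# Join with newlines so the regex can still see line boundaries.
# (Collapsing with " " would discard them.)
txt <- paste(lines, collapse = "\n")

# Up to n lines before and after a line containing the literal "cancer".
# {0,n} instead of {n} lets matches near the start or end of the text succeed.
n <- 1
pattern <- sprintf("(?m)(^.*\\R){0,%d}^.*cancer.*$(\\R.*){0,%d}", n, n)

# \R and (?m) require perl = TRUE
hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
hits
```

With this sample, `hits` should hold a single string made of the matching line and one neighbour on each side. Overlapping contexts are not returned twice, because `gregexpr` continues searching after the end of the previous match.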

1 Answer

Here's one approach:

library(textreadr)
library(tidyverse)

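loc <- function(var, regex, n = 1, ignore.case = TRUE){
    # positions of the elements that match the regex
    locs <- grep(regex, var, ignore.case = ignore.case)
    # add the n elements before and after each match
    out <- sort(unique(as.vector(outer(locs, (-n):n, `+`))))
    # drop positions that fall outside the vector
    out[out > 0 & out <= length(var)]
}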

doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

doc

##    page_id element_id                                                                                                                  text
## 1       24         28                              Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2       24         29                              partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3       24         30                                stresses that, in order for them to work, they should be voluntary, and the government
## 4       25          8                         the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5       25          9                             while an average estimate of the value of drugs to treat the country's cancer patients is
## 6       25         10                             $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7       25         12                           because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8       25         13                                                                              excise exemptions for anti-cancer drugs.
## 9       25         14                       Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10      32         19                              Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11      32         20                               anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12      32         21                             December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
Tyler Rinker
  • Thanks for giving me a different approach. Can we do this for multiple PDFs stored in a particular location? Also, using this I am able to extract the lines containing the word "cancer" but not the lines before and after. How can I extract the lines before and after the line containing the word "cancer"? – Santosh Kadge Apr 27 '17 at 17:59
  • Yes, you can do this for multiple PDFs. See the `read_dir` function. I have shown the lines before and after above, so I don't know what you mean by lines before and after. For example, line 29 has the word cancer. I include lines 28 and 30 as well. – Tyler Rinker Apr 28 '17 at 18:39
  • Can we separate the lines by full stop? I am treating one line as one complete sentence ending with a full stop. – Santosh Kadge Apr 28 '17 at 19:37
  • Yes, but it requires collapsing the entire document and splitting it apart by sentence. This means headers and such will be treated as part of the next sentence. – Tyler Rinker Apr 29 '17 at 00:32
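
The sentence-based variant described in the last two comments might look roughly like the sketch below. It uses base R only and rests on several assumptions: `lines` is a made-up vector standing in for the `text` column of one document, `context_sentences` is a hypothetical helper (not part of textreadr), and the sentence splitter is deliberately naive. It splits after `.`, `!` or `?` followed by whitespace, so abbreviations such as "e.g." would be split too, and headers are merged into the following sentence, as noted above.

```r
# Hypothetical sample standing in for the `text` column of one document
lines <- c("Drug prices are rising. Spending on",
           "cancer care is growing fast. Policy",
           "makers are responding. Other areas lag.")

# Collapse the whole document, then split after sentence-ending punctuation.
doc_text  <- paste(lines, collapse = " ")
sentences <- strsplit(doc_text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]

# Return the matching sentences plus n neighbours on each side
context_sentences <- function(x, regex, n = 1) {
  hits <- grep(regex, x, ignore.case = TRUE)
  idx  <- sort(unique(as.vector(outer(hits, (-n):n, `+`))))
  x[idx[idx >= 1 & idx <= length(x)]]
}

result <- context_sentences(sentences, "cancer")
result
```

For several documents, the same helper could be applied with `lapply` over a list of sentence vectors, one per file.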