I have multiple PDF files saved in a folder. I need to extract the first date in a format like "November 19 2020" from each file and collect the results in a data frame.

Here is the code I am using:

myextr2 <- function(pdffile) {
  text_data <- pdf_text(pdffile)
  text_collapsed_data <- paste0(text_data, collapse = '\n')
  g=stringi::stri_extract( text_collapsed_data, regex = ("(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)/s+/d{1,2}/s+/d{4}")
  g[1]
}
files <- list.files(pattern = "pdf$")
pricing = sapply(files, myextr2)
pricing

I am getting the following error:

Error: unexpected '}' in "}"

Need help on this.

colourCoder
  • Hi, I was missing a closing parenthesis at the end of the regex, but the code is still unable to pick up any date. – Manish Mukherjee Sep 06 '20 at 03:52
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Maybe see also this: https://stackoverflow.com/questions/51035220/why-is-it-so-hard-to-convert-pdf-to-plain-text or https://filingdb.com/b/pdf-text-extraction – MrFlick Sep 06 '20 at 03:55
  • It's `\s` and `\d`, not `/s` and `/d`. (I don't know R though, so the backslash might need to be escaped). Also, "Novemeber" is misspelled in the example (outside the code). Other than that, the pattern itself [works fine](https://regex101.com/r/Wtpdyg/1). It just has too many unnecessary capturing groups. – 41686d6564 stands w. Palestine Sep 06 '20 at 03:57

2 Answers

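You can use the stringr package to extract the date and the lubridate package to parse it. Inside an R string the regex backslashes must be doubled (`\\s`, `\\d`).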

library(lubridate)

d = "November 19 2020"
mdy(d)
# [1] "2020-11-19"
library(stringr)

str_extract(d, "(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2}\\s+\\d{4}")
# [1] "November 19 2020"
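
Putting the two steps together, here is a minimal sketch of how the extraction and parsing could feed a data frame. The `texts` vector is a hypothetical stand-in for the collapsed output of `pdf_text()` on each file, `date_pattern` and the column names are names chosen for illustration, and the optional comma after the day is an assumption about how the dates may be written. `mdy()` is assumed to run under an English locale so that the month names are recognised.

```r
library(stringr)
library(lubridate)

# Month name (full or abbreviated), day, optional comma, four-digit year.
# (?:...) groups do not capture; backslashes are doubled inside an R string.
date_pattern <- paste0(
  "\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|",
  "Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)",
  "\\s+\\d{1,2},?\\s+\\d{4}"
)

# Hypothetical sample text, one element per PDF file
texts <- c(
  a.pdf = "Pricing sheet\nIssued November 19 2020\nRevised December 1 2020",
  b.pdf = "Memo dated Jan 5, 2021",
  c.pdf = "No date on this page"
)

# str_extract() returns the first match per element, or NA if there is none
first_date <- unname(str_extract(texts, date_pattern))

pricing <- data.frame(
  file = names(texts),
  date_text = first_date,
  date = mdy(first_date, quiet = TRUE),  # quiet = TRUE silences the NA parse warning
  stringsAsFactors = FALSE
)
pricing
```

With real files, `texts` would instead be built by collapsing the pages returned by `pdf_text()` for each entry of `list.files(pattern = "pdf$")`, as in the question.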
Jingxin Zhang

Here is the regex I tried, and it is working for me. One small correction: I need the second date in each file, not the first as I posted earlier. The pattern also expects a comma after the day and accepts upper-case month names. Note that the backslashes are doubled because the pattern is an R string.

str_extract_all(text_collapsed_data, "(\\b(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4})|(\\b(JAN(UARY)?|FEB(RUARY)?|MAR(CH)?|APR(IL)?|MAY|JUN(E)?|JUL(Y)?|AUG(UST)?|SEP(TEMBER)?|OCT(OBER)?|NOV(EMBER)?|DEC(EMBER)?)\\s+\\d{1,2},\\s+\\d{4})", simplify = TRUE)[, 2]
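
As a possible simplification, the duplicated upper-case alternative could be replaced by a case-insensitive match. The sketch below assumes stringr's `regex(..., ignore_case = TRUE)` modifier; `month`, `pattern`, `dates` and `second_date` are illustrative names, and the sample string is made up. Unlike the pattern above, this version also matches lower-case and mixed-case month names, and it returns `NA` rather than failing when a file contains fewer than two dates.

```r
library(stringr)

# Month alternatives written once; case is handled by the regex() modifier
month <- paste0(
  "(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|",
  "Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)
pattern <- regex(paste0("\\b", month, "\\s+\\d{1,2},\\s+\\d{4}"), ignore_case = TRUE)

# Hypothetical sample standing in for the collapsed PDF text
text_collapsed_data <- "Issued NOVEMBER 19, 2020\nEffective December 1, 2020"

# All matches for the single input string, in order of appearance
dates <- str_extract_all(text_collapsed_data, pattern)[[1]]

# Take the second date if there is one, otherwise NA
second_date <- if (length(dates) >= 2) dates[2] else NA_character_
second_date
```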