Extract a subpart of PDF text in R

Question

I have a folder of .pdf files. For each one I want to take the first two paragraphs of text and store them in a .csv file. I can convert the PDF to text, but I cannot extract the first two paragraphs.

This is what I have tried:

setwd("D:/All_PDF_Files")  # the drive letter needs a colon
install.packages("pdftools")
install.packages("qdapRegex")
library(pdftools)
library(qdapRegex)
All_files <- Sys.glob("*.pdf")
txt <- pdf_text("first.pdf")  # one string per page
cat(txt[1])
rm_between(txt, 'This ', '1. ', extract=TRUE)[[1]]

But this gives me "NA".

The output of cat(txt[1]) is:

"Maharashtra Real Estate Regulatory Authority
                                         REGISTRATION CERTIFICATE OF PROJECT
                                                             FORM 'C'
                                                           [See rule 6(a)]
This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;
   1. Goel Ganga Developers (I) Pvt Ltd having its registered office / principal place of business at Tehsil: Pune City,
      District: Pune, Pin: 411001.
   2. This registration is granted subject to the following conditions, namely:"

What I want to extract is this text:

This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;

Is there a better approach to go with?

Look into the `read_pdf` function from the [textreadr](https://cran.r-project.org/web/packages/textreadr/textreadr.pdf) package, where _a number of lines can be skipped before beginning to read data_, to suit your purpose here. – parth, Sep 14 '17 at 07:19
Using read_pdf: `s=read_pdf("D:/All_PDF_Files/first.pdf", skip = 4, remove.empty = TRUE, trim = TRUE)` followed by `s$text[1:4]` returns each row as a separate element, not as one single line. – Andre_k, Sep 14 '17 at 07:38
After the step above, won't simply removing the row containing "Maharashtra......" solve the problem? – parth, Sep 14 '17 at 08:40

score 2 · Accepted Answer · edited Nov 17 '21 at 04:37

library(pdftools)

setwd("D:/All_PDF_Files")
All_files <- Sys.glob("*.pdf")

df <- data.frame()
for (i in seq_along(All_files))
{
  txt <- pdf_text(All_files[i])

  file_name <- All_files[i]
  # split page 1 into lines; a single regex is used because strsplit() recycles
  # a vector `split` along x, so only "\r\n" would be applied to this one string
  page_lines <- unlist(strsplit(txt[1], split = "\r\n|\r|\n"))
  # skip first 4 header rows (you may need to adjust this count according to your use case)
  FirstPara <- page_lines[1+4]
  SecondPara <- page_lines[2+4]

  df <- rbind(df, cbind(file_name, FirstPara, SecondPara))
}
df
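
One detail worth checking when adapting the loop above: `strsplit()` recycles a vector `split` along `x`, so with a single page string only the first separator is tried. The sketch below is base R only and uses a shortened stand-in for `txt[1]` (the `page` string is an illustrative assumption, not real `pdf_text()` output). It compares the two forms and then joins the four lines after the header into one string.

```r
# Illustrative, shortened stand-in for txt[1]; real text comes from pdf_text()
page <- paste(
  "Maharashtra Real Estate Regulatory Authority",
  "REGISTRATION CERTIFICATE OF PROJECT",
  "FORM 'C'",
  "[See rule 6(a)]",
  "This registration is granted ... project registration number :",
  "P52100000255",
  "Project: Ganga Legend A3 And B3.., Plot Bearing ... 339 P ,",
  "Village Bavdhan Budruk, ... Pune, 411001;",
  "   1. Goel Ganga Developers (I) Pvt Ltd ...",
  sep = "\n"
)

# With a vector `split`, only "\r\n" is applied to this single string,
# so a page that uses "\n" line endings is not split at all
vector_split <- strsplit(page, split = c("\r\n", "\r", "\n"))[[1]]

# A single regex alternation handles all three line endings
page_lines <- strsplit(page, split = "\r\n|\r|\n")[[1]]

# The 4 lines after the 4 header rows, joined into one string
n_header <- 4
wanted <- paste(trimws(page_lines[(1:4) + n_header]), collapse = " ")

length(vector_split)  # 1
length(page_lines)    # 9
```

The count of four content lines is taken from the sample page in the question; other PDFs may need a different range.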

edited Nov 17 '21 at 04:37

Nimantha


answered Sep 14 '17 at 07:47

Prem


This gets only about 20% of the way there. Please check the expected text I want to extract: I want to drop the first 4 header rows and extract just this text: "This registration is granted under section 5 of the Act to the following project under project registration number : P52100000255 Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P , Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001; " – Andre_k Sep 14 '17 at 08:01
Your code produces output in three columns; I expect only two columns, "filename" and "text", and the text column should contain the expected text. – Andre_k Sep 14 '17 at 08:03
I believe if you print `strsplit(txt[1], split=c("\r\n", "\r", "\n"))` and see how many list items these 4 header rows constitute then you can simply replace `4` (in the above code) with that number and it should give you the desired result. – Prem Sep 14 '17 at 08:47
In order to get only two columns in `df` you can replace code after `file_name <-` with `text <- paste(unlist(strsplit(txt[1], split=c("\r\n", "\r", "\n")))[(1:2)+4], collapse = ";"); df <- rbind(df, cbind(file_name, text))`. – Prem Sep 14 '17 at 08:51
Your answer extracts only "This registration is granted under section 5 of the Act to the following project under project registration number : P52100000255" and not the other text mentioned in the expected output. – Andre_k Sep 14 '17 at 09:21
Please provide the output of `unlist(strsplit(txt[1], split=c("\r\n", "\r", "\n")))[(1:20)]` – Prem Sep 14 '17 at 09:46
[1] "Maharashtra Real Estate Regulatory Authority" [2] "REGISTRATION CERTIFICATE OF PROJECT" [3] " FORM 'C'" [4] "[See rule 6(a)]" [5] "This registration is granted under section 5 of the Act to the following project under project registration number :" [6] "P50500003509" [7] "Project: Vista 3b, Plot Bearing / CTS / Survey / Final Plot No.: Kh No.97/1 Pt 97/2 Pt 97/3 pt 97/4 pt at Pipla, Nagpur" [8] "(Rural), Nagpur, 440034;" [9] " 1. Luxora Infrastructure Private Limited having its registered office / principal place of business at Tehsil: Kurla," [10] " District: Mumbai Suburban, – Andre_k Sep 14 '17 at 10:02
The comment above is the output of `unlist(strsplit(txt[1], split=c("\r\n", "\r", "\n")))[(1:20)]` – Andre_k Sep 14 '17 at 10:03
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/154428/discussion-between-prem-and-deepesh). – Prem Sep 14 '17 at 10:06
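
If the number of header rows or the length of the paragraph varies between files, fixed positions can break. A hedged alternative is to locate the block by its content instead: start at the first line beginning with "This registration" and stop before the first numbered clause. The sketch below is base R only. `extract_intro` is a hypothetical helper name, and both patterns are assumptions based on the two sample outputs in this thread.

```r
# Hypothetical helper: returns the lines from the first match of `start_pat`
# up to (not including) the first later line matching `end_pat`, as one string
extract_intro <- function(page_lines,
                          start_pat = "^\\s*This registration",
                          end_pat = "^\\s*1\\.\\s") {
  start <- grep(start_pat, page_lines)[1]
  if (is.na(start)) return(NA_character_)
  after <- seq_along(page_lines) > start
  end <- which(after & grepl(end_pat, page_lines))[1]
  # if no numbered clause is found, run to the end of the page
  last <- if (is.na(end)) length(page_lines) else end - 1
  paste(trimws(page_lines[start:last]), collapse = " ")
}

# Lines adapted from the sample output posted in the comments above
page_lines <- c(
  "Maharashtra Real Estate Regulatory Authority",
  "REGISTRATION CERTIFICATE OF PROJECT",
  " FORM 'C'",
  "[See rule 6(a)]",
  "This registration is granted under section 5 of the Act to the following project under project registration number :",
  "P50500003509",
  "Project: Vista 3b, Plot Bearing / CTS / Survey / Final Plot No.: Kh No.97/1 Pt 97/2 Pt 97/3 pt 97/4 pt at Pipla, Nagpur",
  "(Rural), Nagpur, 440034;",
  " 1. Luxora Infrastructure Private Limited having its registered office / principal place of business at Tehsil: Kurla,"
)

intro <- extract_intro(page_lines)
```

In the loop, `page_lines` would come from splitting `txt[1]`, and `intro` would fill the single "text" column requested in the comments.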

Andre_k · Answer 2 · 2017-09-15T07:32:51.533

Posting the answer I ended up with, based on @Prem's code, in case anyone needs it.

library(pdftools)

All_files <- Sys.glob("*.pdf")

df <- data.frame()
for (i in seq_along(All_files))
{
  txt <- pdf_text(All_files[i])

  file_name <- All_files[i]

  # split page 1 into lines once; strsplit() recycles a vector `split` along x,
  # so a single regex is needed to handle all three line endings
  page_lines <- unlist(strsplit(txt[1], split = "\r\n|\r|\n"))

  # offsets are relative to the 4 header rows
  FirstPara <- page_lines[1+4]
  SecondPara <- page_lines[2+4]   # registration number
  ThirdPara <- page_lines[3+4]
  # keep the text between the first ":" and the next "," (the project name)
  ThirdPara_new <- sub("[^:]+:\\s*([^,]+),.*", "\\1", ThirdPara)
  t1 <- page_lines[4+4]
  t2 <- page_lines[5+4]
  conct <- paste(t1, t2)
  # strip everything up to the clause number and everything from "having" or "son" onwards
  FourthPara <- gsub(".*1. \\s*|having.*|son.*", "", conct)

  df <- rbind(df, cbind(file_name, SecondPara, ThirdPara_new, FourthPara))

}
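
The question also asks for the result to be stored in a .csv file, which the loop above does not do. A minimal sketch follows, assuming the per-file results are collected in a list and combined once at the end. The file names, the shortened text values, and the output file name `paragraphs.csv` are placeholders, not values from the thread.

```r
# Placeholder rows standing in for the per-file results built in the loop
rows <- list(
  data.frame(file_name = "first.pdf",
             text = "This registration ... Pune City, Pune, 411001;",
             stringsAsFactors = FALSE),
  data.frame(file_name = "second.pdf",
             text = "This registration ... (Rural), Nagpur, 440034;",
             stringsAsFactors = FALSE)
)

# Combine once instead of growing the data frame inside the loop
df <- do.call(rbind, rows)

# write.csv() quotes character fields, so commas inside the text are safe
out_file <- file.path(tempdir(), "paragraphs.csv")
write.csv(df, out_file, row.names = FALSE)

# Read the file back to confirm it round-trips
back <- read.csv(out_file, stringsAsFactors = FALSE)
```

Replacing `tempdir()` with the target folder would write the file next to the PDFs.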
