Questions tagged [pdftools]

An R package for Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high quality rendering of PDF documents into PNG, JPEG, TIFF format, or into raw bitmap vectors for further processing in R.

97 questions
15
votes
2 answers

Extract Text from Two-Column PDF with R

I have a lot of PDFs which are in two-column format. I am using the pdftools package in R. Is there a way to read each PDF according to the two-column format without cropping each PDF individually? Each PDF consists of selectable text, and the…
tsouchlarakis
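One possible approach to the two-column question, sketched under assumptions: if word positions are available (for example from pdftools::pdf_data(), which returns one data frame per page with x, y, and text columns), the words can be divided at a vertical split line and each half read top to bottom. The helper name two_column_text and the toy coordinates below are hypothetical, and the sketch assumes words on the same visual line share the same y value, which real PDFs may not guarantee.

```r
# Sketch: rebuild reading order for a two-column page from word positions.
# `words` is assumed to look like one element of pdftools::pdf_data():
# a data frame with at least the columns x, y, and text.
two_column_text <- function(words, split_x) {
  # Collapse the words of one column into lines, top to bottom.
  column_lines <- function(w) {
    w <- w[order(w$y, w$x), ]
    lines <- split(w$text, w$y)
    unname(vapply(lines, paste, character(1), collapse = " "))
  }
  is_left <- words$x < split_x
  c(column_lines(words[is_left, ]), column_lines(words[!is_left, ]))
}

# Toy input standing in for real pdf_data() output (hypothetical values).
words <- data.frame(
  x = c(10, 40, 300, 330, 10),
  y = c(100, 100, 100, 100, 120),
  text = c("Left", "one", "Right", "one", "Left2"),
  stringsAsFactors = FALSE
)
two_column_text(words, split_x = 250)
```

With a real file, split_x could be taken as half the page width reported by pdftools::pdf_pagesize(), and y values may need rounding before grouping into lines.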
5
votes
1 answer

Split string into columns based on paragraph endings from an OCR'd image

I'm working on a project to convert typewritten War Diary notes into text, from PDF scans. I can successfully (maybe 90% with the original, non-resized file) extract the main text, which I crop first. Reprex data: You could try this from the…
Corey Pembleton
4
votes
2 answers

Using the pdf_data function from the pdftools package efficiently

The end goal is to use the pdftools package to efficiently move through a thousand pages of PDF documents to consistently, and safely, produce a usable data frame/tibble. I have attempted to use the tabulizer package, and pdf_text functions, but…
James Crumpler
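As a rough illustration of the goal described in this question, the per-page data frames that pdftools::pdf_data() returns can be stacked into a single table with a page column. This is a minimal sketch using base R only; the helper name bind_pages and the toy pages are assumptions, not part of the original question.

```r
# Sketch: stack per-page word tables into one data frame with a page column.
# `pages` is assumed to be a list of data frames, as pdftools::pdf_data() returns.
bind_pages <- function(pages) {
  tagged <- Map(function(df, i) {
    # rep() keeps this valid for blank pages with zero rows.
    cbind(page = rep(i, nrow(df)), df)
  }, pages, seq_along(pages))
  do.call(rbind, tagged)
}

# Toy pages standing in for real pdf_data() output (hypothetical values).
pages <- list(
  data.frame(x = c(10, 50), y = c(20, 20), text = c("Hello", "world"),
             stringsAsFactors = FALSE),
  data.frame(x = 10, y = 20, text = "Again", stringsAsFactors = FALSE)
)
all_words <- bind_pages(pages)
```

For a thousand pages, calling rbind once on the whole list, as here, tends to be cheaper than growing a data frame inside a loop.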
3
votes
1 answer

R pdftools: Combine multiple pages into a single page

The pdf_combine function from the pdftools R package can be used to combine different PDF documents. pdftools::pdf_combine( input = list( "Page1.pdf" , "Page2.pdf" …
MYaseen208
3
votes
0 answers

Unable to load R Package: 'pdftools'

First time posting here and new to R. I am having trouble loading pdftools into RStudio for text mining. #1 - I am able to install the package successfully #2 - Once I attempt to load library(pdftools) I receive the following output. Error:…
movies
3
votes
3 answers

Convert scanned PDF to searchable PDF (in R)

I'm trying to convert a series of scanned PDFs into searchable PDFs using the tesseract and pdftools packages. I've accomplished two steps. Now I need to write back to a searchable PDF. (1) Read scanned PDF (2) Run OCR (3) Write back to a searchable PDF eg <-…
Thomas Speidel
3
votes
2 answers

Scraping PDF tables with empty cells

I'm using R to pull data from PDFs and so far it has been going well. I just opened up a new batch of PDFs and saw that I have to figure out how to account for empty cells. I haven't found a way to do this, and I have hundreds of pages that I need…
pkpto39
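One way to account for empty cells, sketched under the assumption that the extracted text keeps the table's column alignment: slice each row at fixed character positions instead of splitting on whitespace, so that a blank cell becomes NA rather than shifting later values to the left. The helper name parse_fixed_rows, the column boundaries, and the toy rows are all hypothetical; real boundaries would have to be read off the actual extracted text.

```r
# Sketch: slice fixed-width text rows at known column boundaries so that
# blank cells become NA instead of shifting later values to the left.
parse_fixed_rows <- function(lines, starts, ends, col_names) {
  cells <- lapply(seq_along(starts), function(j) {
    value <- trimws(substring(lines, starts[j], ends[j]))
    value[value == ""] <- NA_character_
    value
  })
  names(cells) <- col_names
  as.data.frame(cells, stringsAsFactors = FALSE)
}

# Toy rows built with sprintf() to mimic aligned text (hypothetical data).
lines <- sprintf("%-10s%-10s%-10s",
                 c("Alice", "Bob", "Carol"),
                 c("12", "", "78"),
                 c("34", "56", ""))

tbl <- parse_fixed_rows(lines,
                        starts = c(1, 11, 21),
                        ends = c(10, 20, 30),
                        col_names = c("name", "a", "b"))
```

Splitting the same rows on runs of spaces would put "56" in column a for Bob, which is the failure this layout-based slicing avoids.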
3
votes
1 answer

Scraping a Table from a PDF File

I am trying to scrape the first table of multiple PDFs that look quite similar. So far I have isolated the page of the table, converted the table to a string and loaded it into R. Additionally, I also managed to remove the parts of the table I am…
fabla
3
votes
0 answers

Reading tables from PDF in R

I have a PDF with many tables in it, and I'm trying to parse them into a more readable format using R. So far, I've tried two methods: using pdftools::pdf_text() to get the text, then basically using regexes to manually read in the tables (honestly…
AWhite
2
votes
0 answers

R session aborted due to fatal error when running pdftools

Running this code, I received the error "R session aborted: R encountered a fatal error". I tried uninstalling and reinstalling R 4.3.0 and the most recent version of RStudio. This is the code I…
2
votes
0 answers

R Shiny how to show status messages from the console (pdf_ocr_text)

When I use pdf_ocr_text from pdftools, for example: text1 <- pdf_ocr_text("0.pdf", dpi = 300), it will show the status in the R console like below. Converting page 1 to 0_1.png... done! Converting page 2 to 0_2.png... done! Converting page 3 to…
Subaru Spirit
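A general R pattern that may apply here is to intercept status output with withCallingHandlers() and pass each line to a callback, which in a Shiny app could update a notification or a reactive value. This sketch assumes the status lines are emitted with message(); whether pdf_ocr_text() does so is not confirmed by the question, and output written with cat() or print() would need capture.output() or sink() instead. The names with_status and fake_ocr are hypothetical, and fake_ocr only mimics the style of the output quoted above.

```r
# Sketch: intercept message() output from a long-running call and forward
# each line to a callback (in Shiny this could update a notification).
with_status <- function(expr, on_status) {
  withCallingHandlers(
    expr,
    message = function(m) {
      # conditionMessage() keeps the trailing newline, so strip it.
      on_status(sub("\n$", "", conditionMessage(m)))
      invokeRestart("muffleMessage")  # keep it out of the console
    }
  )
}

# Stand-in for the OCR call; it only imitates the status lines.
fake_ocr <- function() {
  message("Converting page 1 to 0_1.png... done!")
  message("Converting page 2 to 0_2.png... done!")
  "extracted text"
}

status_log <- character(0)
result <- with_status(fake_ocr(), function(txt) {
  status_log <<- c(status_log, txt)
})
```

The wrapped call still returns its normal value, so result holds the text while status_log collects the progress lines.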
2
votes
0 answers

How to extract the title of each page from a PDF using Python

I want to extract the title of each page of a PDF, but my PDFs do not have a similar or predefined title size (the title size varies on every page). I tried the following code, but it's not giving me the expected output; instead it's extracting the whole text…
2
votes
2 answers

Split a PDF file into multiple files every 2 pages in R

I have a PDF document with 300 pages. I need to split this file into 150 files, each containing 2 pages. For example, the 1st document would contain pages 1 & 2 of the original file, the 2nd document pages 3 & 4, and so on. Maybe I can use the…
Dani
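The page arithmetic in this question can be sketched in base R, with the actual file writing left to pdftools::pdf_length() and pdftools::pdf_subset(). The function names page_groups and split_pdf_every and the output naming scheme are hypothetical choices for illustration; only page_groups is exercised below, since split_pdf_every needs pdftools and a real input file.

```r
# Sketch: group page numbers into consecutive chunks of `per_file` pages.
page_groups <- function(n_pages, per_file = 2) {
  pages <- seq_len(n_pages)
  unname(split(pages, ceiling(pages / per_file)))
}

# Write one output PDF per group. Requires the pdftools package;
# the "part_001.pdf" naming scheme is a hypothetical choice.
split_pdf_every <- function(input, per_file = 2, prefix = "part") {
  groups <- page_groups(pdftools::pdf_length(input), per_file)
  for (i in seq_along(groups)) {
    pdftools::pdf_subset(input, pages = groups[[i]],
                         output = sprintf("%s_%03d.pdf", prefix, i))
  }
  invisible(length(groups))
}

groups <- page_groups(300)
length(groups)   # 150
groups[[2]]      # 3 4
```

If the page count is not a multiple of per_file, the last group is simply shorter, so the final output file holds the leftover pages.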
2
votes
4 answers

Create table from wrapped text in R

Edited: From the text in a variable named a, I would like to obtain a table in which the description cell is unwrapped. a <- " category variable description value A A This is variable named as…
mario19088
2
votes
2 answers

R: Find element of a list to extract a table from a PDF

I'm trying to use the pdftools package to extract a data table from a PDF. My source file is here: https://hypo.org/app/uploads/sites/2/2021/11/HYPOSTAT-2021_vdef.pdf. Say I want to extract data from Table 20 on page 170 (Change in Nominal house price). I…
Chris