The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.
Questions tagged [pdf-scraping]
144 questions
416
votes
13 answers
Python module for converting PDF to text
Is there any Python module to convert PDF files into text? I tried a piece of code found on ActiveState that uses pypdf, but the generated text had no spaces between words and was of no use.

cnu
51
votes
5 answers
Reading data from PDF files into R
Is that even possible!?!
I have a bunch of legacy reports that I need to import into a database. However, they're all in PDF format. Are there any R packages that can read PDFs? Or should I leave that to a command-line tool?
The reports were made…

Justin
51
votes
3 answers
Extract / Identify Tables from PDF python
Are there any open source libraries that support table identification & extraction?
By this I mean:
Identify that a table structure exists
Classify the table from its contents
Extract data from the table in a useful output format e.g. JSON / CSV…

Alexander McFarlane
31
votes
10 answers
How to unlock a "secured" (read-protected) PDF in Python?
In Python, I'm using pdfminer to read the text from a PDF with the code below this message. I now get an error message saying:
File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
raise…

kramer65
24
votes
4 answers
Recognize PDF table using R
I'm trying to extract data from tables inside some PDF reports.
I've seen some examples using pdftools and similar packages, and I was successful in getting the text. However, I just want to extract the tables.
Is there a way to use R to recognize…

RCS
15
votes
7 answers
Scraping large pdf tables which span across multiple pages
I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs…

Tomas
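For readers weighing the `pdftotext -layout` route mentioned in this question, here is a rough sketch of one way to post-process its output. It is not taken from the question or its answers. It assumes columns are separated by runs of two or more spaces and that every cell is filled; the function names `split_layout_line` and `parse_layout_table` and the sample text are hypothetical.

```python
import re


def split_layout_line(line: str) -> list[str]:
    """Split one layout-preserved line into cells on runs of 2+ whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_layout_table(text: str) -> list[list[str]]:
    """Collect rows that look tabular (more than one cell) from layout-preserved text."""
    rows = []
    for line in text.splitlines():
        cells = split_layout_line(line)
        # Blank lines and single-cell lines (titles, page footers) are skipped,
        # so rows from consecutive pages end up in one list.
        if len(cells) > 1:
            rows.append(cells)
    return rows


# Hypothetical sample imitating layout-preserved text across a page break.
sample = (
    "Name          Qty    Price\n"
    "Widget A       10     2.50\n"
    "\n"
    "Page 1\n"
    "Widget B        3    11.00\n"
)
rows = parse_layout_table(sample)
for row in rows:
    print(row)
```

This heuristic breaks down when a cell is empty or when two columns are separated by a single space; in those cases splitting on fixed character offsets taken from the header row may work better.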
12
votes
2 answers
How to read pdf file using pdfminer3k?
I am using Python 3.5 and I want to read the text, line by line, from PDF files. I was trying to use pdfminer3k but could not find the proper syntax anywhere.
How do I use it correctly?

poshita singh
10
votes
3 answers
Parsing pdf files
I have a requirement to split a large PDF document into smaller files based on the content of the file. We use BCL easyPDF to manipulate PDF files. easyPDF can split PDF documents based on a page number, but it cannot split the document based on the…

desi
9
votes
1 answer
Is there a Google Image Search API?
I'm searching for an API or a program (preferably Python and open source) that lets me download the first n pictures of a Google Image Search for, let's say, bicycles. It would also be helpful if it could download the first n .pdf files from a normal…

technical_difficulty
6
votes
6 answers
What is the best way to extract data from a PDF?
I have thousands of PDF files that I need to extract data from. This is an example PDF. I want to extract this information from the example PDF.
I am open to Node.js, Python, or any other effective method. I have little knowledge of Python and Node.js.…

e.iluf
6
votes
2 answers
Tabulizer package in R: how to scrape tables after a specific title
How can I scrape tables preceded by some title text from a PDF?
I am experimenting with the tabulizer package. Here is an example of getting a table from a specific page (Polish "Map of Public Health…

Jacek Kotowski
6
votes
1 answer
I want to scrape a Hindi (Indian language) PDF file with Python
I have written Python code that scrapes all the data from the PDF file. The problem is that once it is scraped, the words lose their grammar. How can I fix this problem?
I am attaching the code.
from pdfminer.pdfinterp import PDFResourceManager,…

Abhinav Mishra
5
votes
0 answers
pdftotext get font information (font-family, style, size)
I'm using "pdftotext -bbox file.pdf" to convert a PDF file into HTML.
Here's a sample line from the output:
foo
Is there a way to get font information for every word…

James Kroning
4
votes
4 answers
Working on tables in pdf using python
I am working on a PDF file. There are a number of tables in that PDF.
Based on the table names given in the PDF, I want to fetch the data from each table using Python.
I have worked on HTML and XML parsing but never with PDF.
Can anyone tell me how…

sam
4
votes
1 answer
Programmatically replace text in PDF
I have PDF files with text that should be replaced. More specifically, the text should be translated and replaced with the translated version.
It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs…

BramD