Questions tagged [pdf-scraping]

The process of getting data out of a PDF. This involves opening, reading, and parsing the contents of the PDF to extract text, images, metadata, or attachments.

144 questions
416
votes
13 answers

Python module for converting PDF to text

Is there any Python module to convert PDF files into text? I tried one piece of code found on ActiveState which uses pypdf, but the generated text had no spaces between words and was of no use.
cnu
  • 36,135
  • 23
  • 65
  • 63
51
votes
5 answers

Reading data from PDF files into R

Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool? The reports were made…
Justin
  • 42,475
  • 9
  • 93
  • 111
51
votes
3 answers

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction? By this I mean: identify that a table structure exists; classify the table from its contents; extract data from the table in a useful output format, e.g. JSON / CSV…
Alexander McFarlane
  • 10,643
  • 9
  • 59
  • 100
31
votes
10 answers

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…
kramer65
  • 50,427
  • 120
  • 308
  • 488
24
votes
4 answers

Recognize PDF table using R

I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize…
RCS
  • 263
  • 1
  • 2
  • 9
15
votes
7 answers

Scraping large pdf tables which span across multiple pages

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs…
Tomas
  • 57,621
  • 49
  • 238
  • 373
12
votes
2 answers

How to read pdf file using pdfminer3k?

I am using Python 3.5 and I want to read the text, line by line, from PDF files. I was trying to use pdfminer3k but could not find the proper syntax anywhere. How do I use it correctly?
poshita singh
  • 131
  • 1
  • 1
  • 9
10
votes
3 answers

Parsing pdf files

I have a requirement to split a large pdf document into smaller files based on the content of the file. We use BCL easyPDF to manipulate pdf files. easyPDF can split pdf documents based on a page number, but it cannot split the document based on the…
desi
  • 793
  • 2
  • 7
  • 8
9
votes
1 answer

Is there a Google Image Search API?

I'm searching for an API or a program (preferably Python and open-source) which lets me download the first n pictures of a Google Image Search for let's say bicycles. It would also be helpful if it could download the first n .pdf files from a normal…
6
votes
6 answers

What is the best way to extract data from PDF

I have thousands of PDF files that I need to extract data from. This is an example PDF. I want to extract this information from the example PDF. I am open to Node.js, Python, or any other effective method. I have little knowledge of Python and Node.js.…
e.iluf
  • 1,389
  • 5
  • 27
  • 69
6
votes
2 answers

Tabulizer package in R: how to scrape tables after specific Title

How to scrape tables preceded by some title text from a PDF? I am experimenting with the tabulizer package. Here is an example of getting a table from a specific page (Polish "Map of Public Health…
Jacek Kotowski
  • 620
  • 16
  • 49
6
votes
1 answer

I want to scrape a Hindi (Indian language) PDF file with Python

I have written Python code that scrapes all the data from the PDF file. The problem is that once it is scraped, the words lose their grammar. How can I fix this problem? I am attaching the code. from pdfminer.pdfinterp import PDFResourceManager,…
5
votes
0 answers

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: foo Is there a way to get font information for every word…
4
votes
4 answers

Working on tables in pdf using python

I am working on a PDF file. There are a number of tables in that PDF. According to the table names given in the PDF, I want to fetch the data from those tables using Python. I have worked on HTML and XML parsing but never with PDF. Can anyone tell me how…
sam
  • 18,509
  • 24
  • 83
  • 116
4
votes
1 answer

Programmatically replace text in PDF

I have PDF files with text that should be replaced. More specifically, the text should be translated and replaced with the translated version. It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs…
BramD
  • 49
  • 1
  • 2