Highest Voted 'pdf-scraping' Questions

416

votes

13 answers

Python module for converting PDF to text

Is there a Python module to convert PDF files into text? I tried a piece of code found on ActiveState that uses pypdf, but the generated text had no spaces between words and was of no use.

asked Aug 25 '08 at 04:44

cnu

36,135
23
65
63

51

votes

5 answers

Reading data from PDF files into R

Is that even possible!?! I have a bunch of legacy reports that I need to import into a database. However, they're all in pdf format. Are there any R packages that can read pdf? Or should I leave that to a command line tool? The reports were made…

linux r pdf scrape pdf-scraping

asked Feb 07 '12 at 23:46

Justin

42,475
9
93
111

51

votes

3 answers

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction? By this I mean: identify that a table structure exists; classify the table from its contents; extract data from the table in a useful output format, e.g. JSON / CSV…

python pdf scrape pdf-parsing pdf-scraping

asked Feb 16 '15 at 00:04

Alexander McFarlane

10,643
9
59
100

31

votes

10 answers

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…

python pdf pdfminer pdf-scraping

asked Jan 28 '15 at 13:02

kramer65

50,427
120
308
488

24

votes

4 answers

Recognize PDF table using R

I'm trying to extract data from tables inside some PDF reports. I've seen some examples using pdftools and similar packages, and I was successful in getting the text; however, I just want to extract the tables. Is there a way to use R to recognize…

r text-mining pdf-scraping

asked May 23 '17 at 17:15

RCS

263
1
2
9

15

votes

7 answers

Scraping large pdf tables which span across multiple pages

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is that the resultant text file is not easy to work with, as the table layout differs…

r perl ms-access pdf-scraping

asked Aug 06 '13 at 10:58

Tomas

57,621
49
238
373

12

votes

2 answers

How to read pdf file using pdfminer3k?

I am using Python 3.5 and I want to read the text line by line from PDF files. I was trying to use pdfminer3k but cannot find the proper syntax anywhere. How do I use it correctly?

python-3.x python-3.5 pdf-scraping

asked May 17 '17 at 12:20

poshita singh

131
1
1
9

10

votes

3 answers

Parsing pdf files

I have a requirement to split a large pdf document into smaller files based on the content of the file. We use BCL easyPDF to manipulate pdf files. easyPDF can split pdf documents based on a page number, but it cannot split the document based on the…

c# parsing pdf pdf-scraping

asked May 03 '12 at 18:19

desi

793
2
7
8

9

votes

1 answer

Is there a Google Image Search API?

I'm searching for an API or a program (preferably Python and open-source) which lets me download the first n pictures of a Google Image Search for let's say bicycles. It would also be helpful if it could download the first n .pdf files from a normal…

python web-scraping google-image-search pdf-scraping

asked Apr 07 '16 at 12:03

technical_difficulty

427
2
5
20

6

votes

6 answers

What is the best way to extract data from a PDF?

I have thousands of PDF files that I need to extract data from. This is an example PDF. I want to extract this information from the example PDF. I am open to Node.js, Python, or any other effective method. I have little knowledge of Python and Node.js.…

python node.js pdf pdf-scraping

asked Sep 14 '19 at 21:42

e.iluf

1,389
5
27
69

6

votes

2 answers

Tabulizer package in R: how to scrape tables after specific Title

How can I scrape tables preceded by some title text from a PDF? I am experimenting with the tabulizer package. Here is an example of getting a table from a specific page (Polish "Map of Public Health…

r web-scraping tidyverse pdf-scraping tabulizer

asked Jan 28 '19 at 14:08

Jacek Kotowski

620
16
49

6

votes

1 answer

I want to scrape a Hindi (Indian language) PDF file with Python

I have written Python code that scrapes all the data from the PDF file. The problem is that once it is scraped, the words lose their grammar. How can I fix this problem? I am attaching the code. from pdfminer.pdfinterp import PDFResourceManager,…

python pdf ocr pdfminer pdf-scraping

asked Mar 14 '16 at 18:50

Abhinav Mishra

195
13

5

votes

0 answers

pdftotext get font information (font-family, style, size)

I'm using "pdftotext -bbox file.pdf" to convert a pdf file into HTML. Here's a sample line from the output: foo Is there a way to get font information for every word…

text-extraction pdftotext poppler pdf-scraping xpdf

asked May 06 '18 at 11:23

James Kroning

61
5

4

votes

4 answers

Working on tables in pdf using python

I am working on a PDF file. There are a number of tables in that PDF. I want to fetch the data from a table using Python, according to the table names given in the PDF. I have worked on HTML and XML parsing but never with PDF. Can anyone tell me how…

python pdf pdf-scraping

asked Mar 20 '12 at 07:42

sam

18,509
24
83
116

4

votes

1 answer

Programmatically replace text in PDF

I have PDF files with text that should be replaced. More specifically, the text should be translated and replaced with the translated version. It's important that the rest of the PDF structure stays intact. Note that the text is available in the PDFs…

pdf pdf-scraping

asked Jul 05 '11 at 23:50

BramD

49
1
2
