Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).
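As a small illustration of what a parser looks at first, here is a minimal sketch that reads the `%PDF-M.m` header comment a PDF file normally starts with. The function name `pdf_version` is hypothetical, and the check is deliberately strict: it assumes the header sits at byte 0, so files with leading junk before the header are rejected.

```python
import re
from typing import Optional

# The header comment "%PDF-M.m" is expected at the very start of the file.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version named in the header (e.g. "1.7"), or None if absent."""
    match = _HEADER.match(data)  # match() anchors at offset 0 only
    if match is None:
        return None
    major = match.group(1).decode("ascii")
    minor = match.group(2).decode("ascii")
    return f"{major}.{minor}"
```

A header check like this only tells you the file claims to be a PDF; it says nothing about whether the rest of the file is well-formed.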

Because the PDF format is complex (cf. the ISO 32000-1 specification), parsers are often incomplete (they cannot extract all information from every document) and exposed to security risks.

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python Related Options:

You can extract tables directly using camelot ("PDF Table Extraction for Humans").
You can process the PDF directly using tabula.
You can convert the PDF to text using pdftotext, then parse the text with Python.
You can use an external tool to convert the PDF file to Excel or CSV, then open the resulting file with a suitable Python module.
You can also convert the PDF to an image file, then use any recent OCR software (which can reconstruct the table automatically from the picture) to get the data, for example pdf2image with pytesseract.
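The "convert to text, then parse with Python" option can be sketched roughly as follows. This assumes the `pdftotext` command-line tool is installed and on the PATH; the helper names `pdf_to_text` and `parse_rows` and the sample text are hypothetical, and splitting on runs of two or more spaces is only a heuristic that will fail on tables with empty cells or single-space column gaps.

```python
import re
import subprocess
from typing import List


def pdf_to_text(pdf_path: str) -> str:
    """Run the external pdftotext tool and return the extracted text.

    "-layout" asks pdftotext to keep the physical layout of the page,
    and "-" as the output file sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def parse_rows(text: str) -> List[List[str]]:
    """Split each non-empty line into cells on runs of two or more spaces."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


# Hypothetical sample of layout-preserving text for a small table.
sample = """
Item        Qty    Price
Apples      3      1.20
Green tea   10     4.50
"""
rows = parse_rows(sample)
```

With a real file, the two steps would be chained as `parse_rows(pdf_to_text("some.pdf"))`, where the path is whatever document you are working with. For tables with ruled lines or irregular spacing, a dedicated tool such as camelot or tabula is likely a better fit than this heuristic.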

Related Questions:

177 questions

5 answers

How to extract a table as text from the PDF

I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF. Right now I am finding the table on each page manually. From there I am capturing that page and saving into another…

python pdf pdf-parsing

asked Nov 28 '17 at 14:23

venkat

1,203
3
16
37

3 answers

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction? By this I mean: identify that a table structure exists; classify the table from its contents; extract data from the table in a useful output format, e.g. JSON / CSV…

python pdf scrape pdf-parsing pdf-scraping

asked Feb 16 '15 at 00:04

Alexander McFarlane

10,643
9
59
100

6 answers

Ruby: Reading PDF files

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX). Until now I've found the rather old and simple PDF-toolkit (a pdftotext-wrapper) and PDF-reader, which was unable to read most of my files. Though the…

ruby-on-rails ruby pdf pdf-parsing

asked Apr 21 '09 at 15:31

Javier

2,491
4
36
57

1 answer

Extracting table contents from a collection of PDF files

I have a stack of PDFs - potentially hundreds or thousands. They are not all formatted the same, but any of them MAY have one or more tables with interesting information that I would like to collect into a separate database. Of course, I know I…

parsing pdf extract pdf-parsing

asked Jun 20 '13 at 15:04

elbillaf

1,952
10
37
73

1 answer

What is this (cid:51) in the output of pdf2txt?

So I'm trying to extract the text from a PDF file; I need its position, width, height, and font. I have tried many tools, but the most useful and complete solution looks to be PDFMiner, and in this case, more exactly pdf2txt.py. I have followed the doc and…

python xml pdf-parsing

asked May 13 '13 at 13:50

Micka

1,648
1
19
34

5 answers

Parsing a PDF with no /Root object using PDFMiner

I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs: ipython stack…

python pypdf pdf-parsing pdf-manipulation

asked Jul 08 '12 at 16:06

Louis Thibault

20,240
25
83
152

1 answer

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job. What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel. More specifically the data I am working with is from grocery…

excel pdf ocr screen-scraping pdf-parsing

asked Apr 25 '15 at 17:38

Casey Saunders

1 answer

How to scrape tables in thousands of PDF files?

I have about 1'500 PDFs consisting of only 1 page each, and exhibiting the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example). What I am looking for is a way to iterate over all these…

python node.js parsing web-scraping pdf-parsing

asked Aug 04 '14 at 18:27

grssnbchr

2,877
7
37
71

2 answers

PDF.js not rendering pdf correctly in IE

I am using PDF.js framework to render PDF. I am using base64 data to render PDF. But in IE 11 pdf looking blurry. See below screen from IE 11 See below code : var renderPDF = function(url, canvasContainer,data) { var scale= 0.9; //"zoom"…

javascript canvas pdf.js pdf-parsing pdf-rendering

asked Jul 21 '15 at 15:04

Tushar Ahirrao

12,669
17
64
96

0 answers

Same table is extracted twice from a pdf by Camelot-py

I am trying to extract tables from a multiple page PDF file using camelot-py v0.7.3. So far it has been the best pdf reader tool for me. I just needed to read pdf line by line and detect table manually. I tried many other tools such as tabula,…

python pdf-reader pdf-parsing python-camelot

asked Feb 21 '20 at 18:12

mk09

1 answer

Parse PDF in Node.js

I am using meteor-react for uploading PDF docs to my Node.js backend, where I want to read the uploaded PDF doc, as a json, or whatever. Is it possible? And what library/tool would you recommended for that? Thank you!

node.js pdf-parsing

asked Jan 03 '18 at 08:31

peter

2 answers

Looking for recommendation on how to convert PDF into structured format

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up…

python ruby parsing pdf pdf-parsing

asked Aug 19 '13 at 18:48

doremi

14,921
30
93
148

3 answers

Strange whitespaces when parsing a PDF

I need to parse a PDF document. I already implemented the parser using the iText library, and until now it worked without any problems. But now I need to parse another document which gets very strange whitespaces in the middle of words. As an example I…

java pdf whitespace itext pdf-parsing

asked Aug 10 '12 at 12:36

Prine

12,192
8
40
59

1 answer

Does Commercial use of GhostScript as Saas needs a licence ?

I was working on a project in which a user can upload a PDF and convert it into images, so I have used the GhostScript dll (gsdll32.dll). Now in my application I want to charge users a monthly subscription so that I can provide them more…

c# pdf open-source ghostscript pdf-parsing

asked Dec 09 '14 at 06:51

objectWithoutClass

1,631
3
14
15

3 answers

How to find Blank Page in pdf file

I can not detect blank page in pdf file. I have searched internet for it but could not find a good solution. Using Itextsharp I tried with page size, Xobjects. But they do not give exact result. I tried if(xobjects==null || textcontent==null…

c# .net pdf itext pdf-parsing

asked Jun 09 '12 at 15:30

Md Kamruzzaman Sarker

2,387
3
22
38
