Questions tagged [pdf-parsing]

Deals with extracting useful information from PDF content (for example, text or images)

PDF (Portable Document Format) is a binary format for digital documents. This tag covers parsing these documents, that is, extracting text, images, or other data from them, or converting them to simpler formats (such as plain text).

Because the PDF format is complex (cf. the specification, ISO 32000-1), PDF parsers are often incomplete (they cannot extract all information from every document) and are exposed to security risks.
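As a rough illustration of why untrusted PDFs deserve caution, the sketch below does a first-pass triage on raw bytes using only the Python standard library. The function name triage_pdf and the list of names it looks for are assumptions made for this example, not part of any tool mentioned here. A plain byte scan like this misses names written with #xx hex escapes and anything stored inside compressed streams, so treat it as a starting point rather than a scanner.

```python
import re

# PDF names that often deserve a closer look before a file is parsed further.
# Illustrative list chosen for this sketch; it is not exhaustive.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/Launch",
    b"/EmbeddedFile",
)


def triage_pdf(data: bytes) -> dict:
    """Return a rough report about the raw bytes of a (possible) PDF file."""
    counts = {}
    for name in SUSPICIOUS_NAMES:
        # The lookahead stops "/JS" from matching a longer name such as "/JSFoo".
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, data))
        if hits:
            counts[name.decode("ascii")] = hits
    return {
        # Look for the "%PDF-" header near the start of the data.
        "has_pdf_header": b"%PDF-" in data[:1024],
        # Look for the "%%EOF" marker near the end of the data.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # Only names that were actually found are reported.
        "names": counts,
    }
```

To try it on a file, read the bytes with open(path, "rb").read() and pass them to triage_pdf. A non-empty "names" entry is only a hint that the document carries scripts, auto-run actions or attachments; it does not prove the file is malicious.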

For example, pdf-parser is a command-line program that parses and analyses PDF documents. It provides features to extract raw data from PDF documents, like compressed images.

Python Related Options:

  • You may extract tables directly using camelot ("PDF Table Extraction for Humans")
  • You may process the PDF directly using tabula
  • You may convert the PDF to text using pdftotext, then parse the text with Python
  • You may use an external tool to convert your PDF file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file
  • You may also convert the PDF to an image file, then use any recent OCR software (which reconstructs tables automatically from the picture) to get the data
  • You may use pdf2image together with pytesseract for the OCR route
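Two of the routes above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the pdftotext binary (from Poppler) is on PATH, and for the OCR route the pdf2image and pytesseract packages are installed along with the Poppler and Tesseract programs they wrap. The helper names pdf_to_text, split_columns and ocr_pdf are made up for this example. Splitting on runs of two or more spaces is only a heuristic; it breaks on empty cells and on cells that themselves contain double spaces.

```python
import re
import subprocess
from typing import List


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text.

    -layout asks pdftotext to keep the physical page layout, which keeps
    table columns visually aligned; "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def split_columns(text: str, min_gap: int = 2) -> List[List[str]]:
    """Split each non-empty line into cells on runs of >= min_gap spaces."""
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            rows.append(gap.split(stripped))
    return rows


def ocr_pdf(pdf_path: str, dpi: int = 300) -> List[str]:
    """Render each page to an image and OCR it; returns one string per page."""
    # Imported lazily so the text-based helpers work without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page) for page in pages]
```

For a text-based PDF, split_columns(pdf_to_text("report.pdf")) gives a list of rows that can be written out with the csv module; "report.pdf" is a placeholder file name. The OCR route is usually the fallback for scanned documents that contain no text layer.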

Related Questions:

177 questions
53 votes, 5 answers

How to extract a table as text from the PDF

I have a PDF which contains tables, text and some images. I want to extract the tables wherever they occur in the PDF. Right now I am finding the tables on each page manually. From there I am capturing that page and saving into another…
venkat
51 votes, 3 answers

Extract / Identify Tables from PDF python

Are there any open source libraries that support table identification & extraction? By this I mean: identify that a table structure exists; classify the table from its contents; extract data from the table in a useful output format, e.g. JSON / CSV…
Alexander McFarlane
35 votes, 6 answers

Ruby: Reading PDF files

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX). Until now I've found the rather old and simple PDF-toolkit (a pdftotext-wrapper) and PDF-reader, which was unable to read most of my files. Though the…
Javier
29 votes, 1 answer

Extracting table contents from a collection of PDF files

I have a stack of PDFs - potentially hundreds or thousands. They are not all formatted the same, but any of them MAY have one or more tables with interesting information that I would like to collect into a separate database. Of course, I know I…
elbillaf
15 votes, 1 answer

What is this (cid:51) in the output of pdf2txt?

So I'm trying to extract the text from a PDF file, and I need its position, width, height and font. I have tried many tools, but the most useful and complete solution looks to be PDFMiner, and in this case, more exactly pdf2txt.py. I have followed the doc and…
Micka
14 votes, 5 answers

Parsing a PDF with no /Root object using PDFMiner

I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs: ipython stack…
Louis Thibault
13 votes, 1 answer

PDF Data and Table Scraping to Excel

I'm trying to figure out a good way to increase the productivity of my data entry job. What I am looking to do is come up with a way to scrape data from a PDF and input it into Excel. More specifically the data I am working with is from grocery…
Casey Saunders
13 votes, 1 answer

How to scrape tables in thousands of PDF files?

I have about 1'500 PDFs consisting of only 1 page each, and exhibiting the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf for an example). What I am looking for is a way to iterate over all these…
grssnbchr
11 votes, 2 answers

PDF.js not rendering pdf correctly in IE

I am using the PDF.js framework to render a PDF from base64 data, but in IE 11 the PDF looks blurry. See the screenshot from IE 11 below. See the code below: var renderPDF = function(url, canvasContainer,data) { var scale= 0.9; //"zoom"…
Tushar Ahirrao
10 votes, 0 answers

Same table is extracted twice from a pdf by Camelot-py

I am trying to extract tables from a multiple page PDF file using camelot-py v0.7.3. So far it has been the best pdf reader tool for me. I just needed to read pdf line by line and detect table manually. I tried many other tools such as tabula,…
mk09
10 votes, 1 answer

Parse PDF in Node.js

I am using meteor-react for uploading PDF docs to my Node.js backend, where I want to read the uploaded PDF doc as JSON, or whatever. Is it possible? And what library/tool would you recommend for that? Thank you!
peter
10 votes, 2 answers

Looking for recommendation on how to convert PDF into structured format

I would like to do some analysis on some properties listed in an upcoming auction. Unfortunately, the city running the auction does not publish the information in a structured format but instead provides a 700+ page PDF of the properties going up…
doremi
10 votes, 3 answers

Strange whitespaces when parsing a PDF

I need to parse a PDF document. I already implemented the parser using the iText library, and until now it worked without any problems. But now I need to parse another document which gets very strange whitespaces in the middle of words. As example I…
Prine
9 votes, 1 answer

Does Commercial use of GhostScript as Saas needs a licence ?

I am working on a project in which a user can upload a PDF and convert it into images, for which I have used the GhostScript DLL (gsdll32.dll). Now in my application I want to charge users a monthly subscription so that I can provide them more…
objectWithoutClass
8 votes, 3 answers

How to find Blank Page in pdf file

I cannot detect a blank page in a PDF file. I have searched the internet for it but could not find a good solution. Using iTextSharp I tried with page size and XObjects, but they do not give an exact result. I tried if(xobjects==null || textcontent==null…
Md Kamruzzaman Sarker