Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
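As a rough illustration of reading text locations, the sketch below lists each text box with its bounding box. It assumes the pdfminer.six fork is installed and uses its `extract_pages` helper and `LTTextContainer` class; the original PDFMiner package may not offer the same entry points. The function names `iter_text_boxes` and `describe_box` are hypothetical helpers, not part of the library.

```python
from typing import Iterator, Tuple

# (x0, y0, x1, y1) in PDF points, measured from the bottom-left corner of the page.
BBox = Tuple[float, float, float, float]


def iter_text_boxes(pdf_path: str) -> Iterator[Tuple[int, BBox, str]]:
    """Yield (page_number, bbox, text) for every text container in a PDF.

    Assumption: pdfminer.six is installed. The import is done lazily so that
    this module can still be loaded on a machine without the library.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Skip figures, lines, rectangles and other non-text layout objects.
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def describe_box(page_number: int, bbox: BBox, text: str) -> str:
    """Format one text box as a single human-readable line (hypothetical helper)."""
    x0, y0, x1, y1 = bbox
    return f"page {page_number} ({x0:.1f}, {y0:.1f}, {x1:.1f}, {y1:.1f}): {text!r}"


# Usage, with "example.pdf" standing in for a real file:
# for page_number, bbox, text in iter_text_boxes("example.pdf"):
#     print(describe_box(page_number, bbox, text))
```

Keeping the formatting helper separate from the extraction loop means the coordinates can just as easily be stored or filtered instead of printed.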

Features

  • Written entirely in Python (for version 2.4 or newer).
  • Parse, analyze, and convert PDF documents.
  • PDF-1.7 specification support (well, almost).
  • CJK languages and vertical writing scripts support.
  • Various font types (Type1, TrueType, Type3, and CID) support.
  • Basic encryption (RC4) support.
  • PDF to HTML conversion (with a sample converter web app).
  • Outline (TOC) extraction.
  • Tagged contents extraction.
  • Reconstruct the original layout by grouping text chunks.
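To give a feel for what "grouping text chunks" can mean, here is a toy sketch that merges positioned chunks into lines by their vertical position. It is a simplified illustration only and is not PDFMiner's actual layout analysis; the `Chunk` tuple shape, the `group_into_lines` name, and the `y_tolerance` parameter are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical chunk shape: (x, y, text), where x is the left edge and y the
# vertical position in PDF points (y grows upward from the bottom of the page).
Chunk = Tuple[float, float, str]


def group_into_lines(chunks: List[Chunk], y_tolerance: float = 2.0) -> List[str]:
    """Group chunks with similar y positions into lines, ordered top to bottom."""
    lines: List[List[Chunk]] = []
    # Sort by descending y so that the topmost chunks come first.
    for chunk in sorted(chunks, key=lambda c: -c[1]):
        # Compare against the first chunk of the current line.
        if lines and abs(lines[-1][0][1] - chunk[1]) <= y_tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    # Within each line, order chunks left to right and join their text.
    return [
        " ".join(text for _, _, text in sorted(line, key=lambda c: c[0]))
        for line in lines
    ]


chunks = [
    (50.0, 700.0, "world"),
    (10.0, 700.5, "Hello"),
    (10.0, 680.0, "Second"),
    (60.0, 679.0, "line"),
]
print(group_into_lines(chunks))
```

A real layout analyzer also has to consider font size, character spacing, and columns; this sketch only shows the basic idea of clustering by position.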

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.

(source)

492 questions
114 votes, 6 answers

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have…
RattleyCooper
74 votes, 15 answers

How do I use pdfminer as a library

I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would…
jmeich
56 votes, 4 answers

How to extract text and text coordinates from a PDF file?

I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner. Many other Stack Overflow posts address how to extract all text in an ordered fashion, but how can I do the intermediate step of getting the text and text…
pnj
36 votes, 12 answers

How to check if PDF is scanned image or contains text

I have a large number of files, some of them are scanned images into PDF and some are full/partial text PDF. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…
Jinu Joseph
31 votes, 10 answers

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…
kramer65
23 votes, 1 answer

How does one obtain the location of text in a PDF with PDFMiner?

PDFMiner's documentation says: "PDFMiner allows one to obtain the exact location of text in a page." However, I have not been able to find how to do this. PDFMiner's 'documentation' is rather sparse, so I have not understood how to do this.
technillogue
21 votes, 5 answers

Pdfminer python 3.5

I have followed a few tutorials, but I am not able to get this code block to run. I made the necessary switches from StringIO to BytesIO (I believe?). I am unsure why 'banana' is printing nothing; I think the errors might be red herrings? Is it…
gary
20 votes, 1 answer

Highlight text in a PDF with Python

I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now,…
Katharsis
18 votes, 7 answers

ImportError: cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant'

Traceback (most recent call last): File "g:\mydrive\ \pdftotext_pdfminer.py", line 3, in from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter File "C:\Users\ \anaconda3\envs\…
Lena.J
16 votes, 7 answers

PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this question, but they are…
aristotll
15 votes, 4 answers

PDFminer: PDFTextExtractionNotAllowed Error

I'm trying to extract text from pdfs I've scraped off the internet, but when I attempt to download them I get the error: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…
Tyler Lazoen
15 votes, 6 answers

Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it…
Randomly Named User
15 votes, 2 answers

Extract text per page with Python pdfMiner?

I have experimented with both pypdf and pdfMiner to extract text from PDF files. I have some unfriendly PDFs that only pdfMiner is able to extract successfully. I am using the code here to extract text for the entire file. However, I would really…
user1642513
13 votes, 1 answer

Import error : cannot import name 'open_filename' from 'pdfminer.utils'

On importing pdfminer.high_level, I am getting the error cannot import name open_filename from pdfminer.utils. I tried the following steps: pip3 install pdfminer.six import pdfminer import pdfminer.high_level (and encountered error on this…
Neha Narang
12 votes, 1 answer

ModuleNotFoundError: No module named 'pdfminer.high_level'

I am working on a project in PyCharm and would like to use pdfminer to convert a PDF file to a text file. My problem is that when I run the app, it doesn't work and displays this error message: ModuleNotFoundError: No module named…
oran ben david