Questions tagged [pdfminer]

A Python-based tool for extracting information from PDF documents.

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.
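
As a rough illustration of obtaining text locations, the sketch below assumes the maintained fork pdfminer.six is installed (the original PDFMiner described here has a different API). It relies on that fork's `extract_pages` helper and `LTTextContainer` layout class; the function name `iter_text_boxes` and the file name in the usage comment are hypothetical.

```python
def iter_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for each text container in a PDF.

    Assumes pdfminer.six is installed; the import is done lazily so the
    function can be defined even where the package is missing.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            # Only text containers carry text; figures, lines, etc. are skipped.
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, with the origin
                # at the bottom-left corner of the page.
                yield page_number, element.bbox, element.get_text()


# Hypothetical usage:
# for page, bbox, text in iter_text_boxes("example.pdf"):
#     print(page, bbox, text.strip())
```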

Features

Written entirely in Python. (for version 2.4 or newer)
Parse, analyze, and convert PDF documents.
PDF-1.7 specification support. (well, almost)
CJK languages and vertical writing scripts support.
Various font types (Type1, TrueType, Type3, and CID) support.
Basic encryption (RC4) support.
PDF to HTML conversion (with a sample converter web app).
Outline (TOC) extraction.
Tagged contents extraction.
Reconstruct the original layout by grouping text chunks.
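
To give a feel for what "grouping text chunks" can mean, here is a minimal standard-library sketch that merges positioned chunks into lines when their baselines are close. It is not PDFMiner's actual layout analysis, which is more involved; the function name `group_chunks_into_lines`, the `(x0, y0, text)` tuple shape, and the `y_tolerance` parameter are assumptions made for illustration.

```python
def group_chunks_into_lines(chunks, y_tolerance=2.0):
    """Group (x0, y0, text) chunks into lines of text.

    PDF coordinates grow upward, so chunks are visited from the largest
    y (top of the page) to the smallest. A chunk joins the current line
    when its y is within y_tolerance of that line's first chunk.
    """
    lines = []  # each entry: (reference_y, [(x0, text), ...])
    for x0, y0, text in sorted(chunks, key=lambda c: -c[1]):
        if lines and abs(lines[-1][0] - y0) <= y_tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append((y0, [(x0, text)]))
    # Within a line, order chunks left to right and join with spaces.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


chunks = [
    (72.0, 700.0, "Hello"),
    (120.0, 700.5, "world"),
    (72.0, 680.0, "Second"),
    (130.0, 679.2, "line"),
]
print(group_chunks_into_lines(chunks))
```

A fixed tolerance is the simplest possible choice; real documents with mixed font sizes would likely need a tolerance derived from the text height.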

PDFMiner is about 20 times slower than other C/C++-based counterparts such as XPdf.

492 questions

114

votes

6 answers

Extracting text from a PDF file using PDFMiner in python?

I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API and all the relevant examples I have found contain outdated code (classes and methods have…

asked Oct 21 '14 at 18:56

RattleyCooper

votes

15 answers

How do I use pdfminer as a library

I am trying to get text data from a pdf using pdfminer. I am able to extract this data to a .txt file successfully with the pdfminer command line tool pdf2txt.py. I currently do this and then use a python script to clean up the .txt file. I would…

python pdf pdfminer

asked Apr 20 '11 at 03:50

jmeich

votes

4 answers

How to extract text and text coordinates from a PDF file?

I want to extract all the text boxes and text box coordinates from a PDF file with PDFMiner. Many other Stack Overflow posts address how to extract all text in an ordered fashion, but how can I do the intermediate step of getting the text and text…

python pdf pdfminer

asked Apr 06 '14 at 18:31

pnj

votes

12 answers

How to check if PDF is scanned image or contains text

I have a large number of files, some of which are scanned images saved as PDFs and some of which are full/partial-text PDFs. Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial…

python python-3.x pypdf pdfminer pdf-extraction

asked Apr 16 '19 at 08:54

Jinu Joseph

votes

10 answers

How to unlock a "secured" (read-protected) PDF in Python?

In Python I'm using pdfminer to read the text from a pdf with the code below this message. I now get an error message saying: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…

python pdf pdfminer pdf-scraping

asked Jan 28 '15 at 13:02

kramer65

votes

1 answer

How does one obtain the location of text in a PDF with PDFMiner?

PDFMiner's documentation says: "PDFMiner allows one to obtain the exact location of text in a page." However, I have not been able to find how to do this. PDFMiner's 'documentation' is rather sparse, so I have not understood how to do this.

python pdf position pdfminer

asked Aug 11 '14 at 16:35

technillogue

votes

5 answers

Pdfminer python 3.5

I have followed a few tutorials, but I am not able to get this code block to run. I made the necessary switches from StringIO to BytesIO (I believe?). I am unsure why 'banana' is printing nothing; I think the errors might be red herrings? Is it…

python-3.x pdf text extract pdfminer

asked Oct 04 '16 at 14:24

gary

votes

1 answer

Highlight text in a PDF with Python

I'm working on a custom search engine for my PDF data corpus. I have a transformation layer which is able to dump PDF content to text (using Apache Tika and GROBID). I have finished the search layers and the view which returns the search results listing. Now,…

python pdf search pypdf pdfminer

asked Oct 27 '16 at 15:18

Katharsis

votes

7 answers

ImportError: cannot import name 'COMMON_SAFE_ASCII_CHARACTERS' from 'charset_normalizer.constant'

Traceback (most recent call last): File "g:\mydrive\ \pdftotext_pdfminer.py", line 3, in from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter File "C:\Users\ \anaconda3\envs\…

python importerror pdfminer

asked Nov 22 '22 at 15:47

Lena.J

votes

7 answers

PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library, and I found this question, but they are…

python text-extraction pdfminer

asked Jan 05 '16 at 07:33

aristotll

votes

4 answers

PDFminer: PDFTextExtractionNotAllowed Error

I'm trying to extract text from pdfs I've scraped off the internet, but when I attempt to download them I get the error: File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages raise…

python pdf text nlp pdfminer

asked Oct 11 '16 at 16:18

Tyler Lazoen

votes

6 answers

Extract hyperlinks from PDF in Python

I have a PDF document with a few hyperlinks in it, and I need to extract all the text from the pdf. I have used the PDFMiner library and code from http://www.endlesslycurious.com/2012/06/13/scraping-pdf-with-python/ to extract text. However, it…

python pdf hyperlink pypdf pdfminer

asked Jan 02 '15 at 15:08

Randomly Named User

votes

2 answers

Extract text per page with Python pdfMiner?

I have experimented with both pypdf and pdfMiner to extract text from PDF files. I have some unfriendly PDFs that only pdfMiner is able to extract successfully. I am using the code here to extract text for the entire file. However, I would really…

python pdf pdfminer

asked Sep 26 '12 at 15:24

user1642513

votes

1 answer

Import error : cannot import name 'open_filename' from 'pdfminer.utils'

On importing pdfminer.high_level, I am getting the error "cannot import name open_filename from pdfminer.utils". I tried the following steps: pip3 install pdfminer.six; import pdfminer; import pdfminer.high_level (and encountered error on this…

python pdfminer

asked Apr 07 '21 at 17:11

Neha Narang

votes

1 answer

ModuleNotFoundError: No module named 'pdfminer.high_level'

I am working on a project in PyCharm, and I'd like to use pdfminer to convert a PDF file to a text file. My problem is that when I run the app it doesn't work and it displays this error message: ModuleNotFoundError: No module named…

python pdfminer

asked Sep 23 '22 at 18:02

oran ben david
