3

I am trying to extract text from a PDF using the pdfminer.six library, which I have already installed in my virtual environment. Here is my code:

import pdfminer as miner

text = miner.high_level.extract_text('file.pdf')


print(text)  

But when I run the code with python pdfreader.py, I get the following error:

Traceback (most recent call last):
  File ".\pdfreader.py", line 9, in <module>
    text = miner.high_level.extract_text('pdfBulletins/corona1.pdf')
AttributeError: module 'pdfminer' has no attribute 'high_level'  

I suspect it has something to do with the Python path, because I installed pdfminer inside my virtual environment, but I see that this installed pdf2txt.py outside in my system python install. Is this behavior normal? I mean something that happens inside my venv should not alter my system Python installation.

I successfully extracted the text using the pdf2txt.py utility that comes with the pdfminer.six library (from the command line, using the system Python install), but not from the code inside my venv project. My pdfminer.six version is 20201018.

What could be the problem with my code?

Red
mounaim
  • 1
    Does this answer help? https://stackoverflow.com/a/26495057/14316282 – Rolv Apneseth Nov 09 '20 at 23:54
  • 1
    @RolvApneseth tried the code there, does not work, I am suspecting it has something to do with Python path, because I installed pdfminer inside my virtual environment, but I see that this installed pdf2txt.py outside in my system python install, is this behaviour normal ? I mean something that happens inside my venv should not alter my system python installation – mounaim Nov 10 '20 at 13:58
  • 1
    That behaviour is certainly not normal. Are any other modules you installed installing on the system rather than in the virtual environment? – Rolv Apneseth Nov 10 '20 at 15:37

5 Answers

3

The code below uses pdfminer.six. It imports extract_text directly from pdfminer.high_level and spells out the optional parameters explicitly (the values shown are the library defaults). It extracts the text from my PDF files.

from pdfminer.high_level import extract_text

# Open in binary mode; the context manager closes the file afterwards.
with open('my_file.pdf', 'rb') as pdf_file:
    text = extract_text(pdf_file, password='', page_numbers=None, maxpages=0, caching=True, codec='utf-8', laparams=None)
print(text)


Life is complex
2

Your problem is trying to use a function from a module you have not imported. Importing pdfminer does NOT automatically also import pdfminer.high_level.

This works:

from pdfminer.high_level import extract_text

text = extract_text('file.pdf')

print(text)
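
The import behaviour described above can be reproduced without pdfminer at all. The following is only a minimal sketch: demo_pkg and its high_level submodule are hypothetical names, created in a temporary directory purely for illustration, and the package is assumed to have an empty __init__.py that does not import its submodules itself.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a throwaway package 'demo_pkg' with one submodule, 'high_level'."""
    pkg = root / "demo_pkg"
    pkg.mkdir()
    # Empty __init__.py: the package does not import its own submodules.
    (pkg / "__init__.py").write_text("")
    (pkg / "high_level.py").write_text(
        "def extract_text(path):\n    return 'text from ' + path\n"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_pkg

        # Importing the package alone does not load the submodule.
        before = hasattr(demo_pkg, "high_level")

        import demo_pkg.high_level

        # After the explicit submodule import, the attribute exists.
        after = hasattr(demo_pkg, "high_level")
    finally:
        sys.path.remove(tmp)

print("attribute present after 'import demo_pkg':", before)
print("attribute present after 'import demo_pkg.high_level':", after)
```

Some packages do import their submodules in __init__.py, which is why the plain import sometimes appears to work elsewhere; it should not be relied on unless the package documents it.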
traal
0

Try pdfreader to extract text (both plain strings and text content that still contains PDF operators) from a PDF document.

Here is sample code that extracts both from every page of the document.

from pdfreader import SimplePDFViewer, PageDoesNotExist

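your_pdf_file_name = "file.pdf"  # path to your PDF

plain_text = ""
pdf_markdown = ""
# The context manager closes the file once all pages are processed.
with open(your_pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()
            pdf_markdown += viewer.canvas.text_content
            plain_text += "".join(viewer.canvas.strings)
            viewer.next()
    except PageDoesNotExist:
        # Stepping past the last page ends the loop.
        pass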

Maksym Polshcha
-1

You'll need to install pdfminer.six instead of just pdfminer:

pip install pdfminer.six

Only after that, you can import extract_text as:

from pdfminer.high_level import extract_text
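
If the import still fails after installing, it may help to check which interpreter is running the script and where it finds pdfminer. The sketch below is one possible check, not a definitive diagnosis; describe_environment is a hypothetical helper name, and the virtual-environment test assumes an environment created with venv or a recent virtualenv.

```python
import importlib.util
import sys


def describe_environment(package: str = "pdfminer") -> dict:
    """Report the running interpreter and where `package` would be loaded from."""
    spec = importlib.util.find_spec(package)
    return {
        # Path of the Python executable running this script.
        "interpreter": sys.executable,
        # In a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        # None means the package is not importable from this interpreter.
        "package_location": spec.origin if spec is not None else None,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Running the install as python -m pip install pdfminer.six with the same interpreter that runs the script avoids installing into a different Python than the one being used.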
Red
-1

Problem in my case

Both pdfminer and pdfminer.six were installed, so from pdfminer.high_level import extract_text then tried to use the wrong package.

Solution

For me uninstalling pdfminer worked:

pip uninstall pdfminer

Now you should have only pdfminer.six installed and should be able to import extract_text.
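
To see whether both distributions are present before uninstalling anything, something like the following sketch could be used. It relies on importlib.metadata from the standard library (Python 3.8+); installed_pdfminer_distributions is a hypothetical helper name, and matching on the "pdfminer" name prefix is an assumption that may also pick up unrelated forks.

```python
from importlib import metadata


def installed_pdfminer_distributions() -> list:
    """Return (name, version) pairs for installed distributions named pdfminer*."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith("pdfminer"):
            found.add((name, dist.metadata.get("Version") or "unknown"))
    return sorted(found)


for name, version in installed_pdfminer_distributions():
    print(f"{name}=={version}")
```

If both pdfminer and pdfminer.six appear in the output, the two share the same import name, which matches the conflict described above.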

Jacob-Jan Mosselman