Trouble using pdfminer.six while parsing pdf files

Question

I am trying to extract text from a PDF using pdfminer.six. I followed the code below, as mentioned here:

import pdfminer
import io

def extract_raw_text(pdf_filename):
    output = io.StringIO()
    laparams = pdfminer.layout.LAParams()

    with open(pdf_filename, "rb") as pdffile:
        pdfminer.high_level.extract_text_to_fp(pdffile, output, laparams=laparams)

    return output.getvalue()

print(extract_raw_text('simple1.pdf'))

But it produces an error:

Traceback (most recent call last):
  File "extract.py", line 13, in <module>
    print(extract_raw_text('simple1.pdf'))
  File "extract.py", line 6, in extract_raw_text
    laparams = pdfminer.layout.LAParams()
AttributeError: module 'pdfminer' has no attribute 'layout'

I simply want to extract the entire text from a PDF. Any help would be appreciated.

score 3 · Answer 1 · answered Apr 09 '19 at 16:15

I was having the same issue! A plain import pdfminer only runs the package's __init__.py, and Python does not import a package's submodules (such as layout and high_level) automatically, so they never show up as attributes of pdfminer.

So all you need to do is import the specific modules directly, which you can do in three ways:
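To see this behaviour without involving pdfminer at all, here is a minimal sketch that builds a throwaway package on disk. The package name demo_pkg_for_import_check and its layout submodule are made up purely for illustration; the point is only that the submodule becomes an attribute of the package after it has been imported explicitly.

```python
import importlib
import sys
import tempfile
from pathlib import Path

PKG = "demo_pkg_for_import_check"  # hypothetical package name, illustration only


def demo_submodule_import():
    """Return (has_layout_before, has_layout_after) for a throwaway package."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg_dir = Path(tmp) / PKG
        pkg_dir.mkdir()
        # An empty __init__.py: the package imports none of its submodules.
        (pkg_dir / "__init__.py").write_text("")
        (pkg_dir / "layout.py").write_text("VALUE = 42\n")

        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            pkg = importlib.import_module(PKG)
            before = hasattr(pkg, "layout")  # submodule not loaded yet
            importlib.import_module(PKG + ".layout")
            after = hasattr(pkg, "layout")   # now bound on the package
        finally:
            # Clean up so the demo leaves no trace in the interpreter.
            sys.path.remove(tmp)
            sys.modules.pop(PKG + ".layout", None)
            sys.modules.pop(PKG, None)
    return before, after


print(demo_submodule_import())  # (False, True)
```

The first value corresponds to the situation in the question (AttributeError on pdfminer.layout), the second to the situation after importing the submodule by name.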

Inside your code (whole module)

Instead of using import pdfminer, import the specific modules you'd like to use:

import pdfminer.layout
import pdfminer.high_level

This way you can access all of the modules' classes directly, as you did in:
laparams = pdfminer.layout.LAParams()

Inside your code (specific classes / functions)

The same logic applies, but here we only select the specific classes and functions we want from each module (in your case, LAParams and extract_text_to_fp).

So you'd do:

from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp

On the module itself (to fix for every use)

This fixes it for every script at once, but it is not ideal, since you will lose the change each time you update the package. It can still be useful if you use this module a lot.

1. Find your site-packages location. Run python -m site in your terminal and you'll see all the paths. Look for the one that ends like this: ...lib/python3.6/site-packages
2. Find the pdfminer package there, open its folder, and open the __init__.py file.
3. Add import lines for all the modules you'd like to have pre-loaded, such as:

import pdfminer.layout
import pdfminer.high_level

Now, every time you use import pdfminer, those modules will be pre-loaded as well, so you could run your code as you wrote above, and it'll work.
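If hunting through the python -m site output is tedious, the standard library can report where a package's __init__.py lives. This is only a sketch: find_package_init is a hypothetical helper name, and it just prints the path, leaving any editing to you.

```python
import importlib.util
from pathlib import Path


def find_package_init(package_name):
    """Return the path of a package's __init__.py, or None if there is none."""
    spec = importlib.util.find_spec(package_name)
    # No spec: nothing by that name is installed.
    # No search locations: it is a plain module, not a package.
    if spec is None or spec.submodule_search_locations is None:
        return None
    # Namespace packages have no __init__.py, so no usable origin.
    if not spec.origin or not spec.origin.endswith("__init__.py"):
        return None
    return Path(spec.origin)


# Prints the file you would edit, or None if pdfminer is not installed.
print(find_package_init("pdfminer"))
```

The same helper works for any regular package, which makes it easy to check that you are about to edit the copy your interpreter actually imports.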

score 0 · Answer 2 · answered Oct 30 '20 at 14:12

This solved the problem for me.

pip install --upgrade camelot-py

druskacik

2,176
2
13
26
