
I'm trying to extract text from PDFs I've scraped off the internet, but when I run the extraction on the downloaded files I get this error:

File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfpage.py", line 124, in get_pages
    raise PDFTextExtractionNotAllowed('Text extraction is not allowed: %r' % fp)
PDFTextExtractionNotAllowed: Text extraction is not allowed: <cStringIO.StringO object at 0x7f79137a1ab0>

I've checked Stack Overflow, and someone else who had this error found that their PDFs were secured with a password. However, I'm able to open the PDFs in Preview on my Mac.

Someone mentioned that Preview may be able to open secured PDFs anyway, so I opened the files in Adobe Acrobat Reader as well and was still able to view them.

Here's an example from the site I'm downloading PDFs from: http://www.sophia-project.org/uploads/1/3/9/5/13955288/aristotle_firstprinciples.pdf

I discovered that if I open the PDF manually and re-export it as a PDF to the same filepath (basically replacing the original with a 'new' file), then I am able to extract text from it. I'm guessing it has something to do with how they are downloaded from the site. I'm simply using urllib to download the PDFs as follows:

if not os.path.isfile(filepath):
    print '\nDownloading pdf'
    urllib.urlretrieve(link, filepath)
else:
    print '\nFile {} already exists!'.format(title)

I also tried rewriting the file to a new filepath, but it still resulted in the same error.

if not os.path.isfile(filepath):
    print '\nDownloading pdf'
    urllib.urlretrieve(link, filepath)

    # Copy the downloaded PDF byte-for-byte to a new path.
    with open(filepath, 'rb') as f:
        new_filepath = re.split(r'\.', filepath)[0] + '_.pdf'
        with open(new_filepath, 'wb') as new_f:
            new_f.write(f.read())

    os.remove(filepath)
    filepath = new_filepath
else:
    print '\nFile {} already exists!'.format(title)

Lastly, here is the function I'm using to extract the text.

from cStringIO import StringIO  # or: from StringIO import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed


def convert(fname, pages=None):
    '''
    Get text from pdf
    '''
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = open(fname, 'rb')
    try:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    except PDFTextExtractionNotAllowed:
        print 'This pdf won\'t allow text extraction!'

    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()

    return text

Is there any way I can solve this programmatically rather than manually re-exporting the files in Preview?

Tyler Lazoen
  • This is most likely the same issue as the previous question. Most previewers will by convention open an encrypted/protected PDF with a blank user password. – dwarring Oct 11 '16 at 17:45
  • Thanks for adding an example PDF. It is encrypted and copy-protected: `pdfinfo.exe aristotle_firstprinciples.pdf | grep Encr` ... `Encrypted: yes (print:yes copy:no change:no addNotes:no algorithm:RC4)` – dwarring Oct 12 '16 at 00:31
  • @dwarring Thank you for the response! Is there any way to change this through programming or with the command line? – Tyler Lazoen Oct 12 '16 at 01:47
  • [Answer to Previous Question](http://stackoverflow.com/questions/28192977/how-to-unlock-a-secured-read-protected-pdf-in-python) suggests `qpdf`. Password is blank in this case: `qpdf --decrypt --password='' encrypted.pdf decrypted.pdf` – dwarring Oct 12 '16 at 03:06
  • Thanks so much!!!!!!! It worked! – Tyler Lazoen Oct 12 '16 at 04:46
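
For reference, here is a minimal sketch of applying the qpdf fix from the comments before calling the convert() function from the question. It assumes the qpdf command-line tool is installed and on the PATH; the file names are illustrative.

import subprocess

def decrypt_pdf(encrypted_path, decrypted_path):
    # qpdf rewrites the file without its (blank) user password,
    # so pdfminer no longer refuses to extract text from it.
    subprocess.check_call(['qpdf', '--decrypt', '--password=',
                           encrypted_path, decrypted_path])

decrypt_pdf('aristotle_firstprinciples.pdf', 'aristotle_firstprinciples_decrypted.pdf')
text = convert('aristotle_firstprinciples_decrypted.pdf')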

4 Answers


More recent versions of PDFMiner have a check_extractable parameter. You can pass it to the get_pages method:

fp = open(filename, 'rb')
PDFPage.get_pages(fp, check_extractable=False)
Josir

I also had this error. I solved it as shown below.

Before

for page in PDFPage.get_pages(fp, caching=caching, check_extractable=True):
    interpreter.process_page(page)

After

for page in PDFPage.get_pages(fp, caching=caching, check_extractable=False):
    interpreter.process_page(page)

Yu Wai

I also had the same error. I used the PyMuPDF package and it worked.
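
A minimal sketch of that approach, assuming a reasonably recent PyMuPDF is installed (older releases spell the method page.getText()); the file name is a placeholder:

import fitz  # PyMuPDF

def extract_text_pymupdf(path):
    # Concatenate the plain text of every page in the document.
    doc = fitz.open(path)
    text = ''.join(page.get_text() for page in doc)
    doc.close()
    return text

print(extract_text_pymupdf('aristotle_firstprinciples.pdf'))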

marcin

I have come across this error too, so I incorporated Tika running locally: if pdfminer fails to extract any data, I pass the file to Tika instead. It works fine.
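
A minimal sketch of that fallback pattern, assuming the tika-python package (which starts or connects to a local Tika server) and reusing the convert() function from the question:

from tika import parser  # tika-python

def extract_with_fallback(path):
    # Try pdfminer first via convert(); if it raises or returns
    # nothing useful, hand the file to Apache Tika instead.
    try:
        text = convert(path)
        if text.strip():
            return text
    except Exception:
        pass
    parsed = parser.from_file(path)
    return parsed.get('content') or ''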

Hairy