I want to extract text from a PDF file. I tried:

directory = r'C:\Users\foo\folder'

for x in os.listdir(directory):
    print(x)
    x = x.replace('.pdf','')
    filename = os.fsdecode(x)
    print(x)

    if filename.endswith('.pdf'):
        with pdfplumber.open(x) as pdf1:
            page1 = pdf1.pages[0]
            text1 = page1.extract_text()
            print(text1)

and it printed:

20170213091544343.pdf
20170213091544343

Seeing the file has a name of 20170213091544343, I added:


    else:
        with pdfplumber.open(x) as pdf1:
                page1 = pdf1.pages[0]
                text1 = page1.extract_text()
                print(text1)
            

to read the file in case the file name doesn't end in .pdf, and it raised an error:


20170213091544343.pdf
20170213091544343
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-34-e370b214f9ba> in <module>
     16 
     17     else:
---> 18         with pdfplumber.open(x) as pdf1:
     19                 page1 = pdf1.pages[0]
     20                 text1 = page1.extract_text()

C:\Python38\lib\site-packages\pdfplumber\pdf.py in open(cls, path_or_fp, **kwargs)
     56     def open(cls, path_or_fp, **kwargs):
     57         if isinstance(path_or_fp, (str, pathlib.Path)):
---> 58             fp = open(path_or_fp, "rb")
     59             inst = cls(fp, **kwargs)
     60             inst.close = fp.close

FileNotFoundError: [Errno 2] No such file or directory: '20170213091544343'
nilsinelabore

1 Answer

os.listdir() returns only the file names, so you have to join each one with the directory:

for filename in os.listdir(directory):

    fullpath = os.path.join(directory, filename)

    #print(fullpath)

And you have to keep the .pdf extension:

import os
import pdfplumber

directory = r'C:\Users\foo\folder'

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):

        fullpath = os.path.join(directory, filename)
        #print(fullpath)

        #all_text = ""

        with pdfplumber.open(fullpath) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                print(text)
                #all_text += text

        #print(all_text)

Or, with page numbers:

        with pdfplumber.open(fullpath) as pdf:
            for number, page in enumerate(pdf.pages, 1):
                print('--- page', number, '---')
                text = page.extract_text()
                print(text)
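
The path-joining step can also be isolated and checked on its own, without pdfplumber. This is only a minimal sketch: the helper name `pdf_paths`, the sorting, and the case-insensitive extension check are my assumptions, not part of the code above.

```python
import os


def pdf_paths(directory):
    """Return full paths of the entries in `directory` whose names end in .pdf."""
    paths = []
    # sorted() is an assumption: os.listdir() returns names in arbitrary order.
    for filename in sorted(os.listdir(directory)):
        # Keep the extension: the name on disk includes ".pdf".
        # lower() also accepts ".PDF" (an assumption, drop it if unwanted).
        if filename.lower().endswith(".pdf"):
            # os.listdir() yields bare names, so prepend the directory.
            paths.append(os.path.join(directory, filename))
    return paths
```

Each returned path can then be passed to `pdfplumber.open()` as in the loop above.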
furas
  • Hi furas, thanks for your answer. But it returned `None` even though there is clearly content on the page? – nilsinelabore Jun 22 '21 at 04:38
  • It can mean there is no text on the page. The page may consist entirely of images. – furas Jun 22 '21 at 04:39
  • I see. Is it possible to extract text from an image? – nilsinelabore Jun 22 '21 at 04:40
  • It would need an `OCR` program, such as `tesseract` (whose development was sponsored by Google), and the `pytesseract` module to convert the image to text. – furas Jun 22 '21 at 04:41
  • I see. I tested this solution on another PDF file, but it seems to produce text from a different file. Any thoughts on what might be going wrong? – nilsinelabore Jun 22 '21 at 04:44
  • BTW: you check only the first page. If the text is on later pages, you should use a `for`-loop. – furas Jun 22 '21 at 04:44
  • Check `print(fullpath)` and open that path in any PDF viewer to confirm you are really opening the correct file. Maybe it gives different text than you expect because you are opening a different file. – furas Jun 22 '21 at 04:46
  • I added a `for`-loop to the answer. – furas Jun 22 '21 at 04:47
  • Indeed, it printed a different fullpath, although I've specified the file name I want. Oddly, it seems to be the first file returned by `os.listdir(directory)`; the directory in my case is `Downloads`, which has a myriad of files. So I changed `if filename.endswith('.pdf'):` into `if filename.endswith('the_complete_name_of_the_file_i_want.pdf'):` and it seems to work. – nilsinelabore Jun 22 '21 at 04:55
  • If you want to use `'the_complete_name_of_the_file_i_want.pdf'`, then you don't need `endswith()` or `os.listdir()`; open it directly with `open('directory/the_complete_name_of_the_file_i_want.pdf')`. – furas Jun 22 '21 at 05:00
  • Thanks furas, I searched for some `pytesseract` examples as you suggested, such as [this one](https://stackoverflow.com/a/56292713/11901732), but could never get `ImageMagick` installed. Is `ImageMagick` necessary if I want to extract text from an image in a PDF? – nilsinelabore Jun 22 '21 at 05:35
  • I don't remember if this needs `ImageMagick`. First install `tesseract` from Google and test it without Python, directly in the console: `tesseract.exe file.pdf` – furas Jun 22 '21 at 06:20
  • It may need `ImageMagick` to convert every page of the `pdf` into an image, because `tesseract` may need images. On ImageMagick's page you can see many [installers for Windows](https://imagemagick.org/script/download.php#windows). – furas Jun 22 '21 at 06:26
  • Thank you for the advice, I'll have a look. – nilsinelabore Jun 22 '21 at 09:20
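
For the `None` case discussed in the comments, one option is to skip pages that yield no text and record their page numbers, so the commented-out `all_text += text` idea does not fail on `None`. The following is a rough sketch, not a tested recipe for every PDF: `collect_text` and `read_pdf` are hypothetical helper names, and only `pdfplumber.open()`, `pdf.pages`, and `page.extract_text()` come from the answer.

```python
def collect_text(pages):
    """Join the text of each page, skipping pages that yield no text.

    Returns (all_text, empty_page_numbers); page numbers start at 1.
    """
    parts = []
    empty = []
    for number, page in enumerate(pages, 1):
        text = page.extract_text()
        if text:
            parts.append(text)
        else:
            # None or "": possibly an image-only page that would need OCR.
            empty.append(number)
    return "\n".join(parts), empty


def read_pdf(fullpath):
    """Open one PDF with pdfplumber and collect its text (hypothetical helper)."""
    # Third-party import kept local so collect_text() has no dependency.
    import pdfplumber

    with pdfplumber.open(fullpath) as pdf:
        return collect_text(pdf.pages)
```

Calling `read_pdf(fullpath)` returns the combined text together with the list of pages that had none; those pages are the candidates for an OCR tool such as `tesseract`.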