Python: Extract text from multiple pdf and paste on excel

Question

i'm a total new in python, could you help me correct this code?

I would like to add 2 things:

do the operation on multiple pdf and not just one and pasting the content in A2,A3 A4 and so on
if possible writing in the another row (B2,B3,B4) the name of the pdf file.

Thank you in advance, this is the code i'm working with

import PyPDF2
import openpyxl
pdfFileObj = open("file.pdf", 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages

pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()
wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext

wb.save('excel.xlsx')
print('DONE!!')

I've modified the code as suggested and the cycle seems to get all the pages! but maybe i have to work with "sheet[f'A{row}'].value = '\n'.join(output)" because it seems to print a lot of spaces


import PyPDF2
import openpyxl
import os
import glob
root_dir = "your directory"

filenames = []
# root_dir needs a trailing slash (i.e. /root/dir/)
for filename in glob.iglob(root_dir + '**/**', recursive=True):
    if filename.lower().endswith('.pdf'):
        filenames.append(os.path.join(directory, filename))
        

wb = openpyxl.load_workbook('excel.xlsx')#your file excel
sheet = wb.active
sheet.title = 'MyPDF'

for row, filename in enumerate(filenames, start=1):
    with open(filename, 'rb') as f:
        pdfReader = PyPDF2.PdfFileReader(f)
        count=pdfReader.numPages
        pageObj = pdfReader.getPage(0)
        mytext = pageObj.extractText()
        for i in range(count): 
            page = pdfReader.getPage(i)
            output = []
            output = page.extractText() 
            print(output)

    sheet[f'A{row}'].value = '\n'.join(output)
    sheet[f'B{row}'].value = filename

wb.save('excel.xlsx') #your file excel
print('DONE!!')

GordonAitchJay · Accepted Answer · 2022-06-14T10:25:19.540

0

You basically want to put the code you wrote which reads the pdf file into a for loop which iterates over the filenames (in this case, the filenames are stored as a tuple).

Using enumerate, row increments every iteration of the loop, and starts at 1. So the text and filename will be put into A1 and B1, then A2 and B2, and so on.

import PyPDF2
import openpyxl

filenames = ("file.pdf", 
             "file2.pdf", 
             "file3.pdf", 
            )

wb = openpyxl.load_workbook('excel.xlsx')
sheet = wb.active
sheet.title = 'MyPDF'

for row, filename in enumerate(filenames, start=1):
    with open(filename, 'rb') as f:
        pdfReader = PyPDF2.PdfFileReader(f)
        pdfReader.numPages
        pageObj = pdfReader.getPage(0)
        mytext = pageObj.extractText()
    
    sheet[f'A{row}'].value = mytext
    sheet[f'B{row}'].value = filename

wb.save('excel.xlsx')
print('DONE!!')

You can get a list of all the filenames ending in .pdf quite easily by iterating over all the files in a directory, and checking if the filename ends in .pdf. If it does, use os.path.join to give you the full filepath, and append it to the filenames list.

You could also use the glob module, too.

import os

filenames = []
directory = r"C:\Stuff\PDF Files"
for filename in os.listdir(directory):
    if filename.lower().endswith(".pdf"):
        filenames.append(os.path.join(directory, filename))

Updated code:

import PyPDF2
import openpyxl
import os
import glob
import re
import itertools

# Used to strip characters that can't be written to a spreadsheet
# See https://stackoverflow.com/a/93029/3589122
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

root_dir = 'your directory' # root_dir needs a trailing slash (i.e. /root/dir/)

filenames = (filename for filename in glob.iglob(root_dir + '/**/*.pdf', recursive=True))

wb = openpyxl.load_workbook('excel.xlsx') # your file excel
sheet = wb.active
sheet.title = 'MyPDF'

row = 1
for filename in filenames:
    with open(filename, 'rb') as f:
        try:
            pdfReader = PyPDF2.PdfFileReader(f)
            count = pdfReader.numPages
            
            output = []
            for i in range(count): 
                print(i, filename)
                page = pdfReader.getPage(i)
                output.append(page.extractText())
                #print(output)
        except Exception as e:
            print(f'Error: PyPDF2 could not read {filename}. Continuing... ({e})')
            continue
    
    sheet[f'A{row}'].value = '\n'.join(remove_control_chars(output))
    sheet[f'B{row}'].value = filename
    row += 1

wb.save('excel.xlsx') #your file excel
print('DONE!!')

edited Jun 14 '22 at 10:25

answered Jun 13 '22 at 10:32

GordonAitchJay

4,640
1
14
16

Thanks for your answer GordonAitchJay, there is a way to "automatically" take all pdfs in a folder without having to name them? something like *pdf to say everything that ends with pdf? This is why i wanted also to add the name of the file in another column – Gabry Jun 13 '22 at 11:24
You best believe it! See updated answer. – GordonAitchJay Jun 13 '22 at 11:35
First of all thank you for helping me! But the script return an empy file, if i print what's in mytext I see a part of the copied text so probably the problem is when trying to writing it down to excel? – Gabry Jun 13 '22 at 12:06
That's strange. I just tried it now, and it works for me. Try appending `.value` to `sheet[f'A{row}']` and `sheet[f'B{row}']`, so it's like `sheet[f'A{row}'].value`. See my edited answer. – GordonAitchJay Jun 13 '22 at 12:21
Sadly no, it doesnt paste anything i dont understand why the code i pasted work and why yours which is way better does not paste anything! – Gabry Jun 13 '22 at 12:35
Maybe is something related to the second part of the code? i dont use filenames = ("file.pdf", "file2.pdf", "file3.pdf", ) anymore because i'm using your second part of the script for saving them – Gabry Jun 13 '22 at 12:38
I edited my answer with the 2 bits of code combined. It worked for me. Just change the value of `directory`. – GordonAitchJay Jun 13 '22 at 12:46
worked like a charm! I dont want to bother you anymore so feel free not to answer :) I'm working on 2 things, glob as suggested (to take more pdf in subdirectories) and more important taking more page of the pdf. So what i'm trying is adding something like this for i in range(count): page = pdfReader.getPage(i) output.append(page.extractText()) print(output) – Gabry Jun 13 '22 at 13:03
No worries. Yeah that approach seems good - put all the text into a list. You could then chuck it in 1 cell with `sheet[f'A{row}'].value = '\n'.join(output)` or put each page in separate columns or rows. – GordonAitchJay Jun 13 '22 at 13:12
Thanks again! the for cycle worked and now the script gets all the pages. The strange effect is that in excel now every letter seems to have a spacing, so everything it's in vertical – Gabry Jun 13 '22 at 13:30
Okay, so if you did `'\n'.join(output)`, `output` here is supposed to be a `list` of strings (pages of text), but for you it's a `str`, so it's adding a newline between every character, instead of between every page of text. – GordonAitchJay Jun 13 '22 at 13:53
what do you suggest? also because thank to you i've modified again my script to catch all the pdf in subdirectories (glob worked perfeclty so thanks again) but if i try to process more than 6-7 pdf at once the notebook raise this error TypeError: a bytes-like object is required, not 'dict'. And it happens only if I try with more than 6 pdf ad once strange no? I'll edit my first post with the new script – Gabry Jun 13 '22 at 14:20
You need to move `output = []` outside of (just above) the loop `for i in range(count):`, then change `output = page.extractText()` to `output.append(page.extractText())`. As it currently is, only the last page of text is saved. You can use `glob` to directly filter `.pdf` files, instead of doing it manually with the `if` statement. Try my updated answer. – GordonAitchJay Jun 13 '22 at 15:13

score 0 · Answer 2 · answered Jun 13 '22 at 16:18

0

have you tried with more than 6/7 files? i get this error with 7 pdf


TypeError                                 Traceback (most recent call last)
<ipython-input-14-07fb0aa603b8> in <module>
     23         for i in range(count):
     24             page = pdfReader.getPage(i)
---> 25             output.append(page.extractText())
     26             print(output)
     27 

~\anaconda3\lib\site-packages\PyPDF2\_page.py in extractText(self, Tj_sep, TJ_sep)
   1283         """
   1284         deprecate_with_replacement("extractText", "extract_text")
-> 1285         return self.extract_text(Tj_sep=Tj_sep, TJ_sep=TJ_sep)
   1286 
   1287     mediabox = _create_rectangle_accessor(PG.MEDIABOX, ())

~\anaconda3\lib\site-packages\PyPDF2\_page.py in extract_text(self, Tj_sep, TJ_sep, space_width)
   1261         :return: a string object.
   1262         """
-> 1263         return self._extract_text(self, self.pdf, space_width, PG.CONTENTS)
   1264 
   1265     def extract_xform_text(

~\anaconda3\lib\site-packages\PyPDF2\_page.py in _extract_text(self, obj, pdf, space_width, content_key)
   1243                     text = ""
   1244             else:
-> 1245                 process_operation(operator, operands)
   1246         output += text  # just in case of
   1247         return output

~\anaconda3\lib\site-packages\PyPDF2\_page.py in process_operation(operator, operands)
   1195                 tm_matrix[5] -= TL
   1196             elif operator == b"Tj":
-> 1197                 text += operands[0].translate(cmap)
   1198             else:
   1199                 return None

TypeError: a bytes-like object is required, not 'dict'

answered Jun 13 '22 at 16:18

Gabry

33
1
5

1

That's caused by a particular file (i.e. the 6th file). `PyPDF2` is unable to extract the text, possible because the pdf file is corrupt, or because there is a bug in PyPDF2. Maybe try updating `PyPDF2`, or try another pdf library such as `PyMuPDF` - https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractText – GordonAitchJay Jun 13 '22 at 16:41
Failing that, you can just ignore the error with a try-except statement. – GordonAitchJay Jun 13 '22 at 16:43
I inserted the try-except statement as suggested but it just stops after the first 6-7 file (try: for row, filename in enumerate(filenames, start=1): with open(filename, 'rb') as f: pdfReader = PyPDF2.PdfFileReader(f) count = pdfReader.numPages output = [] for i in range(count): page = pdfReader.getPage(i) output.append(page.extractText()) print(output) except Exception: pass So i'll try with PyMuPDF but the syntax seems very different – Gabry Jun 14 '22 at 07:30
So PyPDF2 cannot read the 6th or 7th file. Is this file corrupt? Can you view the pdf in another application? Nevertheless, because you had the entire loop inside the `try` block, when an exception was raised, it completely exited the loop. That's why it stops after the 6th or 7th file. You should only include the line of code that raises the exception inside the `try` block. I've edited my answer, try that. Note I had to remove `enumerate` and manually set and update `row`, otherwise there would be a gap in the spreadsheet each time PyPDF2 can't read a file. – GordonAitchJay Jun 14 '22 at 08:09
Thanks for help as always. I've tried removing the 5th or 6th or 7th file but still the same error even with the try except (TypeError: a bytes-like object is required, not 'dict'). Maybe when it reach a certain level of text or \n's it stops? – Gabry Jun 14 '22 at 08:42
No problemo. So I tried it myself on 150 random pdf files I had and PyPDF2 failed to read some, and in some cases `extractText()` returns characters that can't be written to a spreadsheet. See my updated answer. – GordonAitchJay Jun 14 '22 at 10:22
It raises a new kind of error : # root_dir needs a trailing slash (i.e. /root/dir/) (which it is i'll paste it there ("C:/Users/gabri/Desktop/Testscript") Until now I just pointed to a folder on desktop which used to work. Does this means the pdf files should be in the same directory of python? – Gabry Jun 14 '22 at 12:22
Post the full traceback. No, the pdf files just need to be in root_dir or a subdirectory. – GordonAitchJay Jun 14 '22 at 12:25
Traceback (most recent call last) in 39 continue 40 ---> 41 sheet[f'A{row}'].value = '\n'.join(remove_control_chars(output)) 42 sheet[f'B{row}'].value = filename 43 row += 1 in remove_control_chars(s) 12 13 def remove_control_chars(s): ---> 14 return control_char_re.sub('', s) 15 16 root_dir = "C:/Users/gabri/Desktop/Testscript" # root_dir needs a trailing slash (i.e. /root/dir/) TypeError: expected string or bytes-like object – Gabry Jun 14 '22 at 12:26
it seems to try to open the same file : 0 C:/Users/gabri/Desktop/Testscript/90300_assignsubmission_file_\Infa_210518.pdf 1 C:/Users/gabri/Desktop/Testscript/90300_assignsubmission_file_\Infa_210518.pdf 2 C:/Users/gabri/Desktop/Testscript/I90300_assignsubmission_file_\Infa_210518.pdf – Gabry Jun 14 '22 at 12:36
Sorry, that error was my fault because I put the call to `remove_control_chars` in the wrong place. Change this line: `sheet[f'A{row}'].value = '\n'.join(remove_control_chars(output))` to this: `sheet[f'A{row}'].value = remove_control_chars('\n'.join(output))`. I've edited my answer. – GordonAitchJay Jun 14 '22 at 16:48
It opens the file once, and extracts the text for each page one at a time. The print statement prints the page number (starting at 0) and filename every new page. – GordonAitchJay Jun 14 '22 at 16:49
You are my hero and saviour! everything worked perfectly! I really envy your ability. I don't know how to thank you for the time you dedicated helping me – Gabry Jun 15 '22 at 09:04
haha it was quite the journey wasn't it? I'm just happy to help. Cheers – GordonAitchJay Jun 15 '22 at 09:39

Python: Extract text from multiple pdf and paste on excel

2 Answers2