I am trying to loop through a set of PDFs (all OCR'd) in a set of folders and search each one for key terms. If a PDF contains a certain term, I save the folder name, file name, and so on. The code works to an extent, but it misses a few PDFs that do contain the search terms. The reason is that when I read in a couple of the PDFs, some pages come out as gibberish (to me at least). For example, say I have read in a PDF named 'the_one.pdf'. It has 278 pages. When I search this document in Adobe Acrobat, I can find 'Search Term 1' on page 171, but when it is read with Python, Python outputs something like this:
-ˆ˜
%
˜%˝ˆ
,˙
˚
%.
%,˛#
%˜˚
0"
˚˝
%
˚˝ˆ˙)˛˚˜
˚0˛˚
:&;
#˛˘˘˙
˚%˚
"
%˚˛˘
ˆ
˛˚,˚
"
$%˚˚%
%
˝%.
"˛
"
%˜
˝,
-ˆ
%˘˙
˛˘˚
0"
"
˛
.˛˝
%˜˚
˝˜
.%
!˝ˆ%
4
0"
"
%˜˚
˛
%˛˘˘˙
!˝ˆ˜
%
˛ ˚˝ˆ˙)˛˚˜
˚0˛
!˝ˆ%
.˛˝˘˙8
˛˜
%
0"
"
˚
˛ #%˛%
"˛
˚ˆ˘˚
˛ ˛˚˛˝%
0"%ˆ
˛˙
!˝ˆ˛˘
%˜
%
%"
˚ˆ˝%
#
7
˘˛˘˙
:&;
˛˘˚%
˛˚,˚
"
$%˚˚%
%
˝%.
%
%˜
˝,
6
;˚
%˜
˛%
"
$%˚˚%
˚"%ˆ˘˜
˘˝˘˙
%
"˛
.˝˚
%
˚˛˜)˛˘%
/ˇ˚
˘˝˘˙
˝˘ˆ˜
˚˛˜)˛˘%
/ˇ˚
"˛
˛
#˚˜
˛˚
9$
˜˛˚
˜˛˘˚
:
"˚
˘
.˝˚
%
˚˛˜)˛˘%
/ˇ˚
˛
˜˜
%
˛˘˙
%
9$
˜˛˚
˜˛˘˚
"˛
˛
˜ˆ˛˘˘˙
#˚˜
˛˚
/ˇ˚
4˛˜
˚ˆ˝"
˚
˛
˛˘˚%
˛%˜
%
ˆ˚
˛˘
%˜˘˚8
7
9"˚
#%˛%˚
%.
˛,
˘˛˝
%
"
˘"%
ˆ
˝˛
˛˘˚%
˛,
ˆ˚
%.
˘˝%˝
˚˙˚˚
%
˚˝ˆ˙)˛˚˜
˚0˛
!˝ˆ%
.˛˝˘˚
&%
!˛˘
˛ ˛,
˛˝˛
˛˙
˚
%
%
%
%
/ˇ˚
˛ -ˆ˚
.%
-ˆ%˛%
4<
˝6
=8
.%
˛ ˚˝.˝
˚˝ˆ˙)˛˚˜
˚0˛
˛˜
˝
˛˝,
Of course, the majority of pages are displayed correctly, but for some reason a couple of them are not. For confidentiality reasons, I can't post the PDFs. Does anyone have any idea why this is happening?
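To narrow down which pages come back like this, a rough check along the following lines might help. It is only a heuristic sketch: the names `looks_garbled` and `garbled_pages` and the 0.5 threshold are made up for illustration, and it assumes the documents are mostly English text, so plain ASCII letters and digits should dominate a correctly extracted page.

```python
import string

# Characters expected to dominate a correctly extracted English page (assumption)
READABLE = set(string.ascii_letters + string.digits)

def looks_garbled(page_text, min_ratio=0.5):
    """Heuristic: True when too few non-whitespace characters are ASCII letters or digits."""
    visible = [ch for ch in page_text if not ch.isspace()]
    if not visible:
        return False  # an empty page is empty, not garbled
    readable = sum(ch in READABLE for ch in visible)
    return readable / len(visible) < min_ratio

def garbled_pages(pages, min_ratio=0.5):
    """Return the 1-based numbers of pages whose extracted text looks garbled."""
    return [number for number, text in enumerate(pages, start=1)
            if looks_garbled(text, min_ratio)]
```

Here `pages` would be a list with one extracted text string per page, so the suspicious page numbers can be compared against what Acrobat shows for the same pages.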
Also, anything you can point out to speed up my code or make it more dynamic would be helpful as well. I'm always looking to learn.
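As one possible direction, and only a sketch: searching page by page instead of on the whole document's text would also record which page matched, which makes it easier to compare against what Acrobat finds. The function name `find_terms` and the list-of-strings input are assumptions for illustration, and the terms are treated as literal text rather than regular expressions.

```python
def find_terms(pages, terms):
    """Return (page_number, term) pairs for every term found on every page.

    pages: list of extracted text strings, one per page (hypothetical input)
    terms: list of literal search terms, matched case-insensitively
    """
    hits = []
    for page_number, page_text in enumerate(pages, start=1):
        lowered = page_text.lower()  # lower-case once per page, not once per term
        for term in terms:
            if term.lower() in lowered:
                hits.append((page_number, term))
    return hits
```

A page that extracts as gibberish will still produce no hits, so this does not fix the extraction problem itself; it only shows where matches were and were not found.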
Best, J.Dykstra
import PyPDF2
import os
import re
import csv

pdf_location = r'PDF Directory'
x = ['Search term 1', 'Search term 2', 'Search term 3', 'etc..']
key_terms = []
rule = []
filenamey = []

for dirpath, dirnames, filenames in os.walk(pdf_location):
    for filename in filenames:
        if filename.endswith('.pdf'):
            # Extract the text of every page while the file is still open
            with open(os.path.join(dirpath, filename), 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)
                num_pages = pdfReader.numPages
                text = ""
                for count in range(num_pages):
                    pageObj = pdfReader.getPage(count)
                    text += pageObj.extractText()
            for i in x:
                # re.escape treats each term as literal text, not as a regex
                if re.search(re.escape(i), text, re.IGNORECASE):
                    # Assumes every folder path contains "Rule"
                    rulex = dirpath.split("Rule")[1]
                    filenamex = filename
                    key_termx = i  # the term that matched (was always x[0])
                    key_terms.append(key_termx)
                    rule.append(rulex)