I am trying to loop through a set of PDFs (all OCR'd) in a set of folders and search each one for key terms. If a PDF contains a certain term, I save the folder name, file name, etc. The code works to an extent, but it misses a few PDFs that do contain the search terms. The reason is that when I read in a couple of the PDFs, some pages come out as gibberish (to me at least). For example, say I read in a PDF named 'the_one.pdf'. It has 278 pages. When I search this document in Adobe Acrobat, I can find 'Search Term 1' on page 171, but when that page is read with Python, the output looks something like this:

 -ˆ˜
 %
 ˜%˝ˆ
 ,˙
 ˚
 %.
 %,˛#
 %˜˚
 0"
 ˚˝
 %
 ˚˝ˆ˙)˛˚˜
 ˚0˛˚
 :&;
 #˛˘˘˙
 ˚%˚
 "
 %˚˛˘
 ˆ
 ˛˚,˚
 "
 $%˚˚%
 %
 ˝%.
 "˛
 "
 %˜
 ˝,
 -ˆ
 %˘˙
 ˛˘˚
 0"
 "
 ˛

 .˛˝
 %˜˚
 ˝˜
 .%
 !˝ˆ%
 4
 0"
 "
 %˜˚
 ˛
 %˛˘˘˙
 !˝ˆ˜
 %
 ˛ ˚˝ˆ˙)˛˚˜
 ˚0˛
 !˝ˆ%
 .˛˝˘˙8
 ˛˜
 %
 0"
 "
 ˚
 ˛ #%˛%
 "˛
 ˚ˆ˘˚

 ˛ ˛˚˛˝%
 0"%ˆ
 ˛˙
 !˝ˆ˛˘
 %˜
 %
 %"
 ˚ˆ˝%
 #
7
 ˘˛˘˙
 :&;
 ˛˘˚%
 ˛˚,˚
 "
 $%˚˚%
 %
 ˝%.
 %
 %˜
 ˝,
 6
 ;˚
 %˜
 ˛%
 "
 $%˚˚%
 ˚"%ˆ˘˜
 ˘˝˘˙
 %
 "˛
 .˝˚
 %
 ˚˛˜)˛˘%
 /ˇ˚
 ˘˝˘˙
 ˝˘ˆ˜
 ˚˛˜)˛˘%
 /ˇ˚
 "˛
 ˛
 #˚˜
 ˛˚
 9$
 ˜˛˚
 ˜˛˘˚
 :
 "˚
 ˘
 .˝˚
 %
 ˚˛˜)˛˘%
 /ˇ˚
 ˛
 ˜˜
 %
 ˛˘˙
 %
 9$
 ˜˛˚
 ˜˛˘˚
 "˛
 ˛
 ˜ˆ˛˘˘˙
 #˚˜
 ˛˚
 /ˇ˚
 4˛˜
 ˚ˆ˝"
 ˚
 ˛
 ˛˘˚%
 ˛%˜
 %
 ˆ˚
 ˛˘
 %˜˘˚8
 7
 9"˚
 #%˛%˚
 %.
 ˛,
 ˘˛˝
 %
 "
 ˘"%
 ˆ
 ˝˛
 ˛˘˚%
 ˛,
 ˆ˚
 %.
 ˘˝%˝
 ˚˙˚˚
 %

 ˚˝ˆ˙)˛˚˜
 ˚0˛
 !˝ˆ%
 .˛˝˘˚
 &%
 !˛˘
 ˛ ˛,
 ˛˝˛
 ˛˙
 ˚
 %
 %
 %
 %
 /ˇ˚
 ˛ -ˆ˚
 .%
 -ˆ%˛%
 4<
˝6
=8
 .%
 ˛ ˚˝.˝
 ˚˝ˆ˙)˛˚˜
 ˚0˛
 ˛˜
 ˝
 ˛˝,  

Of course, it extracts the majority of pages correctly, but for some reason a couple of them come out like this. For confidentiality reasons, I can't post the PDFs. Does anyone have any idea why this is happening?

Also, anything you can point out to speed up my code or make it more dynamic is helpful as well. Always looking to learn.

Best, J.Dykstra

import os
import re
import csv

import PyPDF2

pdf_location = r'PDF Directory'
search_terms = ['Search term 1', 'Search term 2', 'Search term 3', 'etc..']

key_terms = []
rule = []
filenamey = []

for dirpath, dirnames, filenames in os.walk(pdf_location):
    for filename in filenames:
        if filename.endswith('.pdf'):
            # Open in a with-block so the file handle is closed after reading
            with open(os.path.join(dirpath, filename), 'rb') as pdfFileObj:
                pdfReader = PyPDF2.PdfFileReader(pdfFileObj, strict=False)

                # Concatenate the extracted text of every page
                text = ""
                for count in range(pdfReader.numPages):
                    text += pdfReader.getPage(count).extractText()

            for term in search_terms:
                # re.escape makes the term match literally rather than as a regex
                if re.search(re.escape(term), text, re.IGNORECASE):
                    # Folder names are assumed to contain "Rule"
                    rulex = dirpath.split("Rule")[1]

                    key_terms.append(term)
                    rule.append(rulex)
                    filenamey.append(filename)

1 Answer

Parsing PDF is a complex task: the PDF 1.7 spec runs to around 750 pages, and Adobe makes money with it, which is why it works for them.

PDFs internally have tables that hold

  • "how letters look" (glyphs)
  • "what Unicode letters those glyphs are mapped to" (you need that to copy and paste something from a PDF correctly)

and a cross-reference of which glyph maps to which Unicode character. Fonts might also be (partly) embedded in the PDF.

That's (one reason) why PDFs can look 100% OK and could be "OCR"ed OK, but if you just copy and paste from a document that has a corrupt mapping between glyphs and Unicode code points, you only get gibberish.
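To make the glyph-to-Unicode idea concrete, here is a toy sketch in Python. It is not how any PDF library is implemented: the glyph IDs, both lookup tables, and the `extract_text` helper are made up for illustration. It only shows how the same glyph sequence can come out as readable text or as gibberish depending on the mapping table.

```python
# Toy model only: real PDFs keep this mapping in font structures,
# and the glyph IDs and tables below are invented for illustration.

# What the page content refers to: a sequence of glyph IDs.
glyph_ids = [1, 2, 3, 3, 4]

# A correct glyph-to-Unicode table for this hypothetical font.
good_map = {1: "H", 2: "e", 3: "l", 4: "o"}

# A corrupt table: the glyphs still draw correctly on screen,
# but each one is mapped to the wrong character.
bad_map = {1: "\u02c6", 2: "%", 3: "\u02db", 4: "\u02da"}


def extract_text(ids, to_unicode):
    """Translate glyph IDs to characters, as a text extractor would."""
    # Unknown glyphs become U+FFFD, the Unicode replacement character.
    return "".join(to_unicode.get(g, "\ufffd") for g in ids)


print(extract_text(glyph_ids, good_map))  # Hello
print(extract_text(glyph_ids, bad_map))   # same glyphs, unreadable text
```

In both cases a viewer would render the same shapes; only the extracted text differs.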

I have heard that some programs even provide Unicode mappings for all glyphs that do not match up at all, on purpose (or through bad quality), to prevent copy and paste.

Bottom line: you can try to re-OCR the pages that give you gibberish, you could use Adobe Acrobat Pro (it has built-in OCR features) to extract the text from them, or you can just skip them.
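Before re-OCRing or skipping pages, you need to know which pages are affected. A minimal sketch of one way to flag them, assuming the documents are mostly English text: count how many non-whitespace characters are plain ASCII letters or digits. The function names and the 0.6 threshold are my own assumptions, not part of any library, and would need tuning on real data.

```python
def looks_garbled(page_text, min_ratio=0.6):
    """Return True if too few non-whitespace characters are ASCII letters/digits.

    min_ratio is an arbitrary, assumed threshold; tune it on your own pages.
    """
    chars = [c for c in page_text if not c.isspace()]
    if not chars:
        return False  # empty page: nothing to judge here
    plain = sum(1 for c in chars if c.isascii() and c.isalnum())
    return plain / len(chars) < min_ratio


def pages_to_recheck(page_texts):
    """Return 1-based page numbers whose extracted text looks garbled."""
    return [n for n, text in enumerate(page_texts, start=1) if looks_garbled(text)]
```

You would pass in the per-page strings you already extract (one string per page instead of one concatenated string) and then re-OCR or log only the page numbers that come back.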

You can try some other PDF-reading framework; maybe the current one got something not quite right. But chances are slim if it almost always works and fails only for a few special PDFs.
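If you do want to try several frameworks, one way to organize it is a fallback chain. The sketch below is library-agnostic: `first_usable_text`, the two stand-in extractors, and the `mostly_ascii` check are all hypothetical names of mine. In practice each extractor would be a small wrapper around a real PDF library that takes a path and returns the text.

```python
def first_usable_text(path, extractors, is_usable):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first usable result, or (None, "").
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # this extractor could not read the file; try the next
        if is_usable(text):
            return name, text
    return None, ""


# Stand-in extractors for demonstration; real ones would call a PDF library.
def fake_primary(path):
    return "\u02da\u02dd\u02c6\u02d9)\u02db\u02da\u02dc"  # simulated gibberish


def fake_secondary(path):
    return "Search Term 1 appears here"


def mostly_ascii(text):
    """Crude, assumed quality check: more than 80% ASCII characters."""
    return bool(text) and sum(c.isascii() for c in text) / len(text) > 0.8


name, text = first_usable_text(
    "the_one.pdf",
    [("primary", fake_primary), ("secondary", fake_secondary)],
    mostly_ascii,
)
print(name)  # secondary
```

If every extractor returns gibberish for the same page, that points to the PDF's own mapping rather than to a bug in one library.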

I am just a novice in PDF, and there are more advanced people around who can chime in on this, but if you cannot share the PDF it is going to be hard to advise anything.


Alternate approaches: Searching text in a PDF using Python?

Patrick Artner
  • This summary is as good an answer as one can give as long as one cannot analyze the pdf in question. – mkl May 16 '18 at 04:36
  • [ISO 32000-2 - Pdf 2.0 - Spec](https://www.iso.org/standard/63534.html) (non-free, ~1k pages) is the newer PDF spec, but at this time most programs in use will cater to the 1.7 one – Patrick Artner May 16 '18 at 05:51
  • I have decided to post the pdf. What would be the best way to provide it? – J. Dykstra May 17 '18 at 18:35