Find multiple text in pdfs

Question

I'm trying to find the PDFs that contain the strings in the list below. I can find PDFs when I search for a single word, but not when I search for several. Should I change my condition below? Thanks in advance; I'm a newbie here.

from tika import parser
import glob

path = glob.glob(r"C:\Users\kxdane\Desktop\TEST\OKED\*.pdf")

for path in path:
    pdf_files = glob.glob(path)

text = (['Disclosure','M.D.'])
for file in pdf_files:
    raw = parser.from_file(file)
    if text in raw['content']:
        print(file)

score 0 · Accepted Answer · answered May 11 '22 at 11:58

In Python, the substring test (in) on a string takes a single string on its left-hand side, not a list. So you need to test each substring in a loop and combine the results with a logical AND, for example like this:

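...
words = ['Disclosure', 'M.D.']
for file in pdf_files:
    raw = parser.from_file(file)
    # content may be None if no text could be extracted from the PDF
    content = raw['content'] or ''
    found = True
    for word in words:
        if word not in content:
            found = False
            break  # one missing word is enough to reject this file
    if found:
        print(file)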

Note: if words is an empty list, this will match every file in pdf_files.
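
The inner loop above can also be written with the built-in all(), which returns True only when every element of an iterable is true. The following is a minimal sketch rather than part of the original answer: the helper name contains_all and the sample string are made up for illustration, and treating None as empty text is an assumption about how files without extractable text should be handled.

```python
def contains_all(content, words):
    """Return True if every string in words occurs in content."""
    # Assumption: a file with no extractable text (content is None)
    # is treated as empty text, so it matches nothing but an empty list.
    content = content or ""
    return all(word in content for word in words)


# Made-up sample text standing in for raw['content']
sample = "Financial Disclosure by Jane Roe, M.D."

print(contains_all(sample, ["Disclosure", "M.D."]))   # True
print(contains_all(sample, ["Disclosure", "Ph.D."]))  # False
print(contains_all(sample, []))                       # True: all() of an empty iterable is True
print(contains_all(None, ["Disclosure"]))             # False
```

Inside the loop over pdf_files, the found flag and inner loop would then reduce to a single condition: if contains_all(raw['content'], words): print(file). The third call shows the empty-list case mentioned in the note.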
