How to find PDF files that contain a keyword using Python

Question

I have hundreds of articles on many topics stored as PDF files in a directory. I need to pick out the papers that contain the keywords "git log" or "git diff" from those hundreds of articles, and then collect the selected articles in a list.

How can we do that using Python?

This has been asked many times before: Check https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python or https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python or https://stackoverflow.com/questions/11570466/script-to-search-for-text-from-pdf — orangeInk, Dec 15 '17 at 07:52
However, what I want to do is to make a list of the selected pdf files in a text editor (e.g. notepad) from those hundreds of files using a python script. — YusufUMS, Dec 15 '17 at 08:05

score 2 · Answer 1 · answered Dec 15 '17 at 07:52

If you are not opposed to using a library, you can use https://github.com/euske/pdfminer

I've done something of the sort in Node.js. Just recursively scan the directory, run pdfminer on every file, and have it return the results.
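A minimal sketch of that approach in Python, under a few assumptions: it uses `extract_text` from `pdfminer.high_level`, which is part of the maintained pdfminer.six fork (`pip install pdfminer.six`) rather than the original repository linked above, and the helper names `pdf_to_text`, `find_pdfs_with_keywords`, and `write_list` are made up for illustration. Matching is a plain case-insensitive substring test, so scanned PDFs without a text layer will not match.

```python
import os
from typing import Callable, Iterable, List


def pdf_to_text(path: str) -> str:
    """Extract the text of one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the directory-walking logic can be tried without pdfminer.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def find_pdfs_with_keywords(
    root: str,
    keywords: Iterable[str],
    extract: Callable[[str], str] = pdf_to_text,
) -> List[str]:
    """Return the sorted paths of PDFs under `root` whose text contains any keyword."""
    wanted = [k.lower() for k in keywords]
    matches = []
    # os.walk visits `root` and every subdirectory below it.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(dirpath, name)
            try:
                # Collapse whitespace so a phrase split across a line break still matches.
                text = " ".join(extract(path).split()).lower()
            except Exception as exc:  # e.g. encrypted or damaged files
                print(f"skipped {path}: {exc}")
                continue
            if any(k in text for k in wanted):
                matches.append(path)
    return sorted(matches)


def write_list(paths: Iterable[str], out_file: str) -> None:
    """Write one matching path per line to a plain text file."""
    with open(out_file, "w", encoding="utf-8") as fh:
        for p in paths:
            fh.write(p + "\n")
```

For the question as asked, something like `write_list(find_pdfs_with_keywords("articles", ["git log", "git diff"]), "selected_articles.txt")` would produce a file that opens in Notepad; the directory name and output file name here are placeholders.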

Good luck!

Rheijn
