-3

I have hundreds of articles on many topics in pdf files in a directory. I need to point some papers containing the keywords git log or git diff command from those hundreds of articles. Then, I will collect the selected articles in a list.

How can we do that using Python?

YusufUMS
  • 1,506
  • 1
  • 12
  • 24
  • This has been asked many times before: Check https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python or https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python or https://stackoverflow.com/questions/11570466/script-to-search-for-text-from-pdf – orangeInk Dec 15 '17 at 07:52
  • However, what I want to do is to make a list of the selected pdf files in a text editor (e.g. notepad) from those hundreds of files using a python script. – YusufUMS Dec 15 '17 at 08:05

1 Answers1

2

If you are not opposed to using a library, you can use https://github.com/euske/pdfminer

I've done something of the sort for nodejs, just recursively scan the directory and scan every file with pdfminer and make it return the results.

Goodluck!

Rheijn
  • 319
  • 3
  • 11