The title pretty much says it all. I would like to search a PDF for certain keywords and then identify which pages those keywords are on.
-
What have you tried so far? Stack Overflow is not a code-writing service. Give it a try and post back here with the specific issues you are having with your code. – EBGreen Jan 19 '18 at 19:08
-
I'm not sure where to begin. My general idea, from coding in C++ and Python, is to start with a nested for loop that steps through each page and saves the page number if it finds a match. Within this loop, there might be another for loop that steps through each line. Since I only need to know whether the word occurs on the page, I can stop searching a page once I find one instance and go on to the next. If it were a text file this wouldn't be so hard, but I'm not sure how to adapt this approach to a PDF and to PowerShell. – Kurt Hoelsema Jan 19 '18 at 19:16
-
Have you googled anything? I found many, many, and I do mean many promising hits from a simple search. – EBGreen Jan 19 '18 at 19:18
-
@EBGreen Yes, I have googled it, but have not come across any PowerShell solutions. It appears many people have wanted to do this, but by some answers given, it is not easy. What exactly did you search? Or could you give a link to what you found? – Kurt Hoelsema Jan 19 '18 at 19:25
-
I just searched 'powershell search in pdf'. The first link has powershell code. – EBGreen Jan 19 '18 at 19:32
-
You might want to check out iTextSharp. It's a library written in C# for handling PDF files. You could load the DLL into PowerShell and access the functions it has for file manipulation. – trebleCode Jan 19 '18 at 19:59
-
It is the module that is used in that first hit that I mentioned. :) – EBGreen Jan 19 '18 at 20:04
-
Can you post your code? – trebleCode Jan 19 '18 at 20:57
-
This is probably really stupid. I haven't worked much in PowerShell. I'm trying to use iTextSharp as suggested, using this code to load the DLL: Add-Type -Path 'C:\Program Files\itextsharp-all-5.5.10\itextsharp-dll-core\itextsharp.dll' and am getting this error: "Files\itextsharp-all-5.5.10\itextsharp-dll-core\itextsharp.dll' or one of its dependencies. Operation is not supported. (Exception from HRESULT: 0x80131515)" – Kurt Hoelsema Jan 20 '18 at 03:32
-
Please, please, please...always edit your question to show your code. Reading code in comments sucks. – EBGreen Jan 22 '18 at 15:11
1 Answer
Dovetailing on the iTextSharp suggestion: your question sounds similar to this post.
Reading PDF content with itextsharp dll in VB.NET or C#: "How can I read PDF content with itextsharp using the PdfReader class? My PDF may include plain text or images of the text."
So in PowerShell, as trebleCode suggests, you can do something like this:
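# Load the iTextSharp assembly from the current PowerShell location
Add-Type -Path .\itextsharp.dll
# Pass a full path: .NET does not follow PowerShell's current location
$reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList "$pwd\test.pdf"
# the rest of your code starts here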
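One way to fill in the rest, as a minimal sketch rather than a definitive implementation: assuming iTextSharp 5.x is already loaded with Add-Type, `PdfTextExtractor.GetTextFromPage` can return the text of each page, and a plain substring check then records the page numbers for each keyword. `Get-PdfPageText` and `Find-KeywordPage` are hypothetical helper names, not part of iTextSharp, and the file names and keywords are placeholders. Pages that hold only scanned images of text will not match without OCR.

```powershell
# Hypothetical helper: emits the text of every page, in page order.
# Assumes iTextSharp 5.x has already been loaded with Add-Type.
function Get-PdfPageText {
    param([string]$Path)

    # Resolve to a full path because .NET does not follow PowerShell's location
    $reader = New-Object iTextSharp.text.pdf.PdfReader -ArgumentList (Resolve-Path $Path).ProviderPath
    try {
        # iTextSharp page numbers are 1-based
        for ($page = 1; $page -le $reader.NumberOfPages; $page++) {
            [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page)
        }
    }
    finally {
        $reader.Close()
    }
}

# Hypothetical helper: for each keyword, lists the 1-based pages whose text contains it.
function Find-KeywordPage {
    param(
        [string[]]$PageText,
        [string[]]$Keyword
    )

    foreach ($word in $Keyword) {
        $pages = @(
            for ($i = 0; $i -lt $PageText.Count; $i++) {
                # IndexOf stops at the first hit: a literal, case-insensitive match
                if ($PageText[$i].IndexOf($word, [StringComparison]::OrdinalIgnoreCase) -ge 0) {
                    $i + 1
                }
            }
        )
        [pscustomobject]@{ Keyword = $word; Pages = $pages }
    }
}

# Example usage (file names and keywords are placeholders):
# Add-Type -Path .\itextsharp.dll
# $text = @(Get-PdfPageText -Path .\test.pdf)
# Find-KeywordPage -PageText $text -Keyword 'invoice', 'total'
```

Keeping the extraction separate from the matching means the search logic can be tried on plain strings before any PDF is involved.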

postanote
-
Thanks for the response! I had something very similar, but PowerShell said it could not load the dll. I figured out this was because it was blocked because it was a downloaded file. I unblocked the file and now it is loading properly. – Kurt Hoelsema Jan 20 '18 at 16:21
-
No worries. Yep, folks often never pay attention to the ADS (Alternate Data Stream) on downloads until they get hit by it. I have a permanent PowerShell FileSystemWatcher in place, pointed at my download folders, for this very reason. As soon as a file hits one of those folders, it fires, checks for an ADS, and unblocks the file. This way I don't have to remember to do it manually. – postanote Jan 21 '18 at 00:29
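The unblock step discussed in these comments could look roughly like the sketch below. It assumes Windows and PowerShell 3.0 or later, where `Get-Item -Stream` and `Unblock-File` are available. `Unblock-IfDownloaded` is a hypothetical helper name and the path is a placeholder.

```powershell
# Hypothetical helper: unblocks a file only if it carries the Zone.Identifier
# alternate data stream that Windows attaches to downloaded files.
# Returns $true when the file was unblocked, $false when there was nothing to do.
function Unblock-IfDownloaded {
    param([string]$Path)

    # -ErrorAction SilentlyContinue: a missing stream is the normal "not blocked" case
    $zone = Get-Item -LiteralPath $Path -Stream Zone.Identifier -ErrorAction SilentlyContinue
    if ($zone) {
        Unblock-File -LiteralPath $Path
        return $true
    }
    return $false
}

# Example usage (the path is a placeholder):
# Unblock-IfDownloaded -Path 'C:\path\to\itextsharp.dll'
# Add-Type -Path 'C:\path\to\itextsharp.dll'
```

A function like this could also be called from a FileSystemWatcher event handler, along the lines described above.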