Reading and extracting hyperlinks from Excel using C#

Question

I have an Excel file that contains hyperlinks to PDFs, which have been OCR'd, and I want to:

From the Excel file, follow each hyperlink to the linked PDF
Convert the PDF to a text file
Search the text for certain keywords, then write each keyword found back to the Excel file, in the row its hyperlink belongs to.

If anyone can provide info on how this can be done, that would be a big help.

What have you tried? Is there a specific issue you are having? — NetMage, Jan 13 '21 at 20:21

score 1 · Answer 1 · edited Jan 13 '21 at 21:19

This seems a bit complex, but I would approach it as follows:

You should be able to find an Excel parser on NuGet. Google also turns up tutorials on this, like here: first link on google to a blog post
For reading PDF files, there are also posts on Stack Overflow, like here: link to post
Once you have the keywords, you can use the library from step 1 to update your Excel file

So in the end it comes down to finding libraries that do the heavy lifting and reading their documentation.

You actually don't have to write much yourself if you can reuse existing code.

By the way, have a look at regex when searching for keywords in plain text. Python can also be your friend: you could first extract all the links to the PDF files, write a little script that converts those PDFs to plain text, and then open the text files with C#. link to a post about how to convert PDF to text via Python
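To make the regex suggestion a bit more concrete, here is a minimal Python sketch of the keyword search step only. It assumes the PDF text has already been extracted and is keyed by the Excel row number the hyperlink came from. The names find_keywords, texts_by_row and hits_by_row, as well as the sample texts and keywords, are made up for illustration; none of them come from the question.

```python
import re

def find_keywords(text, keywords):
    """Return the keywords that occur in text as whole words, ignoring case."""
    found = []
    for kw in keywords:
        # re.escape keeps characters like "." or "+" in a keyword literal.
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(kw)
    return found

# Hypothetical data: Excel row number -> text already extracted from the linked PDF.
texts_by_row = {
    2: "Invoice number 1234, payment overdue since March.",
    3: "Contract renewal confirmed, no outstanding payment.",
}
keywords = ["invoice", "overdue", "renewal"]

# Row number -> list of keywords found in that row's PDF text.
hits_by_row = {
    row: find_keywords(text, keywords)
    for row, text in texts_by_row.items()
}
```

Keeping the results keyed by row number should make the last step easier, since whichever Excel library you pick only has to write each list into a cell of the matching row. The same approach carries over to C# with System.Text.RegularExpressions if you prefer to stay in one language.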
