
I need to make a Windows desktop application in C# that downloads all the PDFs from a website. I have the link to the website, but the problem I am facing is that the PDFs are not in a specific folder on the website but are scattered all over it.

What I need is help finding all those links so I can download them, or any other advice that could help me with my problem.

Thanks in advance for all help.

edited by svick
asked by EaglesNiko
  • So you want to write a spider? – SLaks Mar 16 '12 at 21:13
  • I am sure there are many free solutions available that could do it. – Andrew Mar 16 '12 at 21:16
  • Yes, I think I need a spider, but I didn't know what to search for. Now I have googled for spiders and I really need something like that. – EaglesNiko Mar 16 '12 at 21:24
  • While it would be a long way to your aim, if you really want to understand how to do it well, look at [this free online course](http://www.udacity.com/overview/Course/cs101). At least for the crawler part. – om-nom-nom Mar 16 '12 at 21:58

2 Answers

  1. Scrape through all the pages
  2. Find all the "*.pdf" URLs
  3. Reconstruct them into absolute URLs and simply download :) (see the sketch below)
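
A minimal sketch of those three steps in C# might look like the following. The start URL and output folder are placeholders, the crawl stays on one host, and a regex is used to pull out `href` attributes only to keep the sketch self-contained; a real HTML parser would be more robust on messy markup.

```csharp
// Minimal sketch of the three steps above: crawl pages on one host
// breadth-first, collect every link ending in ".pdf", and download it.
// The start URL and output folder below are placeholders.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class PdfSpider
{
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        var start = new Uri("http://example.com/");  // placeholder start page
        var outDir = "pdfs";                         // placeholder output folder
        Directory.CreateDirectory(outDir);

        var queue = new Queue<Uri>();
        queue.Enqueue(start);
        var seen = new HashSet<string> { start.AbsoluteUri };

        while (queue.Count > 0)
        {
            var page = queue.Dequeue();
            string html;
            try { html = await Client.GetStringAsync(page); }
            catch (HttpRequestException) { continue; }  // skip unreachable pages

            // Step 2: pull every href out of the page and resolve it
            // against the page URL so relative links become absolute.
            foreach (Match m in Regex.Matches(html, @"href\s*=\s*[""']([^""']+)[""']",
                                              RegexOptions.IgnoreCase))
            {
                if (!Uri.TryCreate(page, m.Groups[1].Value, out var link)) continue;
                if (link.Scheme != Uri.UriSchemeHttp &&
                    link.Scheme != Uri.UriSchemeHttps) continue;
                if (link.Host != start.Host || !seen.Add(link.AbsoluteUri)) continue;

                if (link.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // Step 3: download the reconstructed absolute PDF URL.
                    var bytes = await Client.GetByteArrayAsync(link);
                    File.WriteAllBytes(
                        Path.Combine(outDir, Path.GetFileName(link.AbsolutePath)), bytes);
                }
                else
                {
                    queue.Enqueue(link);  // Step 1: keep scraping HTML pages
                }
            }
        }
    }
}
```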

Please be more specific: are you trying to get all the PDFs from the HTML page or from the whole domain?

answered by lukas.pukenis
  • I need to find all the pdfs from the whole domain. – EaglesNiko Mar 16 '12 at 21:22
  • You can use Google! Google for "*.pdf inurl:website" and it should give you a list of all indexed accessible PDF files. Does that help? – lukas.pukenis Mar 16 '12 at 21:24
  • I thought about that but I wanted to find a more programmable solution – EaglesNiko Mar 16 '12 at 21:26
  • Sorry. Search for "-filetype:pdf inurl:domain". Read here: http://www.google.com/help/faq_filetypes.html – lukas.pukenis Mar 16 '12 at 21:28
  • I would stick to searching for "*.pdf" in the page, as well as links to other HTML/HTM/ASP/ASPX/PHP pages in the document, and then would loop through all the found pages and PDF links. Of course, reconstructing the PDF URLs could cause some problems. – lukas.pukenis Mar 16 '12 at 21:30

What you are trying to do is known as web scraping. There are some libraries which can make your task easier; one of them is IronWebScraper, but it's a paid one.

An extensive list of NuGet packages which can be used for web scraping is available here.
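
For instance, one common free package of that kind (my suggestion, not one named in the answer) is HtmlAgilityPack, which can load a page and enumerate its anchors. A minimal sketch of listing the PDF links on a single page, with a placeholder URL:

```csharp
// Sketch using the HtmlAgilityPack NuGet package (an assumption; the
// answer only points at a list of scraping packages). Prints the PDF
// links found on one placeholder page.
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");  // placeholder URL

        // SelectNodes returns null when no anchors match the XPath.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                Console.WriteLine(href);  // relative links still need resolving
        }
    }
}
```

To cover a whole domain rather than one page, the same extraction would run inside a crawl loop like the one sketched in the other answer.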