
I need to make a Windows desktop application in C# that downloads all the PDFs from a website. I have the link to the website, but the problem I am facing is that the PDFs are not in a specific folder on the website but are scattered all over it.

What I need is help finding all those links so I can download them, or any other advice that could help me with my problem.

Thanks in advance for all help.

edited by svick
asked by EaglesNiko
  • So you want to write a spider? – SLaks Mar 16 '12 at 21:13
  • I am sure there are many free solutions available that could do it. – Andrew Mar 16 '12 at 21:16
  • Yes, I think I need a spider, but I didn't know what to search for. Now I have googled for spiders and I really need something like that. – EaglesNiko Mar 16 '12 at 21:24
  • While it would be a long way to your aim, if you really want to understand how to do it well, look at [this free online course](http://www.udacity.com/overview/Course/cs101). At least for the crawler part. – om-nom-nom Mar 16 '12 at 21:58

2 Answers

  1. Scrape through all the pages
  2. Find all the "*.pdf" URLs
  3. Reconstruct them into absolute URLs and simply download :) (see the sketch below)
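
A minimal sketch of those three steps in C# might look like the following. The start URL and output folder are placeholders, the crawl stays on one host, and a regex is used to pull out `href` attributes only to keep the sketch self-contained; a real HTML parser would be more robust on messy markup.

```csharp
// Minimal sketch of the three steps above: crawl pages on one host
// breadth-first, collect every link ending in ".pdf", and download it.
// The start URL and output folder below are placeholders.
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class PdfSpider
{
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        var start = new Uri("http://example.com/");  // placeholder start page
        var outDir = "pdfs";                         // placeholder output folder
        Directory.CreateDirectory(outDir);

        var queue = new Queue<Uri>();
        queue.Enqueue(start);
        var seen = new HashSet<string> { start.AbsoluteUri };

        while (queue.Count > 0)
        {
            var page = queue.Dequeue();
            string html;
            try { html = await Client.GetStringAsync(page); }
            catch (HttpRequestException) { continue; }  // skip unreachable pages

            // Step 2: pull every href out of the page and resolve it
            // against the page URL so relative links become absolute.
            foreach (Match m in Regex.Matches(html, @"href\s*=\s*[""']([^""']+)[""']",
                                              RegexOptions.IgnoreCase))
            {
                if (!Uri.TryCreate(page, m.Groups[1].Value, out var link)) continue;
                if (link.Scheme != Uri.UriSchemeHttp &&
                    link.Scheme != Uri.UriSchemeHttps) continue;
                if (link.Host != start.Host || !seen.Add(link.AbsoluteUri)) continue;

                if (link.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // Step 3: download the reconstructed absolute PDF URL.
                    var bytes = await Client.GetByteArrayAsync(link);
                    File.WriteAllBytes(
                        Path.Combine(outDir, Path.GetFileName(link.AbsolutePath)), bytes);
                }
                else
                {
                    queue.Enqueue(link);  // Step 1: keep scraping HTML pages
                }
            }
        }
    }
}
```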

Please be more specific: are you trying to get all the PDFs from the HTML page or from the whole domain?

answered by lukas.pukenis
  • I need to find all the pdfs from the whole domain. – EaglesNiko Mar 16 '12 at 21:22
  • You can use Google! Google for "*.pdf inurl:website" and it should give you a list of all indexed accessible PDF files. Does that help? – lukas.pukenis Mar 16 '12 at 21:24
  • I thought about that but I wanted to find a more programmable solution – EaglesNiko Mar 16 '12 at 21:26
  • Sorry. Search for "-filetype:pdf inurl:domain". Read here: http://www.google.com/help/faq_filetypes.html – lukas.pukenis Mar 16 '12 at 21:28
  • I would stick to searching for "*.pdf" in the page, as well as links to other HTML/HTM/ASP/ASPX/PHP pages in the document, and then would loop through all the found pages and PDF links. Of course, reconstructing the PDF URLs could cause some problems. – lukas.pukenis Mar 16 '12 at 21:30

What you are trying to do is known as web scraping. There are some libraries which can make your task easier; one of them is IronWebScraper, but it's a paid one.

An extensive list of NuGet packages which can be used for web scraping is available here.
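
For instance, one common free package of that kind (my suggestion, not one named in the answer) is HtmlAgilityPack, which can load a page and enumerate its anchors. A minimal sketch of listing the PDF links on a single page, with a placeholder URL:

```csharp
// Sketch using the HtmlAgilityPack NuGet package (an assumption; the
// answer only points at a list of scraping packages). Prints the PDF
// links found on one placeholder page.
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/");  // placeholder URL

        // SelectNodes returns null when no anchors match the XPath.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (var a in anchors)
        {
            var href = a.GetAttributeValue("href", "");
            if (href.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                Console.WriteLine(href);  // relative links still need resolving
        }
    }
}
```

To cover a whole domain rather than one page, the same extraction would run inside a crawl loop like the one sketched in the other answer.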