
I would like to know if it is possible to list the URLs of a website. The URLs I am after host zip files: if you enter one correctly, the file is downloaded; if not, you are directed to a 404 page.

For example, if the main site is https://myexample.net/, I am interested in files under https://myexample.net/wp-content/uploads/2018/04/[do not have a pattern].zip. I tried to access https://myexample.net/wp-content/uploads/2018/04/, but got a 404 error.

In addition, I checked https://myexample.net/sitemap_index.xml, but did not find the URLs I am interested in. So the question is: how do I guess those URLs? I would appreciate any suggestions!
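For reference, this is the kind of check I am doing by hand (a rough sketch using the `requests` library; the file name below is a hypothetical guess, since the real names have no pattern I know of):

```python
import requests

# Hypothetical guess; the real file names follow no pattern I can see.
candidate = "https://myexample.net/wp-content/uploads/2018/04/some-guess.zip"

# A HEAD request checks whether the zip exists without downloading it.
resp = requests.head(candidate, allow_redirects=True)
print(candidate, "->", resp.status_code)  # 200: the file exists, 404: it doesn't
```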

TTT
  • `https://myexample.net/wp-content/uploads/2018/04/` probably displays the listing of the zip files, the HTML of which could be scraped. Could you post your actual link? It would make it much easier to write a working solution. – Ajax1234 Apr 03 '18 at 00:34
  • @Ajax1234, thanks for your suggestion. I tried to access `https://myexample.net/wp-content/uploads/2018/04/`, but got a `404` error... – TTT Apr 03 '18 at 00:36

2 Answers


Have you tried using a sitemap generator?

There is a Python library for it as well: https://pypi.python.org/pypi/sitemap-generator/0.5.2
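I haven't verified that library's exact API, but as a rough illustration of what a sitemap generator does under the hood, here is a minimal same-site crawl sketch using `requests` and `BeautifulSoup` (both are my own assumptions, not part of that package):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, limit=200):
    """Breadth-first crawl that collects same-domain URLs --
    roughly what a sitemap generator does internally."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = [start_url]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # skip zips, images, and other non-HTML responses
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)

for url in crawl("https://myexample.net/"):
    print(url)
```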

There are also browser plugins to do this if you don't want to code, such as the "uSelect iDownload" tool for Chrome.

Pie Who

I would like to know if it is possible to list URLs from a website?

It depends on whether you mean one specific website or any website in general.

I have done a decent amount of scraping using Scrapy over the years. Below is my experience:

  1. Many sites don't use sitemaps at all.
  2. Sites that do use sitemaps often have a very old one that was last updated long ago.
  3. Even a freshly generated sitemap contains only a limited set of URLs, not all of them.

So all in all, sitemaps can be good for generating a list of seed URLs, but they are controlled by the website administrator, who may or may not keep them updated. If you really want a complete list of URLs, you need to crawl the site. If you don't want to write code for that, you can look at the approaches discussed in the thread below:

Spider a Website and Return URLs Only

If you want to go the coding route, I would suggest you have a look at Scrapy (see the sketch after the links below):

Scrapy crawl all sitemap links

Using Scrapy to parse sitemaps
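To give a concrete starting point, here is a minimal Scrapy sketch of the idea: crawl the whole domain and report every link that points at a .zip file. The domain is the placeholder from the question, and this is a sketch to adapt, not a drop-in solution:

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ZipLinkSpider(CrawlSpider):
    """Crawl a site and yield every link pointing at a .zip file."""
    name = "ziplinks"
    allowed_domains = ["myexample.net"]       # placeholder from the question
    start_urls = ["https://myexample.net/"]

    rules = (
        # The default LinkExtractor follows HTML pages only (it ignores
        # common file extensions like .zip), so we won't download archives.
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Scan each page's anchors for hrefs ending in .zip.
        for href in response.css("a::attr(href)").getall():
            if href.lower().endswith(".zip"):
                yield {"zip_url": response.urljoin(href)}

if __name__ == "__main__":
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(ZipLinkSpider)
    process.start()
```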

Tarun Lalwani