
Suppose we have a website at http://www.example.com. I would like to get a list of its pages' URLs (just the URL addresses themselves, not the content behind them) - either all of them (including all subdomains and all subpages), or just those that follow a particular globbing and/or regex pattern.

So, for example, I'm looking for something that gets all URLs (just the URL addresses themselves) that follow a pattern such as http://*.example.com/*. I'm aware that globbing in Linux (e.g. via the shell) is (mostly or fully?) limited to local files and directories (correct me if I'm wrong).

How can I achieve this?
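To make the goal more concrete, the rough sketch below is the kind of crawler I have in mind (Python standard library only; the start URL, the glob pattern, and the page limit are just placeholders, and I don't know whether this is the right approach at all):

```python
# Rough sketch: breadth-first crawl of a domain, printing only the URLs
# (not the page contents) that match a glob pattern. Standard library only.
import re
import fnmatch
import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, pattern, limit=500):
    """Crawl from start_url and return the URLs that match the glob pattern."""
    matcher = re.compile(fnmatch.translate(pattern))
    base_host = urllib.parse.urlparse(start_url).hostname or ""
    base_host = base_host[4:] if base_host.startswith("www.") else base_host
    seen, matched = set(), []
    queue = deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        if matcher.match(url):
            matched.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urllib.parse.urljoin(url, href)
            absolute = urllib.parse.urldefrag(absolute)[0]  # drop #fragments
            host = urllib.parse.urlparse(absolute).hostname or ""
            # stay on the original domain and its subdomains
            if (host == base_host or host.endswith("." + base_host)) and absolute not in seen:
                queue.append(absolute)
    return matched

if __name__ == "__main__":
    for u in crawl("http://www.example.com/", "http://*.example.com/*"):
        print(u)
```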

I suppose that something related (although not quite the same?) is discussed here: How to find all links / pages on a website.

P.S. All of the URLs belong to a website made up of static webpages only. I'm not sure whether the same thing is even possible for websites built from dynamically generated pages... Also, I'm not sure whether URLs with query strings in them (e.g. http://www.example.com/?=abc&xyz) can be captured at all in such a way.
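If query-string URLs can be captured at all, I imagine the pattern would have to account for the ? and & characters explicitly. A hypothetical example of what I mean with a plain regex (the pattern itself is made up):

```python
import re

# Hypothetical: a regex that would also match query-string URLs on the main domain.
pattern = re.compile(r"^http://www\.example\.com/\?[\w=&]*$")
print(bool(pattern.match("http://www.example.com/?=abc&xyz")))  # prints: True
```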

  • I didn't fully understand the question.. can you give a more detailed example? – Bozho May 18 '15 at 20:41
  • @Bozho Well, it's simple. It's about either getting _all_ URL addresses (and *not* the contents of those URLs, just the web addresses themselves) of all the pages of a particular domain (including its subdomains) or getting _some_ of these URLs based on a regex-/globbing-like pattern. – sahwar May 19 '15 at 18:33
  • so you want to crawl it? – Bozho Jul 04 '15 at 08:08
  • @Bozho Yes, but only to get all URLs of a particular web domain, not their contents. The web scraping part will be done by using _the list_ of those crawled URLs. Any solution is welcome! :) P.S. The only requirement is to be able to use globbing/regex(-like) pattern matching to restrict the URLs that I want to get from the web domain. – sahwar Aug 28 '15 at 16:57
