118

I'm deploying a replacement site for a client but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous.

So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old page URLs.

I could do this manually, but I'd be interested if there are any apps that would provide me a list of relative (eg: /page/path, not http:/.../page/path) URLs just given the home page. Like a spider but one that doesn't care about the content other than to find deeper pages.

Kara
  • 6,115
  • 16
  • 50
  • 57
Oli
  • 235,628
  • 64
  • 220
  • 299

8 Answers8

92

I didn't mean to answer my own question but I just thought about running a sitemap generator. First one I found http://www.xml-sitemaps.com has a nice text output. Perfect for my needs.

Oli
  • 235,628
  • 64
  • 220
  • 299
59

do wget -r -l0 www.oldsite.com

Then just find www.oldsite.com would reveal all urls, I believe.

Alternatively, just serve that custom not-found page on every 404 request! I.e. if someone used the wrong link, he would get the page telling that page wasn't found, and making some hints about site's content.

alamar
  • 18,729
  • 4
  • 64
  • 97
  • 21
    Notably, since this returns a list of *files*, not URLs, this would only really work for sites that are collections of static HTML files. If the site has URL query parameters, server-side rewritten URLs, or any kind of `include`/`require`/etc. assembling of pages, this won't really work. – T.J. Schuck Jun 24 '11 at 19:41
  • 1
    I might be misunderstanding wget. I thought 'wget' was for downloading the contents of the site? – Cosmic Hawk Oct 21 '16 at 16:06
  • 1
    @Doomsy yes, but when you've downloaded all the content you surely know all the URLs to that content, and without downloading there's no way to find out URLs. – alamar Oct 24 '16 at 09:21
  • 4
    Consider the default depth. https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html#Recursive-Retrieval-Options – PJ Brunet Aug 22 '18 at 02:36
  • 1
    @PJBrunet thx, fixed! – alamar Aug 22 '18 at 09:00
  • 2
    @alamar Yes there's "-r -l inf" for infinite recursion, but I recommend people check out the documentation--so many cool options! The "-m" option will mirror and I'm going to try "-R.jpg,.jpeg,.gif,.png" which I think skips images. – PJ Brunet Aug 22 '18 at 23:00
  • 1
    can we export this list into a pdf file or excel file? – Pooya Panahandeh Jun 05 '20 at 12:08
  • 1
    I don't see why not? `find -type f > urls.csv` – alamar Jun 05 '20 at 12:32
24

Here is a list of sitemap generators (from which obviously you can get the list of URLs from a site): http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

Web Sitemap Generators

The following are links to tools that generate or maintain files in the XML Sitemaps format, an open standard defined on sitemaps.org and supported by the search engines such as Ask, Google, Microsoft Live Search and Yahoo!. Sitemap files generally contain a collection of URLs on a website along with some meta-data for these URLs. The following tools generally generate "web-type" XML Sitemap and URL-list files (some may also support other formats).

Please Note: Google has not tested or verified the features or security of the third party software listed on this site. Please direct any questions regarding the software to the software's author. We hope you enjoy these tools!

Server-side Programs

  • Enarion phpSitemapsNG (PHP)
  • Google Sitemap Generator (Linux/Windows, 32/64bit, open-source)
  • Outil en PHP (French, PHP)
  • Perl Sitemap Generator (Perl)
  • Python Sitemap Generator (Python)
  • Simple Sitemaps (PHP)
  • SiteMap XML Dynamic Sitemap Generator (PHP) $
  • Sitemap generator for OS/2 (REXX-script)
  • XML Sitemap Generator (PHP) $

CMS and Other Plugins:

  • ASP.NET - Sitemaps.Net
  • DotClear (Spanish)
  • DotClear (2)
  • Drupal
  • ECommerce Templates (PHP) $
  • Ecommerce Templates (PHP or ASP) $
  • LifeType
  • MediaWiki Sitemap generator
  • mnoGoSearch
  • OS Commerce
  • phpWebSite
  • Plone
  • RapidWeaver
  • Textpattern
  • vBulletin
  • Wikka Wiki (PHP)
  • WordPress

Downloadable Tools

  • GSiteCrawler (Windows)
  • GWebCrawler & Sitemap Creator (Windows)
  • G-Mapper (Windows)
  • Inspyder Sitemap Creator (Windows) $
  • IntelliMapper (Windows) $
  • Microsys A1 Sitemap Generator (Windows) $
  • Rage Google Sitemap Automator $ (OS-X)
  • Screaming Frog SEO Spider and Sitemap generator (Windows/Mac) $
  • Site Map Pro (Windows) $
  • Sitemap Writer (Windows) $
  • Sitemap Generator by DevIntelligence (Windows)
  • Sorrowmans Sitemap Tools (Windows)
  • TheSiteMapper (Windows) $
  • Vigos Gsitemap (Windows)
  • Visual SEO Studio (Windows)
  • WebDesignPros Sitemap Generator (Java Webstart Application)
  • Weblight (Windows/Mac) $
  • WonderWebWare Sitemap Generator (Windows)

Online Generators/Services

  • AuditMyPc.com Sitemap Generator
  • AutoMapIt
  • Autositemap $
  • Enarion phpSitemapsNG
  • Free Sitemap Generator
  • Neuroticweb.com Sitemap Generator
  • ROR Sitemap Generator
  • ScriptSocket Sitemap Generator
  • SeoUtility Sitemap Generator (Italian)
  • SitemapDoc
  • Sitemapspal
  • SitemapSubmit
  • Smart-IT-Consulting Google Sitemaps XML Validator
  • XML Sitemap Generator
  • XML-Sitemaps Generator

CMS with integrated Sitemap generators

  • Concrete5

Google News Sitemap Generators The following plugins allow publishers to update Google News Sitemap files, a variant of the sitemaps.org protocol that we describe in our Help Center. In addition to the normal properties of Sitemap files, Google News Sitemaps allow publishers to describe the types of content they publish, along with specifying levels of access for individual articles. More information about Google News can be found in our Help Center and Help Forums.

  • WordPress Google News plugin

Code Snippets / Libraries

  • ASP script
  • Emacs Lisp script
  • Java library
  • Perl script
  • PHP class
  • PHP generator script

If you believe that a tool should be added or removed for a legitimate reason, please leave a comment in the Webmaster Help Forum.

Franck Dernoncourt
  • 77,520
  • 72
  • 342
  • 501
6

The best on I have found is http://www.auditmypc.com/xml-sitemap.asp which uses Java, and has no limit on pages, and even lets you export results as a raw URL list.

It also uses sessions, so if you are using a CMS, make sure you are logged out before you run the crawl.

Collins
  • 1,069
  • 14
  • 35
3

So, in an ideal world you'd have a spec for all pages in your site. You would also have a test infrastructure that could hit all your pages to test them.

You're presumably not in an ideal world. Why not do this...?

  1. Create a mapping between the well known old URLs and the new ones. Redirect when you see an old URL. I'd possibly consider presenting a "this page has moved, it's new url is XXX, you'll be redirected shortly".

  2. If you have no mapping, present a "sorry - this page has moved. Here's a link to the home page" message and redirect them if you like.

  3. Log all redirects - especially the ones with no mapping. Over time, add mappings for pages that are important.

Martin Peck
  • 11,440
  • 2
  • 42
  • 69
2

wget from a linux box might also be a good option as there are switches to spider and change it's output.

EDIT: wget is also available on Windows: http://gnuwin32.sourceforge.net/packages/wget.htm

Thomas Schultz
  • 2,446
  • 3
  • 25
  • 36
1

Write a spider which reads in every html from disk and outputs every "href" attribute of an "a" element (can be done with a parser). Keep in mind which links belong to a certain page (this is common task for a MultiMap datastructre). After this you can produce a mapping file which acts as the input for the 404 handler.

Mork0075
  • 5,895
  • 4
  • 25
  • 24
1

I would look into any number of online sitemap generation tools. Personally, I've used this one (java based)in the past, but if you do a google search for "sitemap builder" I'm sure you'll find lots of different options.

Eric Petroelje
  • 59,820
  • 9
  • 127
  • 177