
Does anyone know of a library for fixing "broken" URLs? When I try to open a URL such as

http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff

urllib2.urlopen chokes and raises an HTTPError. Does anyone know of a library that can fix these sorts of things?

erikcw
  • The last one is perfectly valid, isn't it? – SeanJA Sep 17 '09 at 02:20
  • Why not scan the URLs (on a website, I presume) and then use a regex to fix the bad ones, or at worst fix them by hand? – SeanJA Sep 17 '09 at 02:24
  • @SeanJA: the last one is valid for a *browser*, but the browser removes the `#stuff` part before sending the request to the server. A server is likely to reject a URL with `#stuff` on the end, which is why the OP got an error from `urlopen`. Such affixes must be removed before asking a server about that URL. – Greg Hewgill Sep 17 '09 at 02:28
  • I wouldn't even try to fix the first two. At best they are probably malformed through too many copy-and-pastes (missing the 'cgi-bin/awesomeblog' part), and at worst they are an attempt at peeking outside of htdocs. How would you 'fix' a URL like http://www.example.com/../../etc/password? – SingleNegationElimination Sep 17 '09 at 03:19
  • Maybe first try to determine why they are wrong in the first place? –  Sep 17 '09 at 05:40
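As Greg's comment notes, the fragment is for the browser and should never reach the server. The standard library can strip it directly; a minimal sketch (shown in Python 3, where the old `urlparse` module has become `urllib.parse`):

```python
from urllib.parse import urldefrag

# urldefrag splits a URL into the part before '#' and the fragment itself.
url, fragment = urldefrag('http://www.domain.com/page.html#stuff')
print(url)       # http://www.domain.com/page.html
print(fragment)  # stuff
```

The defragmented URL is then safe to pass to `urlopen`.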

1 Answer


What about something like...:

import re
import urlparse  # Python 2; in Python 3 this module is urllib.parse

urls = '''
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
'''.split()

def main():
  for u in urls:
    # Split the URL into its six components (scheme, netloc, path, ...).
    pieces = list(urlparse.urlparse(u))
    # Collapse any leading run of dots and slashes in the path into a single '/'.
    pieces[2] = re.sub(r'^[./]*', '/', pieces[2])
    # Drop the fragment (the part after '#'), which is for the browser only.
    pieces[-1] = ''
    print urlparse.urlunparse(pieces)

main()

It emits, as you desire:

http://www.domain.com/page.html
http://www.domain.com/page.html
http://www.domain.com/page.html

and would appear to roughly match your needs, if I understood them correctly.

Alex Martelli
  • This will not work for e.g. http://www.domain.com/ohmygod///page.html or http://www.domain.com/nono/...page.html –  Sep 17 '09 at 05:39
  • Right, I'm only fixing breakages at the start of the path, as per the only examples the OP gave. You can fix a few more breakages with `path.split('/')`, ignoring empty pieces and removing stray leading dots. But there is a higher-order infinity of possible broken URLs; unless SOME specs are given, it's impossible to KNOW what to fix!-) – Alex Martelli Sep 17 '09 at 05:47
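Alex's follow-up suggestion (split the path on '/', ignore empty pieces, remove stray leading dots) can be sketched like this. The sketch uses Python 3's `urllib.parse` in place of Python 2's `urlparse`, and the cleaning rule is only an assumption about what "broken" means here; it would, for instance, also mangle legitimately dotted segments like `.well-known`:

```python
from urllib.parse import urlparse, urlunparse

def clean_url(u):
    p = urlparse(u)
    # Strip leading dots from each path segment, then drop empty segments,
    # so '..', '//' and '...page.html'-style breakage collapses away.
    segments = [seg.lstrip('.') for seg in p.path.split('/')]
    segments = [seg for seg in segments if seg]
    path = '/' + '/'.join(segments)
    # Rebuild the URL without the fragment, which the server should never see.
    return urlunparse((p.scheme, p.netloc, path, p.params, p.query, ''))

for u in ('http://www.domain.com/../page.html',
          'http://www.domain.com/ohmygod///page.html',
          'http://www.domain.com/nono/...page.html#stuff'):
    print(clean_url(u))
```

This prints http://www.domain.com/page.html, http://www.domain.com/ohmygod/page.html and http://www.domain.com/nono/page.html, covering the extra cases raised in the comment above as well as the OP's originals.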