
Does anyone know of a library for fixing "broken" URLs? When I try to open a URL such as

http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff

urllib2.urlopen chokes and raises an HTTPError. Does anyone know of a library that can fix these sorts of things?

erikcw
  • The last one is perfectly valid, isn't it? – SeanJA Sep 17 '09 at 02:20
  • Why not scan the URLs (on a website, I presume) and then use a regex to fix the bad ones, or at worst fix them by hand? – SeanJA Sep 17 '09 at 02:24
  • @SeanJA: the last one is valid for a *browser*, but the browser removes the `#stuff` part before sending the request to the server. A server is likely to reject a URL with `#stuff` on the end, which is why the OP got an error from `urlopen`. Such affixes must be removed before asking a server about that URL. – Greg Hewgill Sep 17 '09 at 02:28
  • I wouldn't even try to fix the first two. At best they are probably malformed through too many copy-and-pastes (missing the 'cgi-bin/awesomeblog' part), and at worst they are an attempt at peeking outside of htdocs. How would you 'fix' a URL like http://www.example.com/../../etc/password? – SingleNegationElimination Sep 17 '09 at 03:19
  • Maybe first try to determine why they are wrong in the first place? –  Sep 17 '09 at 05:40
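As Greg's comment notes, the fragment is for the browser and should never reach the server. The standard library can strip it directly; a minimal sketch (shown in Python 3, where the old `urlparse` module has become `urllib.parse`):

```python
from urllib.parse import urldefrag

# urldefrag splits a URL into the part before '#' and the fragment itself.
url, fragment = urldefrag('http://www.domain.com/page.html#stuff')
print(url)       # http://www.domain.com/page.html
print(fragment)  # stuff
```

The defragmented URL is then safe to pass to `urlopen`.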

1 Answer


What about something like...:

import re
import urlparse  # Python 2; in Python 3 this module is urllib.parse

urls = '''
http://www.domain.com/../page.html
http://www.domain.com//page.html
http://www.domain.com/page.html#stuff
'''.split()

def main():
  for u in urls:
    # Split the URL into its six components (scheme, netloc, path, ...).
    pieces = list(urlparse.urlparse(u))
    # Collapse any leading run of dots and slashes in the path into a single '/'.
    pieces[2] = re.sub(r'^[./]*', '/', pieces[2])
    # Drop the fragment (the part after '#'), which is for the browser only.
    pieces[-1] = ''
    print urlparse.urlunparse(pieces)

main()

It emits, as you desire:

http://www.domain.com/page.html
http://www.domain.com/page.html
http://www.domain.com/page.html

and would appear to roughly match your needs, if I understood them correctly.

Alex Martelli
  • This will not work for e.g. http://www.domain.com/ohmygod///page.html or http://www.domain.com/nono/...page.html –  Sep 17 '09 at 05:39
  • Right, I'm only fixing breakages at the start of the path, as per the only examples the OP gave. You can fix a few more breakages with `path.split('/')`, ignoring empty pieces and removing stray leading dots. But there is a higher-order infinity of possible broken URLs; unless SOME specs are given, it's impossible to KNOW what to fix!-) – Alex Martelli Sep 17 '09 at 05:47
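Alex's follow-up suggestion (split the path on '/', ignore empty pieces, remove stray leading dots) can be sketched like this. The sketch uses Python 3's `urllib.parse` in place of Python 2's `urlparse`, and the cleaning rule is only an assumption about what "broken" means here; it would, for instance, also mangle legitimately dotted segments like `.well-known`:

```python
from urllib.parse import urlparse, urlunparse

def clean_url(u):
    p = urlparse(u)
    # Strip leading dots from each path segment, then drop empty segments,
    # so '..', '//' and '...page.html'-style breakage collapses away.
    segments = [seg.lstrip('.') for seg in p.path.split('/')]
    segments = [seg for seg in segments if seg]
    path = '/' + '/'.join(segments)
    # Rebuild the URL without the fragment, which the server should never see.
    return urlunparse((p.scheme, p.netloc, path, p.params, p.query, ''))

for u in ('http://www.domain.com/../page.html',
          'http://www.domain.com/ohmygod///page.html',
          'http://www.domain.com/nono/...page.html#stuff'):
    print(clean_url(u))
```

This prints http://www.domain.com/page.html, http://www.domain.com/ohmygod/page.html and http://www.domain.com/nono/page.html, covering the extra cases raised in the comment above as well as the OP's originals.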