
I have already seen this thread: How can I unshorten a URL?

My issue with the accepted answer there (which uses the unshort.me API) is that I am focusing on unshortening YouTube links. Since unshort.me is so heavily used, almost 90% of the results come back as captchas, which I am unable to resolve.

So far I am stuck with using:

import urllib2

def unshorten_url(url):
    resolvedURL = urllib2.urlopen(url)
    print resolvedURL.url

    #t = Test()
    #c = pycurl.Curl()
    #c.setopt(c.URL, 'http://api.unshort.me/?r=%s&t=xml' % (url))
    #c.setopt(c.WRITEFUNCTION, t.body_callback)
    #c.perform()
    #c.close()
    #dom = xml.dom.minidom.parseString(t.contents)
    #resolvedURL = dom.getElementsByTagName("resolvedURL")[0].firstChild.nodeValue
    return resolvedURL.url

Note: everything in the comments is what I tried when using the unshort.me service, which was returning captcha links.

Does anyone know of a more efficient way to complete this operation without using open (since fetching the full page is a waste of bandwidth)?

brandonmat
  • What url shortener are you having trouble with? Why are you using unshort.me anyways? Your code should already work, it should unshorten urls by following the redirection to the real url. – Zach Kelling Aug 22 '11 at 20:23
  • I don't understand what you mean by "without using open". A short link is a key into somebody else's database; you can't expand the link without querying the database. – Greg Hewgill Aug 22 '11 at 20:24
  • When I was reading the post that I referenced (http://stackoverflow.com/questions/4201062/how-can-i-unshorten-a-url-using-python) it seemed like the urlopen command GETs the whole page, which is a waste of bandwidth when all I am looking for is the link. The suggested method was not working for me (unshort.me) so I decided to see if there were any other alternatives. – brandonmat Aug 23 '11 at 01:28

5 Answers


A one-line function using the requests library, and yes, it handles recursively shortened URLs (it follows the whole redirect chain).

import requests

def unshorten_url(url):
    return requests.head(url, allow_redirects=True).url
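A quick usage sketch (the short link below is just a placeholder):

print(unshorten_url('http://bit.ly/cXEInp'))  # prints the final resolved URL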
bersam
  • I think this answer is even better than the most voted answer. Try with urls from fb.net and it returns the correct url while the other does nothing. – lenhhoxung Dec 09 '16 at 15:02
  • This is a one-liner and works perfectly. Probably the best answer. – Aventinus Oct 09 '17 at 12:04
  • Maybe a weird question, but should I close the connection after using `requests.head`? – Tito Sanz Feb 18 '21 at 09:23
  • @TitoSanz No, you can check the code, session is closed for all kind of requests (unless you open a session yourself): https://github.com/psf/requests/blob/4f6c0187150af09d085c03096504934eb91c7a9e/requests/api.py#L57-L61 – bersam Mar 03 '21 at 00:18

Use the best-rated answer (not the accepted answer) in that question:

# This is for Py2k.  For Py3k, use http.client and urllib.parse instead, and
# use // instead of / for the division
import httplib
import urlparse

def unshorten_url(url):
    parsed = urlparse.urlparse(url)
    h = httplib.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    if response.status/100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location')) # changed to process chains of short urls
    else:
        return url
Pedro Loureiro
  • Worked like a charm - I tried this yesterday to no avail since I was receiving errors on about 70% of the returns. May have just been a one-off thing and that's why I dismissed it. Thank you for your reply and sorry for my redundant question. – brandonmat Aug 22 '11 at 20:34
  • As a follow-up, I just remembered why this way did not work for me. I am working on a twitter application and there are cases where a url is twice shortened (which happens a significant number of times). For example it will get this video [u't.co/LszdhNP'] and return this url etsy.me/r6JBGq - where I actually need the final youtube address that this links to. Do you know of any way to get around this? – brandonmat Aug 23 '11 at 00:14
  • a simple change was made in my answer – Pedro Loureiro Aug 23 '11 at 11:42
  • Great this works perfectly. I will look into this a bit more so that I understand it a bit better and can tweak it myself in the future. Thanks again. – brandonmat Aug 23 '11 at 16:55
  • Some websites (i.e. twitter) will try to force redirects from http to https. In this case, your solution will loop forever since all connections are assumed to be http and will continue to see redirect headers. To verify this, try running unshorten_url("http://t.co/t"). I suggest checking the parsed.scheme and optionally using httplib.HTTPSConnection(). – michaelxor Jun 11 '13 at 22:35
  • doesn't work for a URL like http://f-st.co/THHI6hC; it just returns me HTTP status code 500. – evandrix Mar 05 '15 at 16:09
  • For reference, the answer to this question: http://stackoverflow.com/questions/29425378/how-to-un-shorten-resolve-a-url-using-python-when-final-url-is-https also works with https – kyrenia Apr 03 '15 at 18:37
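Per the Py2k/Py3k note at the top of this answer, here is a sketch of a Python 3 equivalent using http.client and urllib.parse, with michaelxor's scheme check folded in (an illustration, not part of the original answer):

import http.client
import urllib.parse

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    # Pick the right connection class so https redirects don't loop forever.
    if parsed.scheme == 'https':
        h = http.client.HTTPSConnection(parsed.netloc)
    else:
        h = http.client.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    # // is integer division, as the Py2 comment above advises
    if response.status // 100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location'))
    return url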

Here is source code that handles almost all of the useful corner cases:

  • sets a custom timeout.
  • sets a custom User-Agent.
  • checks whether to use an http or https connection.
  • resolves the input url recursively and prevents ending up in a redirect loop.

The source code is on GitHub at https://github.com/amirkrifa/UnShortenUrl

comments are welcome ...

import logging
import traceback
import urlparse
import httplib

logging.basicConfig(level=logging.DEBUG)

TIMEOUT = 10

class UnShortenUrl:
    def process(self, url, previous_url=None):
        logging.info('Init url: %s' % url)
        try:
            parsed = urlparse.urlparse(url)
            # Choose the connection class based on the scheme.
            if parsed.scheme == 'https':
                h = httplib.HTTPSConnection(parsed.netloc, timeout=TIMEOUT)
            else:
                h = httplib.HTTPConnection(parsed.netloc, timeout=TIMEOUT)
            resource = parsed.path
            if parsed.query != "":
                resource += "?" + parsed.query
            try:
                h.request('HEAD', resource,
                          headers={'User-Agent': 'curl/7.38.0'})
                response = h.getresponse()
            except:
                traceback.print_exc()
                return url
            logging.info('Response status: %d' % response.status)
            if response.status / 100 == 3 and response.getheader('Location'):
                red_url = response.getheader('Location')
                logging.info('Redirect, previous: %s, %s' % (red_url, previous_url))
                # Stop if we are being bounced back to the same URL.
                if red_url == previous_url:
                    return red_url
                return self.process(red_url, previous_url=url)
            else:
                return url
        except:
            traceback.print_exc()
            return None
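A brief usage sketch (again with a placeholder short link):

print UnShortenUrl().process('http://bit.ly/cXEInp')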
Amir Krifa

You DO have to open it, otherwise you won't know what URL it will redirect to. As Greg put it:

A short link is a key into somebody else's database; you can't expand the link without querying the database

Now to your question.

Does anyone know of a more efficient way to complete this operation without using open (since it is a waste of bandwidth)?

The more efficient way is not to close the connection but to keep it open in the background, using HTTP's Connection: keep-alive.

After a small test, unshorten.me seems to take the HEAD method into account and does a redirect to itself:

> telnet unshorten.me 80
Trying 64.202.189.170...
Connected to unshorten.me.
Escape character is '^]'.
HEAD http://unshort.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp HTTP/1.1
Host: unshorten.me

HTTP/1.1 301 Moved Permanently
Date: Mon, 22 Aug 2011 20:42:46 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Location: http://resolves.me/index.php?r=http%3A%2F%2Fbit.ly%2FcXEInp
Cache-Control: private
Content-Length: 0

So if you use the HEAD HTTP method, instead of GET, you will actually end up doing the same work twice.

Instead, you should keep the connection alive, which will save you only a little bandwidth, but what it will certainly save is the latency of establishing a new connection every time. Establishing a TCP/IP connection is expensive.

You should get away with a number of kept-alive connections to the unshorten service equal to the number of concurrent connections your own service receives.

You could manage these connections in a pool. That's the closest you can get, besides tweaking your kernel's TCP/IP stack.
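As a rough sketch of that idea with today's tooling (assuming the requests library, whose Session keeps connections alive and pools them per host via urllib3):

import requests
from requests.adapters import HTTPAdapter

# A Session reuses keep-alive connections, so repeated lookups skip
# the TCP connection-establishment latency described above.
session = requests.Session()

# Size the pool roughly to the number of concurrent lookups you expect.
session.mount('http://', HTTPAdapter(pool_connections=10, pool_maxsize=10))
session.mount('https://', HTTPAdapter(pool_connections=10, pool_maxsize=10))

def unshorten_url(url):
    # HEAD follows the redirect chain without downloading response bodies.
    return session.head(url, allow_redirects=True).url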

Flavius
  • Awesome thank you for the information. I am currently going to use Pedro Loureiro answer above since it is working for the time being. But I will refer back to this if I run into any problems. Much appreciated. – brandonmat Aug 22 '11 at 21:14
import requests

short_url = "<your short url goes here>"
long_url = requests.get(short_url).url
print(long_url)
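For context (not part of the original answer): requests.get follows redirects by default, so .url on the response holds the final resolved address; unlike the HEAD-based approaches above, though, it downloads the full response body, which is exactly the bandwidth waste the question asks about.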
  • [A code-only answer is not high quality](//meta.stackoverflow.com/questions/392712/explaining-entirely-code-based-answers). While this code may be useful, you can improve it by saying why it works, how it works, when it should be used, and what its limitations are. Please [edit] your answer to include explanation and link to relevant documentation. – Stephen Ostermiller Feb 09 '23 at 10:29