2

How can I use a regex to extract the URL from the following text?

/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=gptuu5b6kogtyatduicidq&ved=0cbqqfjaa&usg=afqjcnejdwki_gcnxgzsd4apxey1k2swlw

Desired result is:

http://www.linkedin.com/in/sujachandrasekaran

I used this

import re

a = "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca"
linkedin_links = re.findall('(http.*)&', a)

and it gave me this:

u'http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa'
Tommy N
  • Are you aware that the `&` parts are actually part of the URL? In either case, something like `(http[^&]+)` would work, or more simply make it non-greedy with `(http.*?)&` – sapi Aug 15 '14 at 23:31
  • I got the link using this: linkedin_links = re.findall('(http://www.linkedin.com/in/.*?)&',a) – Tommy N Aug 15 '14 at 23:37

5 Answers

5

Instead of a regex, use the appropriate tool for the job...

# Python 2 import; on Python 3 use: from urllib.parse import urlparse, parse_qs
from urlparse import urlparse, parse_qs

url = '/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=gptuu5b6kogtyatduicidq&ved=0cbqqfjaa&usg=afqjcnejdwki_gcnxgzsd4apxey1k2swlw'
qs = parse_qs(urlparse(url).query)['q']
# ['http://www.linkedin.com/in/sujachandrasekaran']

It handles escaping and multiple q parameters, and you don't have to worry about where q appears in the query string.
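
To illustrate those points, here is a minimal sketch for Python 3, where the same functions live in urllib.parse rather than urlparse. The helper name extract_target and the second sample URL are made up for this example and are not part of the original answer.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the first value of the `q` query parameter, or None if absent."""
    # extract_target is a hypothetical helper name used only in this sketch.
    values = parse_qs(urlparse(href).query).get('q')
    return values[0] if values else None


# The URL from the question (shortened).
print(extract_target('/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ved=0cbqqfjaa'))

# Hypothetical input: q is percent-encoded and is not the first parameter.
print(extract_target('/url?sa=u&q=http%3A%2F%2Fexample.com%2Fa%3Fb%3D1&ved=0'))
```

In the second call parse_qs also decodes the percent-escapes, which a plain regex would leave untouched.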

Jon Clements
1

TL;DR: Use '(http.*?)&' instead of '(http.*)&'.

Your regex contains .*, which is greedy by default, meaning that it tries to match as much as possible. In your case, it therefore matches everything up to (but excluding) the last &. Because you want to match only up to the first &, you must make the regex non-greedy with the ? modifier: .*? tries to match as few characters as possible. Ordinarily, that would be an empty string, but because in your case it must be followed by &, it matches up to the first &.
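
A quick way to see the difference is to run both patterns on the string from the question. This is only a sketch; the variable names greedy and lazy are chosen here for illustration.

```python
import re

a = "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca"

# Greedy: .* grabs as much as it can, so the match runs up to the last &.
greedy = re.findall(r'(http.*)&', a)

# Non-greedy: .*? grabs as little as it can, so the match stops at the first &.
lazy = re.findall(r'(http.*?)&', a)

print(greedy)
print(lazy)
```

Note that both patterns require an & after the URL, so neither matches if the URL happens to be the last thing in the string.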

hlt
1

Here is a simple regular expression that will do the job correctly in most cases: http://[^&]*

Here [^&]* means: match any character other than &, as many times as possible. However, a better regular expression would match only the characters allowed in a URL (not all characters, as in my example).

A dedicated tool may be the best option, but depending on the complexity of the task, a regular expression might be just fine and the simpler approach.
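
As a rough sketch of how this pattern behaves in Python, the snippet below applies it to a shortened version of the string from the question and to a second, hypothetical string in which the URL is the last parameter. Because [^&]* simply stops at the next & or at the end of the string, no trailing & is needed.

```python
import re

pattern = re.compile(r'http://[^&]*')

# Shortened version of the string from the question.
a = "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ved=0cceqfjaa"
first = pattern.search(a).group(0)

# Hypothetical input where the URL is the last parameter, so no & follows it.
b = "/url?sa=u&q=http://www.linkedin.com/in/sujachandrasekaran"
second = pattern.search(b).group(0)

print(first)
print(second)
```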

Boris D. Teoharov
0

You can use this expression and select the first group:

/url\?q=([^&]+)

This will select everything after /url?q= and before &.

It also supports other URL schemes such as https and ftp.
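
A small sketch of this pattern in Python follows. The https and ftp sample strings are invented for illustration. Note that the pattern assumes q is the first parameter directly after /url? and that it does not decode percent-escapes.

```python
import re

pattern = re.compile(r'/url\?q=([^&]+)')

samples = [
    # Shortened version of the string from the question.
    "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ved=0cbqqfjaa",
    # Hypothetical inputs with other schemes.
    "/url?q=https://example.com/page&sa=u",
    "/url?q=ftp://example.com/file.txt&sa=u",
]

# Group 1 holds everything between "/url?q=" and the next "&".
results = [pattern.search(s).group(1) for s in samples]

for result in results:
    print(result)
```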

Gerhard Powell
0
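#!/usr/bin/python

import re

a = "/url?q=http://www.linkedin.com/in/sujachandrasekaran&sa=u&ei=1jxuu8qxgtwaygs_u4gaaq&ved=0cceqfjaa&usg=afqjcnfl2pecdcddktw_pw9nelfohjp0ca"

# Everything before the first "&" is "/url?q=<target URL>".
output = re.split("&", a)

# Split that piece on "=" and take the part after "q".
# Note: this truncates the result if the target URL itself contains "=".
final = re.split("=", output[0])

# Parenthesized print works on both Python 2 and Python 3.
print(final[1])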
John F