I read this thread about extracting URLs from a string (https://stackoverflow.com/a/840014/326905). It works really nicely: I got all the URLs from an XML document containing http://www.blabla.com with

>>> import re
>>> s = '''<link href="http://www.blabla.com/blah" />
...          <link href="http://www.blabla.com" />'''
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']

But I can't figure out how to customize the regex to omit the double quote at the end of the URL.

First I thought that this was the solution

re.findall(r'(https?://\S+\")', s)

or this

re.findall(r'(https?://\S+\Z")', s)

but it isn't.

Can somebody help me out and tell me how to omit the double quote at the end?

By the way, the question mark after the "s" in https means the "s" may or may not occur. Am I right?

surfi
  • 1
    NEVER ever ever ever ever parse html with regex http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html – That1Guy Mar 21 '13 at 14:40
  • You should also read the thread [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Abhijit Mar 21 '13 at 14:41
  • 1
    If you use an HTML parser like BeautifulSoup, this problem will become easier than using regexes. – Waleed Khan Mar 21 '13 at 14:41
  • 3
    He isn't really parsing HTML... he's mining links from a document. This is a perfectly acceptable use of regex. – Daedalus Mar 21 '13 at 14:45
  • Yeah, I'm parsing XML, sorry, but in this case it's the same issue as with HTML – surfi Mar 21 '13 at 14:49
  • Yes, the question mark as you've used it means the "s" is optional. – Kenneth K. Mar 21 '13 at 14:56
  • Thanks Kenneth K. for the answer. I can understand that it's bad to regex a HTML. In my case it's a valid XML. So come on ;-) Of course hardcore standards fanatics ... codinghorror – surfi Mar 21 '13 at 15:32
  • My opinion has always been: Know what you are doing. Sure, full on parsing of HTML is not going to be feasible for HTML (or XML), but stripping out various pieces of it is certainly practical. The problem the uninitiated have is that they think regex is a golden hammer, and unfortunately they don't fully understand how the hammer works. This is why they end up in regex hell trying to navigate HTML with regex. For your needs, I think you're fine. – Kenneth K. Mar 21 '13 at 18:04

5 Answers

2
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> # the sample uses <link> elements, so select their href attributes
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
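
Since the input is said to be valid XML, the same idea can be sketched with only the standard library. This is a minimal sketch, not part of the original answer: the `<root>` wrapper is a hypothetical element added here because the two-line sample on its own has no single root, and a real document would be passed to the parser directly.

```python
import xml.etree.ElementTree as ET

# Sample input from the question, written as one string.
s = '<link href="http://www.blabla.com/blah" />\n<link href="http://www.blabla.com" />'

# Assumption: wrap the fragment in a hypothetical <root> element so that
# it becomes a well-formed document with a single root.
root = ET.fromstring('<root>' + s + '</root>')

# Collect the href attribute of every <link> element that has one.
hrefs = [el.get('href') for el in root.iter('link') if el.get('href')]
print(hrefs)
```

Because the parser returns attribute values without their delimiters, the trailing quote never appears in the result.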
Drover
1

You want the double quotes to appear as a look-ahead:

re.findall(r'(https?://\S+)(?=\")', s)

This way they won't appear as part of the match. Also, yes the ? means the character is optional.

See example here: http://regexr.com?347nk
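
To make the difference concrete, here is a small sketch that runs the original pattern and the look-ahead pattern side by side on the question's sample. The variable names are illustrative only, and the sample is rewritten as a single-line string literal.

```python
import re

# Sample input from the question, written as one valid string literal.
s = '<link href="http://www.blabla.com/blah" />\n<link href="http://www.blabla.com" />'

# Without the look-ahead, \S+ also consumes the closing quote.
greedy = re.findall(r'https?://\S+', s)

# With the look-ahead, a quote must follow the match but is not part of it.
lookahead = re.findall(r'https?://\S+(?=")', s)

# The ? makes the preceding "s" optional, so both schemes match.
optional_s = re.findall(r'https?://', 'http:// https://')

print(greedy)      # ['http://www.blabla.com/blah"', 'http://www.blabla.com"']
print(lookahead)   # ['http://www.blabla.com/blah', 'http://www.blabla.com']
print(optional_s)  # ['http://', 'https://']
```

Note that `\S+` is still greedy: if several quotes follow the URL with no whitespace in between, the match extends to the last of them.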

Daedalus
1

I used to extract URLs from text through this piece of code:

import re

# Python 3: a plain raw string is enough; the re module itself interprets \xab and \u201c
url_rgx = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case (text is the input string to scan)
text = text.lower()
# findall returns one tuple per match; the first group holds the whole URL
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missing
urls = ['http://%s' % url[0] if not url[0].startswith('http') else url[0] for url in matches]
print(urls)

It works great!

Thanasis Petsas
1

You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:

re.findall(r'(https?://[^\s"]+)', s)

This still says "one or more characters that are not whitespace," but now also excludes double quotes. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
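
A short sketch of the negated character class in action follows. The second pattern, which also excludes `<` and `>`, is an assumption added here for text where a URL is followed directly by markup; it is not part of the pattern above.

```python
import re

# Sample input from the question, written as one valid string literal.
s = '<link href="http://www.blabla.com/blah" />\n<link href="http://www.blabla.com" />'

# Stop at the first whitespace character or double quote.
urls = re.findall(r'https?://[^\s"]+', s)

# Assumption: additionally stop at < and > so that trailing markup
# characters are not swallowed either.
text = 'asd http://www.blabla.com> asdf'
urls_strict = re.findall(r'https?://[^\s"<>]+', text)

print(urls)         # ['http://www.blabla.com/blah', 'http://www.blabla.com']
print(urls_strict)  # ['http://www.blabla.com']
```

The same approach extends to any other delimiter: add it to the class and the match ends just before it.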

Kenneth K.
0

Thanks. I just read this https://stackoverflow.com/a/13057368/326905

and tried this, which also works:

re.findall(r'"(https?://\S+)"', urls) 
surfi
  • yes, but if in the text there is a URL with other character such as "><" this will not work. For example for this text: "asd http://www.blabla.com> asdf" it will return: ['http://www.blabla.com>'] which is wrong! – Thanasis Petsas Mar 21 '13 at 14:57