Extracting URLs in Python from XML

Question

I read this thread about extracting URLs from a string: https://stackoverflow.com/a/840014/326905 It works nicely: I got all the URLs from an XML document containing http://www.blabla.com with

>>> import re
>>> s = '''<link href="http://www.blabla.com/blah" />
...          <link href="http://www.blabla.com" />'''
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']

But I can't figure out how to customize the regex to omit the double quote at the end of the URL.

First I thought that this was the solution

re.findall(r'(https?://\S+\")', s)

or this

re.findall(r'(https?://\S+\Z")', s)

but it isn't.

Can somebody help me out and tell me how to omit the double quote at the end?

By the way, the question mark after the "s" in https means the "s" may or may not occur. Am I right?

NEVER ever ever ever ever parse html with regex http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html — That1Guy, Mar 21 '13 at 14:40
You should also read the thread [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Abhijit, Mar 21 '13 at 14:41
If you use an HTML parser like BeautifulSoup, this problem will become easier than using regexes. — Waleed Khan, Mar 21 '13 at 14:41
He isn't really parsing HTML... he's mining links from a document. This is a perfectly acceptable use of regex. — Daedalus, Mar 21 '13 at 14:45
Yeah, I'm parsing XML, sorry, but in this case it's the same issue as with HTML — surfi, Mar 21 '13 at 14:49
Yes, the question mark as you've used it means the "s" is optional. — Kenneth K., Mar 21 '13 at 14:56
Thanks Kenneth K. for the answer. I can understand that it's bad to regex a HTML. In my case it's a valid XML. So come on ;-) Of course hardcore standards fanatics ... codinghorror — surfi, Mar 21 '13 at 15:32
My opinion has always been: Know what you are doing. Sure, full-on parsing with regex is not going to be feasible for HTML (or XML), but stripping out various pieces of it is certainly practical. The problem the uninitiated have is that they think regex is a golden hammer, and unfortunately they don't fully understand how the hammer works. This is why they end up in regex hell trying to navigate HTML with regex. For your needs, I think you're fine. — Kenneth K., Mar 21 '13 at 18:04

score 2 · Answer 1 · answered Mar 21 '13 at 15:09

>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')  # href attribute of every <link> element
['http://www.blabla.com/blah', 'http://www.blabla.com']
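
Since the input is said to be valid XML, the same parser-based idea can probably be done with the standard library alone. This is only a minimal sketch: the `<root>` wrapper and the name `xml_text` are assumptions added so the sample is well-formed, they are not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's two <link> elements wrapped in an
# assumed <root> element so that the snippet is well-formed XML.
xml_text = """<root>
  <link href="http://www.blabla.com/blah" />
  <link href="http://www.blabla.com" />
</root>"""

root = ET.fromstring(xml_text)

# iter("link") walks every <link> element in document order;
# .get("href") returns the attribute value, or None if it is missing.
hrefs = [link.get("href") for link in root.iter("link") if link.get("href")]
print(hrefs)  # ['http://www.blabla.com/blah', 'http://www.blabla.com']
```

If the real document declares an XML namespace (Atom feeds do, for example), the tag passed to `iter` would need the `{namespace}link` form instead of plain `link`.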

Drover

score 1 · Answer 2 · answered Mar 21 '13 at 14:42

You want the double quotes to appear as a look-ahead:

re.findall(r'(https?://\S+)(?=\")', s)

This way they won't appear as part of the match. Also, yes, the ? means the preceding character is optional.

See example here: http://regexr.com?347nk
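
A small sketch of both points, the look-ahead and the optional "s". The sample string is an assumption: it is the question's input with the second URL switched to https so that both schemes appear.

```python
import re

# Assumed sample: like the question's input, but the second URL uses https.
s = '<link href="http://www.blabla.com/blah" /> <link href="https://www.blabla.com" />'

# "s?" makes only the preceding "s" optional, so both schemes match.
schemes = re.findall(r'https?', s)

# (?=") asserts that a double quote follows without consuming it,
# so the quote stays out of the match.
urls = re.findall(r'https?://\S+(?=")', s)

print(schemes)  # ['http', 'https']
print(urls)     # ['http://www.blabla.com/blah', 'https://www.blabla.com']
```

Because `\S+` is greedy, the regex engine first runs to the next whitespace and then backs up to the last position that is followed by a quote.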

Daedalus

Thanasis Petsas · Answer 3 · 2013-03-21T14:51:53.883

I used to extract URLs from text through this piece of code:

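import re

# Python 3: a plain raw string replaces the Python 2 ur'' prefix; the re module
# itself interprets the \xab and \u201c-style escapes inside the pattern.
url_rgx = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case (note: this also lowercases URL paths);
# text is the input string to scan
text = text.lower()
matches = url_rgx.findall(text)
# prepend 'http://' when the scheme is missing
urls = ['http://%s' % url[0] if not url[0].startswith('http') else url[0] for url in matches]
print(urls)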

It works great!

score 1 · Accepted Answer · answered Mar 21 '13 at 15:06

You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:

re.findall(r'(https?://[^\s"]+)', s)

This still says "one or more characters that are not whitespace," but has the addition of not including double quotes either. So the overall expression is "one or more characters that are neither whitespace nor a double quote."
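
To make the difference concrete, here is a short side-by-side sketch on the question's sample (joined onto one line here; the names `greedy` and `trimmed` are only illustrative):

```python
import re

s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'

# \S matches any non-whitespace character, and a quote is one of them,
# so the closing quote ends up inside the match.
greedy = re.findall(r'https?://\S+', s)

# [^\s"] excludes whitespace and the double quote, so the match
# stops right before the closing quote.
trimmed = re.findall(r'https?://[^\s"]+', s)

print(greedy)   # ['http://www.blabla.com/blah"', 'http://www.blabla.com"']
print(trimmed)  # ['http://www.blabla.com/blah', 'http://www.blabla.com']
```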

score 0 · Answer 5 · edited May 23 '17 at 12:28

Thanks. I just read this https://stackoverflow.com/a/13057368/326905

and tried this, which also works:

re.findall(r'"(https?://\S+)"', urls)

answered Mar 21 '13 at 14:46

surfi

Yes, but if the text contains a URL followed by another character such as ">" or "<", this will not work. For example, for the text "asd http://www.blabla.com> asdf" it will return ['http://www.blabla.com>'], which is wrong! – Thanasis Petsas Mar 21 '13 at 14:57
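
One possible way to handle the case from this comment is to widen the set of characters that end a URL. The sketch below is an assumption, not something proposed in the thread: `URL_RE` and `find_urls` are hypothetical names, and the choice of stop characters (whitespace, both quote types, angle brackets) is only a guess at what is reasonable for markup.

```python
import re

# Hypothetical pattern: a URL ends at whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def find_urls(text):
    """Return every http(s) URL found in text, without trailing markup characters."""
    return URL_RE.findall(text)

print(find_urls('asd http://www.blabla.com> asdf'))
# ['http://www.blabla.com']
print(find_urls('<link href="http://www.blabla.com/blah" />'))
# ['http://www.blabla.com/blah']
```

This still keeps trailing punctuation such as a full stop at the end of a sentence, so it is best suited to attribute values rather than free text.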
