Extracting URL from a string

Question

I'm just starting with regular expressions in Python and came across this problem, where I'm supposed to extract the URLs from the string:

str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

The code I have is:

import re

url = re.findall('<tag>(.*)</tag>', str)

print(url)

returns:

['http://example-1.com</tag><tag>http://example-2.com']

If anyone could point me in the right direction on how I might approach this problem, it would be most appreciated!

Thanks everyone!

Use the non-greedy `.*?` instead of the greedy `.*`, or use `[^>]*` instead of `.*`, or, best of all, use an HTML parser. – Pushpesh Kumar Rajwanshi Apr 01 '19 at 10:29
Oh wow, thanks! That worked perfectly! I'll go read up on greedy and non-greedy matching a bit more. I did consider a parser, but I wanted to try it in RE since it was a question under that topic. Thank you so much! – Cuppy Apr 01 '19 at 10:30
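For anyone comparing the regex options from the comment above, here is a minimal sketch that runs them side by side on the string from the question. The variable names (`text`, `lazy`, `negated`, `negated_lt`) are only illustrative, and the `[^<]*` variant is an extra assumption that is not in the comment; it is a common way to write the same idea.

```python
import re

# Sample input from the question (named `text` to avoid shadowing the built-in `str`).
text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Option 1: lazy quantifier. `.*?` grows one character at a time
# until the following `</tag>` can match.
lazy = re.findall(r"<tag>(.*?)</tag>", text)

# Option 2: negated character class as suggested in the comment.
# `[^>]*` cannot cross a `>`, so it is forced to stay inside one element.
negated = re.findall(r"<tag>([^>]*)</tag>", text)

# Variant (assumption, not from the comment): stop at the next `<` instead.
negated_lt = re.findall(r"<tag>([^<]*)</tag>", text)

print(lazy)
print(negated)
print(negated_lt)
```

On this input all three patterns give the same two URLs. They can differ on other inputs, for example when the content between the tags itself contains `<` or `>`.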

Ion Batîr · Answer 1 · 2019-04-01T10:31:30.910

2

You are using a regular expression, and matching HTML with regular expressions gets too complicated, too fast.

You can use BeautifulSoup to parse HTML.

For example:

from bs4 import BeautifulSoup

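# Named `html` rather than `str` to avoid shadowing the built-in type
html = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(html, 'html.parser')
# Collect every <tag> element and print its text content
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)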

edited Apr 01 '19 at 10:31

answered Apr 01 '19 at 10:20

Ion Batîr


1

Thank you very much! I am going through an online course and wanted to keep it within the topic of RE so I decided to try it with just the RE library (which is still a mystery to me..!) – Cuppy Apr 01 '19 at 10:46

score 1 · Answer 2 · answered Apr 01 '19 at 10:35

1

Using only the re package:

import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)

returns:

['http://example-1.com', 'http://example-2.com']

Hope it helps!

answered Apr 01 '19 at 10:35

JLD


Thanks! That worked! I'm now trying to figure out greedy and non-greedy, and why it would work in this instance. – Cuppy Apr 01 '19 at 10:48
Here the `.*?` matches as few characters as possible (lazy), i.e. it stops as soon as it finds the first closing `</tag>`, whereas the greedy `.*` matches as many characters as possible, so it runs on to the 2nd closing `</tag>` before the whole pattern matches. – guroosh Apr 01 '19 at 11:20
Oh wow, your explanation made so much more sense than all the tutorials I have been reading/watching! Thanks! – Cuppy Apr 02 '19 at 04:49
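To make the greedy versus lazy difference from the comments above concrete, here is a small sketch that uses `re.search` to look at a single match from each pattern and where it ends. The names `text`, `greedy`, and `lazy` are just for illustration.

```python
import re

# Same sample string as in the question.
text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Greedy: `.*` first grabs everything to the end of the string, then backs
# off only far enough for `</tag>` to match, which is the LAST closing tag.
greedy = re.search(r"<tag>(.*)</tag>", text)

# Lazy: `.*?` starts empty and extends only until `</tag>` can match,
# which is the FIRST closing tag.
lazy = re.search(r"<tag>(.*?)</tag>", text)

print(greedy.group(1))  # spans both URLs and the tags between them
print(lazy.group(1))    # only the first URL

# The greedy match ends at the end of the string; the lazy one ends earlier.
print(greedy.end() == len(text), lazy.end() < greedy.end())
```

Since `re.findall` keeps searching from where the previous match ended, the lazy pattern leaves the second `<tag>...</tag>` available for a second match, while the greedy pattern has already consumed the whole string in one match.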
