I'm just starting out with regular expressions in Python and came across this problem, where I'm supposed to extract the URLs from this string:

str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

The code I have is:

import re

url = re.findall('<tag>(.*)</tag>', str)

print(url)

returns:

['http://example-1.com</tag><tag>http://example-2.com']

If anyone could point me in the right direction on how to approach this problem, I would be most appreciative!

Thanks everyone!

Cuppy
  • Use the non-greedy `.*?` instead of the greedy `.*`, or use `[^>]*` instead of `.*`, or, best of all, use an HTML parser. – Pushpesh Kumar Rajwanshi Apr 01 '19 at 10:29
  • Oh wow, thanks! That worked perfectly! I'll go read up on greedy and non-greedy ones a bit more! I did consider a parser, but I wanted to try it in RE since it was a question under that topic. Thank you so much! – Cuppy Apr 01 '19 at 10:30
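
For anyone curious about the `[^>]*` suggestion in the comment above, here is a minimal sketch of how it could look. The variable name `text` is my own choice (used so the built-in `str` is not shadowed); the pattern itself is the one from the comment.

```python
import re

text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# [^>]* matches any run of characters that are not '>'.
# The group therefore cannot cross the '>' that ends the first </tag>,
# so each match stays inside a single <tag>...</tag> pair.
urls = re.findall(r'<tag>([^>]*)</tag>', text)
print(urls)  # ['http://example-1.com', 'http://example-2.com']
```

Writing the class as `[^<]*` should give the same result on this input, and arguably reads more naturally, since the URL ends where the next `<` begins.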

2 Answers

You are using a regular expression, and matching HTML with such expressions gets too complicated, too fast.

You can use BeautifulSoup to parse HTML.

For example:

from bs4 import BeautifulSoup

str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
soup = BeautifulSoup(str, 'html.parser')
tags = soup.find_all('tag')
for tag in tags:
    print(tag.text)

Ion Batîr
  • Thank you very much! I am going through an online course and wanted to keep it within the topic of RE, so I decided to try it with just the RE library (which is still a mystery to me!) – Cuppy Apr 01 '19 at 10:46

Using only the `re` package:

import re
str = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"
url = re.findall('<tag>(.*?)</tag>', str)
print(url)

returns:

['http://example-1.com', 'http://example-2.com']

Hope it helps!

JLD
  • Thanks! That worked! I'm now trying to figure out greedy and non-greedy matching and why it works in this instance. – Cuppy Apr 01 '19 at 10:48
  • Here `.*?` matches as few characters as possible (lazy), i.e. it stops as soon as it finds the first closing `</tag>`, whereas the greedy `.*` matches as many characters as possible, so it runs on to the 2nd closing `</tag>` and captures everything in between. – guroosh Apr 01 '19 at 11:20
  • Oh wow, your explanation made so much more sense than all the tutorials I have been reading/watching! Thanks! – Cuppy Apr 02 '19 at 04:49
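
To see the difference described in the comments side by side, here is a small sketch that runs both patterns on the string from the question. It uses `re.search` instead of `re.findall` so that only the first match of each pattern is shown; the variable names `text`, `greedy` and `lazy` are mine.

```python
import re

text = "<tag>http://example-1.com</tag><tag>http://example-2.com</tag>"

# Greedy: .* grabs as much as it can, so the group only ends
# at the LAST </tag> in the string.
greedy = re.search(r'<tag>(.*)</tag>', text)

# Lazy: .*? grabs as little as it can, so the group ends
# at the FIRST </tag> after the opening <tag>.
lazy = re.search(r'<tag>(.*?)</tag>', text)

print(greedy.group(1))  # http://example-1.com</tag><tag>http://example-2.com
print(lazy.group(1))    # http://example-1.com
```

With `re.findall`, the lazy pattern simply repeats this from where the previous match ended, which is why it returns both URLs separately.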