I'm writing a Python 3 program, shown below. I want to capture the value of every href
attribute, i.e. whatever appears between the quotes, e.g., <a href="..."></a>
The code below works well for this task, but after looking at data from multiple sites, I'd like the regex to capture a value only if it contains a ".".
So for instance, if the data returned is: http://money.cnn.com, /health, /asia
I'd only want to see: http://money.cnn.com
I've tried various flavors of regex lookaround assertions, but the catch is that I want the period to be part of the captured text as well as act as the filter.
I also realize I could achieve this with a subsequent filter() call or list comprehension, but I was hoping to do it using the regex alone.
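For comparison, the list-comprehension fallback I mean would look roughly like this. It is only a sketch: the `hrefs` list here is hard-coded sample data standing in for the result of `re.findall`, and `with_dot` is a name made up for illustration.

```python
# Hypothetical sample data, standing in for the list returned by re.findall.
hrefs = ["http://money.cnn.com", "/health", "/asia"]

# Keep only the values that contain a period.
with_dot = [h for h in hrefs if "." in h]
print(with_dot)
```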
This regex appears to work:
href=["\'](?=[^\.]*[\.])(.*?)["\']
but not in the Python code below. (Also see here: https://ibb.co/hQwjPm)
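import codecs
import re
import urllib.request  # needed for urlopen below

# Fetch the page as bytes, decode it (UTF-8 by default), and collect every href value.
response = urllib.request.urlopen('http://www.cnn.com').read()
hrefs = re.findall(r'href=["\'](.*?)["\']', codecs.decode(response))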
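To make the goal concrete, here is a minimal sketch of the behavior I'm after, run against a hard-coded string rather than a live page. The sample markup and the names `sample_html` and `pattern` are made up for illustration. One thing I noticed while writing it: in the lookahead pattern above, `[^\.]*` is free to scan past the closing quote, so a period anywhere later in the page satisfies it. That may or may not be the whole story for my problem, but the variant below excludes quotes from the lookahead and from the capture so the period has to occur inside the attribute value.

```python
import re

# Hypothetical sample markup, used here instead of a live page.
sample_html = (
    '<a href="http://money.cnn.com">Money</a>'
    '<a href="/health">Health</a>'
    "<a href='/asia'>Asia</a>"
    '<a href="/video/index.html">Video</a>'
)

# The lookahead skips characters that are neither a quote nor a period,
# then requires a period, so it cannot run past the closing quote.
# The capture group likewise stops at the first quote.
pattern = re.compile(r'''href=["'](?=[^"'.]*\.)([^"']*)["']''')

hrefs = pattern.findall(sample_html)
print(hrefs)
```

With this sample input, only the two values that contain a period should come back, and the period is still part of the captured text because a lookahead does not consume anything.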