How to use a conditional regex expression in a Python3 findall method?

Question

I'm writing the Python 3 program shown below. I want to capture every href attribute value, i.e. whatever appears between the quotes, e.g. <a href="..."></a>

The code below works well for this task, but after looking at data from multiple sites, I'd like the regex to capture a value only if it contains a ".".

So for instance if the data that's returned is: http://money.cnn.com, /health, /asia

I'd only want to see: http://money.cnn.com

I've tried various flavors of regex assertions, but the catch is that I want the period to be part of the captured text as well as act as the filter.

I also realize I could use a succeeding list filter or list comprehension filter to achieve this, but I was hoping to do it using regex.

This regex appears to work:

href=["\'](?=[^\.]*[\.])(.*?)["\']

but not in the Python code below. (Also see here: https://ibb.co/hQwjPm)

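import codecs
import re
import urllib.request  # needed for urlopen; missing from the original snippet

# Fetch the page and collect every quoted href value
response = urllib.request.urlopen('http://www.cnn.com').read()
hrefs = re.findall(r'href=["\'](.*?)["\']', codecs.decode(response))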
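A likely reason the lookahead pattern quoted above misbehaves on a whole page: [^\.]* also matches quotes and newlines, so the lookahead can be satisfied by a period that belongs to a later href. The following is a minimal sketch, not a definitive fix, that compares the original lookahead with one bounded to the current attribute value. SAMPLE_HTML and the names UNBOUNDED and BOUNDED are made up for illustration and stand in for the fetched page.

```python
import re

# Hypothetical sample standing in for the downloaded page
SAMPLE_HTML = (
    '<a href="http://money.cnn.com">Money</a>\n'
    '<a href="/health">Health</a>\n'
    "<a href='/asia'>Asia</a>\n"
    '<a href="/index.html">Home</a>\n'
)

# Lookahead from the question: [^\.]* may run past the closing quote
UNBOUNDED = re.compile(r'href=["\'](?=[^\.]*[\.])(.*?)["\']')

# Assumed variant: the lookahead may not cross a quote, so the period
# must occur inside the attribute value itself
BOUNDED = re.compile(r'href=["\'](?=[^"\'.]*\.)(.*?)["\']')

unbounded_hrefs = UNBOUNDED.findall(SAMPLE_HTML)
bounded_hrefs = BOUNDED.findall(SAMPLE_HTML)

print(unbounded_hrefs)  # every href, because a period exists somewhere later
print(bounded_hrefs)    # only the hrefs that themselves contain a period
```

On this sample the unbounded pattern still returns /health and /asia, while the bounded one keeps only http://money.cnn.com and /index.html.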

1. Use an HTML parser, not regex, for parsing HTML. 2. Once you have them, either way, what stops you from checking `'.' in href`? — jonrsharpe, Dec 16 '17 at 20:26
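The parser-plus-filter suggestion in the comment above could look roughly like this, using only the standard library's html.parser. HrefCollector and dotted_hrefs are hypothetical names chosen for this sketch, and it assumes href values from any tag (not just a) are of interest.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects href attribute values that contain a period."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "href" and value and "." in value:
                self.hrefs.append(value)


def dotted_hrefs(html_text):
    """Return the href values in html_text that contain a '.'."""
    collector = HrefCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


sample = (
    '<a href="http://money.cnn.com">Money</a>'
    '<a href="/health">Health</a>'
    "<a href='/asia'>Asia</a>"
)
print(dotted_hrefs(sample))
```

Here the filtering happens in plain Python via '.' in value, so no regex is needed at all.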
For reference: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Your specific question is a bit narrower than parsing all of HTML though, so you probably can make a regex work if you really want to (though an HTML parser might be easier). I'd think `r'href=["\']([^"\']*\.[^"\']*)["\']'` would work (you don't need a lookahead). — Blckknght, Dec 16 '17 at 23:54
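As a rough illustration of the two regex-side options mentioned in the comments, the sketch below applies the character-class pattern (no lookahead) and, for comparison, the original pattern followed by a '.' in href filter. The sample string and the function names hrefs_with_dot and hrefs_filtered are assumptions made for this example.

```python
import re

# Pattern from the comment: the value is made of non-quote characters
# and must contain at least one literal period
HREF_WITH_DOT = re.compile(r'href=["\']([^"\']*\.[^"\']*)["\']')

# Pattern from the question: captures every quoted href value
ANY_HREF = re.compile(r'href=["\'](.*?)["\']')


def hrefs_with_dot(html_text):
    """Filter inside the regex itself."""
    return HREF_WITH_DOT.findall(html_text)


def hrefs_filtered(html_text):
    """Capture everything, then filter with a list comprehension."""
    return [href for href in ANY_HREF.findall(html_text) if '.' in href]


# Hypothetical sample standing in for the decoded page
sample = (
    '<a href="http://money.cnn.com">Money</a>\n'
    '<a href="/health">Health</a>\n'
    "<a href='/asia'>Asia</a>\n"
)
print(hrefs_with_dot(sample))
print(hrefs_filtered(sample))
```

Both functions return the same list on this sample; the character class keeps the match from running past the closing quote, which is why no lookahead is required.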
