0

I'm writing a Python3 program that appears below. I want to capture all the href attributes so that any data appearing between the double quotes is captured. e.g., < a href="...">< /a>

The below code works well for this task, but after looking at data from multiple sites, it'd be nice to only capture data in the regex if a "." is part of the captured group.

So for instance if the data that's returned is: http://money.cnn.com, /health, /asia

I'd only want to see: http://money.cnn.com

I've tried various flavors of regex-assertions but the thing is I want to include the period as well as use it as a filter.

I also realize I could use a succeeding list filter or list comprehension filter to achieve this, but I was hoping to do it using regex.

This regex appears to work:

href=["\'](?=[^\.]*[\.])(.*?)["\']

but not in the Python code below. (Also see here: https://ibb.co/hQwjPm)

import codecs
import re
response = urllib.request.urlopen('http://www.cnn.com').read()   
hrefs = re.findall(r'href=["\'](.*?)["\']', codecs.decode(response)) 
Charles Saag
  • 611
  • 5
  • 20
  • 1
    1. Use an HTML parser, not regex, for parsing HTML. 2. Once you have them, either way, what stops you from checking `'.' in href`? – jonrsharpe Dec 16 '17 at 20:26
  • jon, i was hoping to use regex to do this – Charles Saag Dec 16 '17 at 20:27
  • 1
    For reference: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Your specific question is a bit narrower than parsing all of HTML though, so you probably can make a regex work if you really want to (though an HTML parser might be easier). I'd think `r'href=["\']([^"\']*\.[^"\']*)["\']'` would work (you don't need a lookahead). – Blckknght Dec 16 '17 at 23:54
  • Blckknght, thanks much! Very clever! – Charles Saag Dec 17 '17 at 00:38

0 Answers0