I need to use regular expressions to extract some data from a website. The data all shares the same HTML structure; only the text differs. A simplified example:
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t
I am only interested in 'Gender', which means I want to extract 'Herr' and 'Dam' only.
So far I have come up with two attempts, neither of which works:
m = re.findall(r"Gender.+?<div.+?>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)
gives:
['Herr']
I guess this is because the match is non-greedy.
But if I make the second quantifier greedy:
re.findall(r"Gender.+?<div.+>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)
It returns:
['Boots']
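To check where each pattern actually stops, here is a small sketch using re.search on the same sample string. The names lazy and greedy are only for illustration. It suggests that each pattern matches exactly once with a single capture group, which would explain why findall returns a single item in both cases:

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t

# Same two patterns as above, written as raw strings.
lazy = re.search(r"Gender.+?<div.+?>([\w ]+)</.+?</div>", s, re.DOTALL)
greedy = re.search(r"Gender.+?<div.+>([\w ]+)</.+?</div>", s, re.DOTALL)

# The lazy match captures the first link text and ends with the Gender block.
print(lazy.group(1), lazy.end() == len(p))

# The greedy ".+" runs to the end of s and backtracks only as far as the
# last ">" that is followed by word characters, which is the Boots link.
print(greedy.group(1), greedy.end() == len(s))
```

In both cases the whole Gender block is consumed by one match, so there is nothing left for findall to match a second time.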
How can I get both 'Herr' and 'Dam', and nothing more?
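One direction that might work is a two-step approach: first isolate the div that follows the 'Gender' label, then collect the link texts inside it. This is only a sketch on the sample string. It assumes the label is always followed directly by `</p></div><div ...>` and that the value div contains no nested div, neither of which I have verified against the real pages. The names block and genders are just for illustration:

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t

# Step 1: capture the contents of the <div> right after the 'Gender' label.
# Assumes the label/value layout shown in the sample, with no nested <div>.
block = re.search(r"Gender:</p></div><div[^>]*>(.*?)</div>", s, re.DOTALL)

# Step 2: collect the text of every <a> element inside that block only.
genders = re.findall(r"<a[^>]*>([^<]+)</a>", block.group(1)) if block else []

print(genders)  # ['Herr', 'Dam']
```

Because the second findall only sees the captured block, links from other sections such as 'Kategori' cannot leak into the result.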