Python regex: getting text from html elements with similar structure

Question

For some reason I need to use regular expressions to extract some data from a web site. The data all share a similar HTML structure; only the text differs. For simplicity, I show it this way:

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t

I am only interested in 'Gender' which means I want to extract 'Herr' and 'Dam' only.

So far I have come up with two options, neither of which works:

import re
m = re.findall(r"Gender.+?<div.+?>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)

gives:

['Herr']

I guess because it is non-greedy

But if I make it greedy:

re.findall(r"Gender.+?<div.+>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)

It returns:

['Boots']

So I am struggling to figure out how to get both 'Herr' and 'Dam' and nothing more?

So if you know beforehand that only "Herr" and "Dam" will be what you want, why not search only for this? Unless you want to generalize this for other possible values. — Shan, Oct 02 '18 at 14:55
Exactly. I want to generalize this for other possible values of 'Gender' and not only — Agenobarb, Oct 02 '18 at 14:57

mad_ · Answer 1 · 2018-10-02T15:53:53.627

You can use BeautifulSoup like this:

from bs4 import BeautifulSoup

a = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
soup = BeautifulSoup(a, "html.parser")
# Proceed only if the markup mentions "Gender" at all
if 'Gender' in str(soup.find_all('div')):
    for ana in soup.find_all('div'):
        for i in ana.find_all('a'):
            # next_element of an <a> tag is its first child, here the link text
            print(i.next_element)

Output:

Herr
Dam

I would recommend adding a name attribute to the divs so that it is easier to identify the correct tags:

p = '<div name="Gender" class="col-xs-6"><p>Gender:</p></div><div name="Gender" class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
a = p + t

soup = BeautifulSoup(a, "html.parser")
# Select only the divs tagged with name="Gender"
for ana in soup.find_all('div', {"name": "Gender"}):
    for i in ana.find_all('a'):
        print(i.next_element)

Output:

Herr
Dam
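
If the HTML cannot be modified, one alternative is to locate the "Gender:" label itself and read the links from the div that follows it. This is only a sketch: it assumes the label/value layout shown in the question (a `<p>` label inside one div, with the values in the next sibling div), and the names `label`, `value_div` and `genders` are just illustrative.

```python
from bs4 import BeautifulSoup

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
soup = BeautifulSoup(p + t, "html.parser")

# Find the <p> whose text is exactly "Gender:"
label = soup.find("p", string="Gender:")

genders = []
if label is not None:
    # Assumption: the values sit in the <div> right after the label's own <div>
    value_div = label.find_parent("div").find_next_sibling("div")
    if value_div is not None:
        genders = [link.get_text(strip=True) for link in value_div.find_all("a")]

print(genders)  # ['Herr', 'Dam']
```

Because the lookup starts from the label text, the "Kategori" links are never visited, so 'Boots' does not show up in the result.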

I know about BS. However, for some reason, in this case I would prefer a regex-based solution. I don't have solid knowledge of regular expressions, but I was under the impression that almost anything is possible with them in terms of text matching and extraction. I hope there is a way to do this with RE. — Agenobarb, Oct 02 '18 at 15:39
The problem with using regex is that forming a generalized rule is difficult. Can you modify the tags so that a name attribute is attached only to the divs associated with gender? That would make it a lot easier and would make more sense. — mad_, Oct 02 '18 at 15:46
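
For a regex-only route, one possible sketch is to split the work into two steps: first isolate the value div that follows the label, then collect every link text inside it. A single `findall` with one capture group returns only one value per "Gender" match, which is why both attempts in the question yield a single item. The helper name `extract_links_after_label` is hypothetical, and the pattern assumes the exact layout from the question (label in `<p>…:</p></div>`, no nested divs inside the value div); it is not a general HTML parser.

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t


def extract_links_after_label(html, label):
    """Return the link texts in the div that follows '<p>label:</p></div>'."""
    # Step 1: lazily capture the contents of the first div after the label's div.
    block = re.search(
        re.escape(label) + r":</p></div>\s*<div[^>]*>(.*?)</div>",
        html,
        re.DOTALL,
    )
    if block is None:
        return []
    # Step 2: pull the text out of every <a> inside the captured block only.
    return re.findall(r"<a[^>]*>([^<]+)</a>", block.group(1))


genders = extract_links_after_label(s, "Gender")
print(genders)  # ['Herr', 'Dam']
```

The same helper works for other labels, for example `extract_links_after_label(s, "Kategori")` gives `['Boots']`, and it returns an empty list when the label is absent.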
