
I need to use regular expressions to extract some data from a web site. The data items share the same HTML structure; only the text differs. For simplicity, I represent it like this:

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t

I am only interested in 'Gender', which means I want to extract only 'Herr' and 'Dam'.

So far I have come up with two options, neither of which works:

import re
m = re.findall(r"Gender.+?<div.+?>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)

gives:

['Herr']

I guess that is because the pattern is non-greedy.

But if I make it greedy:

re.findall(r"Gender.+?<div.+>([\w ]+)<\/.+?<\/div>", s, re.DOTALL)

It returns:

['Boots']

So I am struggling to figure out how to get both 'Herr' and 'Dam', and nothing more.
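A quick diagnostic, assuming the same `p`, `t` and `s` as above, suggests the cause is less about greediness than about match count: each pattern matches the string only once (there is a single 'Gender' to start from), and one match can fill the single capture group only once. The names `patterns` and `captured` are introduced here purely for illustration.

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t

# The two attempts from the question, written as raw strings.
patterns = {
    "lazy": r"Gender.+?<div.+?>([\w ]+)</.+?</div>",
    "greedy": r"Gender.+?<div.+>([\w ]+)</.+?</div>",
}

captured = {}
for name, pattern in patterns.items():
    # finditer yields one match object per non-overlapping match.
    matches = list(re.finditer(pattern, s, re.DOTALL))
    # One capture group per match, so one value per match at most.
    captured[name] = [m.group(1) for m in matches]
    print(name, len(matches), captured[name])
```

Both patterns report exactly one match, so neither can return 'Herr' and 'Dam' together; only the position at which the group lands differs.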

Agenobarb
  • So if you know beforehand that only "Herr" and "Dam" will be what you want, why not search only for this? Unless you want to generalize this for other possible values. – Shan Oct 02 '18 at 14:55
  • Exactly. I want to generalize this for other possible values of 'Gender' and not only – Agenobarb Oct 02 '18 at 14:57

1 Answer


You can use BeautifulSoup like this:

from bs4 import BeautifulSoup

a = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
soup = BeautifulSoup(a, "html.parser")
# Only proceed if the label 'Gender' appears somewhere in the divs
if 'Gender' in str(soup.find_all('div')):
    for ana in soup.find_all('div'):
        # Print the text of every link inside each div
        for i in ana.find_all('a'):
            print(i.get_text())

Output:

Herr
Dam
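Note that the `if 'Gender' in ...` check only confirms that the label occurs somewhere in the markup; with the combined `p + t` string from the question, the loops would also print 'Boots'. One possible way to narrow the search without touching the HTML is sketched below. It assumes the layout from the question, where the label sits in a `<p>` inside one div and the values are links in the next sibling div. The helper name `values_for_label` is hypothetical.

```python
from bs4 import BeautifulSoup

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
html = p + t


def values_for_label(html, label):
    """Return the link texts from the div that follows the label's div."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the label is rendered exactly as "<label>:" inside a <p>.
    label_p = soup.find("p", string=label + ":")
    if label_p is None:
        return []
    # Assumption: the values live in the next sibling <div> of the label's div.
    label_div = label_p.find_parent("div")
    value_div = label_div.find_next_sibling("div") if label_div else None
    if value_div is None:
        return []
    return [a.get_text(strip=True) for a in value_div.find_all("a")]


print(values_for_label(html, "Gender"))
```

Because the label is a parameter, the same helper can be pointed at 'Kategori' or any other row that follows this layout.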

I would recommend adding a name attribute to the divs, so that it is easier to identify the correct tags:

p = '<div name="Gender" class="col-xs-6"><p>Gender:</p></div><div name="Gender" class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
a = p + t

soup = BeautifulSoup(a, "html.parser")
# Select only the divs tagged with name="Gender"
for ana in soup.find_all('div', {"name": "Gender"}):
    for i in ana.find_all('a'):
        print(i.get_text())

Output:

Herr
Dam
mad_
  • I know about BS. However, in this case I would prefer a regex-based solution. I don't have solid knowledge of regular expressions, but I was under the impression that almost anything is possible with them in terms of text matching and extraction. I hope there is a way to do this with RE. – Agenobarb Oct 02 '18 at 15:39
  • The problem with using regex is that forming a generalized rule is difficult. Can you modify the tags so that the name attribute is attached only to the divs associated with gender? Then it would be a lot easier and would make more sense. – mad_ Oct 02 '18 at 15:46
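For the regex-only route requested in the comments, one possible sketch is to split the job into two passes: first isolate the value div that follows the label, then run `re.findall` on that fragment alone, so that each link becomes its own match. This assumes the structure shown in the question (label in `<p>…:</p>`, values in the immediately following div, no nested divs inside it). The function name `extract_links_for_label` is hypothetical.

```python
import re

p = '<div class="col-xs-6"><p>Gender:</p></div><div class="col-xs-6"><a href="/skor/herr">Herr</a>, <a href="/skor/dam">Dam</a></div>'
t = '<div class="col-xs-6"><p>Kategori:</p></div><div class="col-xs-6"><a href="/skor/kangor-boots">Boots</a></div>'
s = p + t


def extract_links_for_label(html, label):
    """Two-pass regex extraction of link texts for one labelled row."""
    # Pass 1: capture the contents of the div right after the label's div.
    # Assumption: that div contains no nested <div>, so the lazy (.*?)
    # stops at the correct closing tag.
    block = re.search(
        r"<p>\s*" + re.escape(label) + r":\s*</p>\s*</div>\s*<div[^>]*>(.*?)</div>",
        html,
        re.DOTALL,
    )
    if block is None:
        return []
    # Pass 2: every <a>...</a> in the fragment is a separate match,
    # so findall returns one captured text per link.
    return re.findall(r"<a\b[^>]*>([^<]+)</a>", block.group(1))


print(extract_links_for_label(s, "Gender"))
```

Keeping the label as a parameter lets the same function handle other rows such as 'Kategori', though it remains tied to this exact markup and would likely break if the site's layout changes.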