1

I have this HTML string:

x=""" <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

I want to get the correct value.

So I search for the text ad followed by </div>, then anything up to the next <div>, and then capture everything up to </div>.

I use this code:

re.findall(r'ad</div>.*<div>(.*)</div>',x,re.S)

I use the flag re.S because I want the dot to match newlines as well. I don't know how many lines there are between the divs, so I use .*.

I expected findall to return the correct value, but it returns the wrong value. Why? Does it search for the last div rather than the first one?

david

4 Answers

6

Because the quantifiers in your pattern are greedy.

Try the lazy versions:

re.findall(r'ad</div>.*?<div>(.*?)</div>',x,re.S)

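In your example, the greedy .* first consumes everything up to the end of the string and then backtracks one character at a time until the rest of the pattern can match. The first place where <div>(.*)</div> succeeds during that backtracking is the last <div>, so you get the wrong value. The capture group behaves the same way and runs up to the last </div>. The lazy .*? instead grows one character at a time and stops at the first <div>, and (.*?) stops at the first </div>.

Demo here: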

http://regex101.com/r/zY9xA3/1
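
As a quick check of the explanation above, here is a minimal sketch that runs both patterns against the string from the question. The variable names `greedy` and `lazy` are only illustrative.

```python
import re

# The string from the question; \n\n are real newlines inside the literal.
x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

# Greedy: the first .* runs to the end and backtracks to the LAST <div>.
greedy = re.findall(r'ad</div>.*<div>(.*)</div>', x, re.S)

# Lazy: .*? stops at the FIRST <div>, and (.*?) stops at the first </div>.
lazy = re.findall(r'ad</div>.*?<div>(.*?)</div>', x, re.S)

print(greedy)           # [' wrong value ']
print(lazy)             # [' correct value  ']
print(lazy[0].strip())  # correct value
```

The captured text keeps its surrounding spaces, so `strip()` is still needed if only the words are wanted.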

aelor
  • @downvoter: your downvote doesn't mean anything unless you add a comment along with it. – aelor Nov 27 '14 at 12:12
  • Thank you for your answer, but what is the difference between .*? and .*? – david Nov 27 '14 at 12:12
  • `.*?` matches as few characters as possible, while `.*` matches as many as possible. For example, `a.*b` matches all of `aabbcccddb`, while `a.*?b` matches only up to the first b, i.e. `aab`. – aelor Nov 27 '14 at 12:14
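
The small example from the last comment can be reproduced directly. This is only a sketch of that one case, and the variable names are illustrative.

```python
import re

text = "aabbcccddb"

# Greedy: .* takes as much as it can, so the match ends at the last 'b'.
greedy = re.search(r"a.*b", text).group()

# Lazy: .*? takes as little as it can, so the match ends at the first 'b'.
lazy = re.search(r"a.*?b", text).group()

print(greedy)  # aabbcccddb
print(lazy)    # aab
```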
0

If you want to find the text between two specific strings, use lookahead and lookbehind assertions:

>>> re.findall(r'(?<=\<div\>)[\w ]+(?=\<\/div\>)',x)
['ad', ' correct value  ', ' wrong value ']
>>> re.findall(r'(?<=\<div\>)[\w ]+(?=\<\/div\>)',x)[1].strip()
'correct value'
Mazdak
0
ad</div>((?!<div>).)*<div>(((?!<\/div>).)*)</div>

You can try this as well. See the demo:

http://regex101.com/r/zY9xA3/3
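
For use from Python, the pattern above might be applied as in the sketch below. It assumes the `re.S` flag from the question is still wanted, because the text between the two divs contains newlines. The wanted text lands in group 2, since group 1 is the repeated single-character group before the second <div>.

```python
import re

x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

# Each ((?!...).)* consumes one character at a time, but only while the
# negative lookahead confirms that the closing delimiter does not start here.
pattern = re.compile(r'ad</div>((?!<div>).)*<div>(((?!</div>).)*)</div>', re.S)

match = pattern.search(x)
value = match.group(2)

print(repr(value))    # ' correct value  '
print(value.strip())  # correct value
```

Compared with the lazy version, this form cannot run past a <div> or </div> even if the rest of the pattern later fails and the engine backtracks.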

vks
0

Use a tool designed specifically for parsing HTML, such as BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """
>>> soup = BeautifulSoup(x, 'html.parser')
>>> divs = soup.find_all('div')
>>> # find the index of the div that follows the one whose text is 'ad'
>>> for i, div in enumerate(divs):
...     if div.string == 'ad':
...         count = i + 1
...         break
...
>>> count
1
>>> divs[count].string
' correct value  '
>>> divs[count].string.strip()
'correct value'
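
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module. This is an alternative to the answer's approach, not part of it. `DivTextCollector` is a hypothetical helper written for this example, and it assumes the divs are not nested.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of each <div> (hypothetical helper; nested divs are not handled)."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._buffer = None  # None while the parser is outside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._buffer is not None:
            self.texts.append("".join(self._buffer))
            self._buffer = None


x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

collector = DivTextCollector()
collector.feed(x)
collector.close()

texts = collector.texts
# Take the div that comes right after the one containing exactly 'ad'.
after_ad = texts[texts.index("ad") + 1].strip()
print(after_ad)  # correct value
```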
Avinash Raj