How to get the first occurrence? (regex, Python)

Question

I have this HTML snippet:

x=""" <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

I want to get the correct value.

So I search for the word ad followed by </div>, then anything up to the next <div>, and then capture everything up to the following </div>.

I use this code:

re.findall(r'ad</div>.*<div>(.*)</div>',x,re.S)

I use the flag re.S because I want the dot to match newlines as well. I don't know how many lines there are between the divs, so I use .*

I expected findall to return the correct value, but it returns the wrong value. Why? Why does it match the last div rather than the first one?

Please read this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 – Daniel Roseman, Nov 27 '14 at 12:19
@DanielRoseman so I shouldn't use regex to parse HTML; what do you suggest instead? – david, Nov 27 '14 at 12:36

score 6 · Accepted Answer · answered Nov 27 '14 at 12:09


Because the quantifiers in your pattern are greedy.

Try the lazy versions:

re.findall(r'ad</div>.*?<div>(.*?)</div>',x,re.S)

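In your example the first .* initially matches everything up to the end of the string. The regex then backtracks until it reaches a <div> that can still be followed by a closing </div>, and the first one it reaches this way is the last <div> in the string. The same happens with the second .*, which extends to the last </div>. The lazy .*? instead grows one character at a time, so it stops at the first <div> and the first </div>.

Demo here: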

http://regex101.com/r/zY9xA3/1
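
To see the difference concretely, here is a minimal sketch that runs both patterns on the string from the question. The variable names `greedy` and `lazy` are only for illustration.

```python
import re

x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

# Greedy: the first .* runs to the end of the string and backtracks
# only as far as the LAST <div>.
greedy = re.findall(r'ad</div>.*<div>(.*)</div>', x, re.S)

# Lazy: .*? stops at the FIRST <div>, and (.*?) stops at the first
# </div> that follows it.
lazy = re.findall(r'ad</div>.*?<div>(.*?)</div>', x, re.S)

print(greedy)           # [' wrong value ']
print(lazy)             # [' correct value  ']
print(lazy[0].strip())  # correct value
```

The captured text keeps its surrounding spaces, so a final `.strip()` is probably what you want.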

answered Nov 27 '14 at 12:09

aelor

@downvoter : your downvote doesn't mean anything unless you add a comment along with it – aelor Nov 27 '14 at 12:12
Thank you for your answer, but what is the difference between `.*?` and `.*` ? – david Nov 27 '14 at 12:12
`.*?` matches as few characters as possible, while `.*` matches as many as possible. For example, `a.*b` matches all of `aabbcccddb`, while `a.*?b` matches only up to the first b, i.e. `aab` – aelor Nov 27 '14 at 12:14
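
The example in the comment above can be checked directly. A small sketch, with illustrative variable names:

```python
import re

s = 'aabbcccddb'

# Greedy .* takes as much as it can and still ends on a 'b'.
greedy_match = re.search(r'a.*b', s).group()

# Lazy .*? takes as little as it can before the first 'b'.
lazy_match = re.search(r'a.*?b', s).group()

print(greedy_match)  # aabbcccddb
print(lazy_match)    # aab
```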

score 0 · Answer 2 · answered Nov 27 '14 at 12:21

If you want to find the text between two specific strings, use lookahead and lookbehind assertions:

>>> re.findall(r'(?<=\<div\>)[\w ]+(?=\<\/div\>)',x)
['ad', ' correct value  ', ' wrong value ']
>>> re.findall(r'(?<=\<div\>)[\w ]+(?=\<\/div\>)',x)[1].strip()
'correct value'

score 0 · Answer 3 · answered Nov 27 '14 at 12:22


ad</div>((?!<div>).)*<div>(((?!<\/div>).)*)</div>

You can also try this. In Python, pass re.S so that the dot also matches the newlines between the divs. See the demo:

http://regex101.com/r/zY9xA3/3
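
For reference, a hedged sketch of how this pattern might be used from Python. It is a variation rather than the exact pattern above: the inner groups are made non-capturing with `(?:...)` so that only the wanted text is captured, and the names `pattern`, `match` and `value` are assumptions for illustration.

```python
import re

x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

# (?:(?!<div>).)* consumes one character at a time, but only while the
# text ahead is not "<div>", so it cannot run past the next opening tag.
# The second group does the same with "</div>".
pattern = re.compile(
    r'ad</div>(?:(?!<div>).)*<div>((?:(?!</div>).)*)</div>',
    re.S,  # let the dot match the newlines between the divs
)

match = pattern.search(x)
value = match.group(1).strip() if match else None
print(value)  # correct value
```

Because each step checks what comes next, this gives the first occurrence without relying on lazy quantifiers, at the cost of a harder-to-read pattern.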

answered Nov 27 '14 at 12:22

vks

it's complicated !!! – david Nov 27 '14 at 12:27

score 0 · Answer 4 · answered Nov 27 '14 at 13:04

Use a tool designed specifically for parsing HTML, such as BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> x=""" <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """
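>>> soup = BeautifulSoup(x, 'html.parser')  # name the parser explicitly
>>> count = 0  # must be initialised before the loop
>>> for i, div in enumerate(soup.find_all('div')):  # 'div', not 'x', so the input string is not shadowed
...     if div.string == 'ad':
...         count = i + 1  # index of the div that follows 'ad'
...         break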

>>> count
1
>>> soup.find_all('div')[count].string
' correct value  '
>>> soup.find_all('div')[count].string.strip()
'correct value'
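
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser. This is only an illustration under simple assumptions (top-level divs, and an 'ad' div that is not the last one); the class name `DivTextCollector` and its attributes are made up for this example.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of every top-level <div>, in document order."""

    def __init__(self):
        super().__init__()
        self.div_texts = []  # text of each finished <div>
        self._depth = 0      # > 0 while inside a <div>
        self._parts = []     # text chunks of the current <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._depth == 0:
                self._parts = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.div_texts.append("".join(self._parts))

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


x = """ <div>ad</div>  \n\n <div> correct value  </div>  <div> wrong value </div>   """

collector = DivTextCollector()
collector.feed(x)
collector.close()

texts = [t.strip() for t in collector.div_texts]
# Raises ValueError if there is no 'ad' div, IndexError if it is the last one.
value_after_ad = texts[texts.index("ad") + 1]
print(value_after_ad)  # correct value
```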
