For such a simple task, which consists in ANALYZING the string rather than PARSING it (parsing = building a tree representation of the text), you can do the following:
The text:
ss = '''
Humpty Dumpty sat on a wall
<div class="class1">
Stock Number:
Z2079
<br>
**VIN:
2T2HK31UX9C110701**
<br>
Model Code:
9424
<img class="imgcert" src="/images/Lexus_cpo.jpg">
</div>
Humpty Dumpty had a great fall
<ul cat="zoo">
Stock Number:
ARDEN3125
<br>
**VIN:
SHAKAMOSK-230478-UBUN**
</br>
Model Code:
101
<img class="imgcert" src="/images/Magana_cpo.jpg">
</ul>
All the king's horses and all the king's men
<artifice>
<baradino>
Stock Number:
DERT5178
<br>
**VIN:
Pandaia-67-Moro**
<br>
Model Code:
1234
<img class="imgcert" src="/images/Pertuis_cpo.jpg">
</baradino>
what what what who what
<somerset who="maugham">
Nothing to declare
</somerset>
</artifice>
Couldn't put Humpty Dumpty again
<ending rtf="simi">
Stock Number:
ZZZ789
<br>
**VIN:
0000012554-ENDENDEND**
<br>
Model Code:
QS78-9
<img class="imgcert" src="/images/Sunny_cpo.jpg">
</ending>
qsdjgqsjkdhfqjkdhgfjkqshgdfkjqsdjfkh'''
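The code:
import re

# Backslashes are doubled because these are ordinary (non-raw) string literals.
regx = re.compile('<([^ >]+) ?([^>]*)>'
                  '(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)'
                  '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>',
                  re.DOTALL)

# One tuple per match: (tag name, tag attributes, VIN stripped of whitespace)
li = [(mat.group(1), mat.group(2), mat.group(3).strip(' \n\r\t'))
      for mat in regx.finditer(ss)]

for el in li:
    print('(%-15r, %-25r, %-25r)' % el)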
The result:
('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('baradino'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )
re.DOTALL
is needed so that the dot can also match a newline (by default, a dot in a regular expression pattern matches any character except a newline).
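As a small illustration of that flag, here is a sketch on a two-line string. The variable names `sample`, `without_dotall` and `with_dotall` are made up for this example only.

```python
import re

sample = 'first line\nsecond line'

# Without re.DOTALL the dot stops at the newline.
without_dotall = re.search('first.+', sample).group()

# With re.DOTALL the dot also consumes the newline and what follows it.
with_dotall = re.search('first.+', sample, re.DOTALL).group()

print(repr(without_dotall))
print(repr(with_dotall))
```

This is why the pattern above can run from an opening tag to its closing tag even though they sit on different lines.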
\\1
is a way to require that, at this point in the examined string, the same text appears as was captured by the first group, that is to say by the part ([^ >]+).
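To see the backreference on its own, here is a minimal sketch with a simplified pattern (not the one used above). The names `tag_pair`, `sample` and `mat` are only for this illustration.

```python
import re

# '\\1' in an ordinary string literal (r'\1' in a raw string) refers to group 1,
# so the closing tag must carry the same name as the opening tag.
tag_pair = re.compile('<([^ >]+)>(.*?)</\\1>', re.DOTALL)

sample = '<b>bold</i> still bold</b>'
mat = tag_pair.search(sample)

# The stray '</i>' is skipped because it does not repeat the captured name 'b'.
print(mat.group(1), repr(mat.group(2)))
```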
'(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)'
is a part that forbids finding a tag other than <br>
ahead of a <br>
located between an opening tag and the closing tag of the same HTML element.
This part is needed to capture the closest tag, other than <br>, that precedes the VIN.
If this part isn't present, the regex
regx = re.compile('<([^ >]+) ?([^>]*)>'
                  '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>',
                  re.DOTALL)
gives the following result:
('div'          , 'class="class1"'         , '2T2HK31UX9C110701'      )
('ul'           , 'cat="zoo"'              , 'SHAKAMOSK-230478-UBUN'  )
('artifice'     , ''                       , 'Pandaia-67-Moro'        )
('ending'       , 'rtf="simi"'             , '0000012554-ENDENDEND'   )
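The difference is 'artifice' instead of 'baradino': without the lookahead, the match starts at the outer tag rather than at the tag closest to the VIN.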
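The same comparison can be reproduced on a much smaller nested sample. This is only a sketch: the sample text and the names `OPEN_TAG`, `GUARD`, `VIN_PART` and `find_vins` are invented for the illustration, and the three pattern pieces are simply the regex above cut into parts.

```python
import re

# Hypothetical two-level sample: the VIN sits inside <inner>, itself inside <outer>.
sample = '''
<outer>
<inner>
Stock Number:
A1
<br>
**VIN:
XYZ-1**
<br>
</inner>
</outer>
'''

OPEN_TAG = '<([^ >]+) ?([^>]*)>'
GUARD = '(?!.+?<(?!br>)[^ >]+>.+?<br>.+?</\\1>)'
VIN_PART = '.*?\\*\\*VIN:(.+?)\\*\\*.+?</\\1>'

def find_vins(pattern, text):
    """Return (tag name, VIN) pairs for every match of pattern in text."""
    regx = re.compile(pattern, re.DOTALL)
    return [(mat.group(1), mat.group(3).strip()) for mat in regx.finditer(text)]

with_guard = find_vins(OPEN_TAG + GUARD + VIN_PART, sample)
without_guard = find_vins(OPEN_TAG + VIN_PART, sample)

print(with_guard)     # the innermost enclosing tag is reported
print(without_guard)  # the outer tag is reported instead
```

With the negative lookahead the match is rejected at `<outer>`, because another tag comes before the `<br>`, and succeeds at `<inner>`. Without it the regex is satisfied as soon as it reaches `<outer>`.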