I have a text in which only the <b> and </b> tags are used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after this <b>abcd efg-123</b> chunk.
How can I do that? What would be a suitable regular expression for this?
0

Hossein
-
2 obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Lie Ryan Oct 20 '10 at 13:46
4 Answers
3
This will get what's in between the tags:
>>> s = "1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...     if "<b>" in i:
...         print(i.split("<b>")[-1])
...
abcd efg-123

ghostdog74
1
This is actually a very naive version: it doesn't allow nested tags, and it requires whitespace between the tags and the surrounding words.
re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)
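To make the group layout concrete, here is a minimal sketch of how the match might be unpacked. The helper name words_around_bold and the sample text are only illustrative assumptions; the pattern is the one above.

```python
import re

# Same pattern as above: three words, the <b> chunk, three words.
PATTERN = r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"

def words_around_bold(text):
    """Return (words_before, bold_text, words_after), or None if nothing matches.

    Hypothetical helper, named here for illustration only.
    """
    match = re.search(PATTERN, text)
    if match is None:
        return None
    groups = match.groups()
    # Groups 1-3 are the words before, group 4 is the bold text,
    # groups 5-7 are the words after.
    return groups[:3], groups[3], groups[4:]

print(words_around_bold("one two three <b>abcd efg-123</b> four five six"))
# No whitespace next to the tags, so the \s+ parts fail and this returns None.
print(words_around_bold("1 2 3<b>abcd efg-123</b>one two three"))
```

If the text can have words glued directly to the tags, changing the \s+ next to <b> and </b> to \s* is one possible adjustment.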

eric_arthur_blair
1
Handles tags inside the <b>, unless they are themselves <b> tags, of course.
import re
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
    r'(((?:(?:^|\s)+\w+){3}\s*)'                # Match 3 words before
    r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'  # Match <b>...</b>; the alternation repeats so inner non-<b> tags are consumed
    r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)     # Match 3 words after
result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
            ' 1 2 3',
            'abcd efg-123',
            'word word2 word3 ')]
This should work and perform well, but if it gets any more advanced than this, you should consider using an HTML parser.

driax
-
This doesn't work if there are no words before or after, or fewer than 3 words, right? – Hossein Oct 20 '10 at 14:30
-
@Hossein That's correct. However, it is a simple change: replace {3} with {,3}. – driax Oct 20 '10 at 16:42
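A rough sketch of what that change could look like, under some assumptions: the function name bold_with_up_to_3_words is made up for illustration, the outer "whole match" group is dropped, and the inner part is simplified to [^<]* (no nested tags).

```python
import re

# {,3} means "between 0 and 3 repetitions", so fewer than three words
# (or none at all) on either side no longer prevents a match.
PATTERN_UP_TO_3 = (
    r'((?:(?:^|\s)+\w+){,3}\s*)'   # up to 3 words before
    r'<b>([^<]*)</b>'              # text between the tags (simplified)
    r'(\s*(?:\w+(?:\s+|$)){,3})'   # up to 3 words after
)

def bold_with_up_to_3_words(text):
    """Return a list of (before, inside, after) string tuples."""
    return re.findall(PATTERN_UP_TO_3, text)

print(bold_with_up_to_3_words('only two<b>abcd efg-123</b>one'))
print(bold_with_up_to_3_words('<b>x</b>'))
```

The before and after groups come back as raw strings with their whitespace; calling .split() on them gives the individual words.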
0
You should not use regexes for HTML parsing. That way madness lies.
The above-linked article actually provides a regex for your problem -- but don't use it.
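For comparison, here is a minimal sketch of a parser-based approach using html.parser from the Python 3 standard library. The class and function names (BoldChunks, bold_with_context) are invented for this example, and it assumes a "word" is simply a whitespace-separated token.

```python
from html.parser import HTMLParser

class BoldChunks(HTMLParser):
    """Split the document into [is_bold, text] chunks, in document order."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <b> tags are currently open
        self.chunks = []  # list of [is_bold, text]

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            if self.depth == 0:
                self.chunks.append([True, ""])  # start a new bold chunk
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1][1] += data  # text inside <b>, even within other tags
        else:
            self.chunks.append([False, data])

def bold_with_context(text, n=3):
    """Return (words_before, bold_text, words_after) for each <b> chunk."""
    parser = BoldChunks()
    parser.feed(text)
    parser.close()
    results = []
    for i, (is_bold, chunk) in enumerate(parser.chunks):
        if not is_bold:
            continue
        before = " ".join(t for _, t in parser.chunks[:i]).split()[-n:]
        after = " ".join(t for _, t in parser.chunks[i + 1:]).split()[:n]
        results.append((before, chunk, after))
    return results

print(bold_with_context("1 2 3<b>abcd efg-123</b>one two three"))
```

Because the words are taken with slicing, having fewer than three words on either side needs no special handling, and tags nested inside the <b> are ignored while their text is kept.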

Cœur

Joshua Fox