0

I have a text in which only <b> and </b> tags are used, for example <b>abcd efg-123</b>. How can I extract the string between these tags? I also need to extract the 3 words before and after this <b>abcd efg-123</b> chunk. How can I do that? What would be a suitable regular expression for this?

Hossein
  • 2
    obligatory: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Lie Ryan Oct 20 '10 at 13:46

4 Answers

3

This will get what's between the tags:

>>> s="1 2 3<b>abcd efg-123</b>one two three"
>>> for i in s.split("</b>"):
...   if "<b>" in i:
...      print(i.split("<b>")[-1])
...
abcd efg-123
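
The same string-splitting idea can probably be stretched to cover the "3 words before and after" part of the question. The following is only a sketch: bold_with_context is a made-up helper name, it assumes Python 3, it only looks at the first <b>...</b> pair, and it treats anything separated by whitespace as a word.

```python
def bold_with_context(text, n=3):
    """Return (words_before, bold_text, words_after) for the first <b>...</b>."""
    # partition() splits on the first occurrence and reports whether it was found
    before, open_tag, rest = text.partition("<b>")
    if not open_tag:
        return None  # no opening tag at all
    bold, close_tag, after = rest.partition("</b>")
    if not close_tag:
        return None  # opening tag without a matching close
    # keep at most n whitespace-separated words on each side
    return before.split()[-n:], bold, after.split()[:n]


s = "1 2 3<b>abcd efg-123</b>one two three"
print(bold_with_context(s))
# (['1', '2', '3'], 'abcd efg-123', ['one', 'two', 'three'])
```

If there are fewer than n words on one side, the slices simply return whatever is there.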
ghostdog74
1

This is actually a very naive version: it doesn't allow nested tags, and it requires whitespace between the tags and the surrounding words.

re.search(r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)", text)

See Python documentation.
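
For illustration, here is roughly how that pattern might be used and how the groups map to the pieces asked for. The sample strings and the names PATTERN, before, bold and after are assumptions made up for this sketch (Python 3).

```python
import re

# Groups 1-3: three words before, group 4: the bold text, groups 5-7: three words after
PATTERN = re.compile(
    r"(\w+)\s+(\w+)\s+(\w+)\s+<b>([^<]+)</b>\s+(\w+)\s+(\w+)\s+(\w+)"
)

text = "blah 1 2 3 <b>abcd efg-123</b> one two three blah"
match = PATTERN.search(text)
if match:
    groups = match.groups()
    before, bold, after = groups[:3], groups[3], groups[4:]
    print(before, bold, after)
    # ('1', '2', '3') abcd efg-123 ('one', 'two', 'three')

# Without whitespace around the tags the pattern does not match at all
no_space = PATTERN.search("1 2 3<b>abcd efg-123</b>one two three")
print(no_space)
# None
```

Replacing the \s+ next to the tags with \s* would presumably relax the whitespace requirement, at the cost of a looser match.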

1

Handles tags inside the <b>, unless they are themselves <b>, of course.

import re    
sometext = 'blah blah 1 2 3<b>abcd efg-123</b>word word2 word3 blah blah'
result = re.findall(
      r'(((?:(?:^|\s)+\w+){3}\s*)'                # Match 3 words before
      r'<b>((?:[^<]|<[^/]|</[^b]|</b[^>])*)</b>'  # Match <b>...</b>, allowing other tags inside
      r'(\s*(?:\w+(?:\s+|$)){3}))', sometext)     # Match 3 words after

result == [(' 1 2 3<b>abcd efg-123</b>word word2 word3 ',
    ' 1 2 3',
    'abcd efg-123',
    'word word2 word3 ')]

This should work and perform well, but if your needs get any more advanced than this, you should consider using an HTML parser.
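
As a rough idea of what the parser route could look like, here is a sketch built on html.parser from the Python 3 standard library. BoldExtractor and extract_bold are hypothetical names invented for this example, and it only collects the text inside <b> elements, not the surrounding words.

```python
from html.parser import HTMLParser


class BoldExtractor(HTMLParser):
    """Collect the text content of every top-level <b>...</b> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <b> elements are currently open
        self.chunks = []     # finished bold strings
        self._current = []   # text pieces of the bold element being read

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.chunks.append("".join(self._current))
                self._current = []

    def handle_data(self, data):
        # text inside other tags nested in <b> still arrives here
        if self.depth:
            self._current.append(data)


def extract_bold(text):
    parser = BoldExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return parser.chunks


print(extract_bold("1 2 3<b>abcd <i>efg</i>-123</b>word <b>x</b>"))
# ['abcd efg-123', 'x']
```

Nested <b> elements are folded into the outer one here, which is a design choice of this sketch rather than anything the parser requires.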

driax
0

You should not use regexes for HTML parsing. That way madness lies.

The above-linked article actually provides a regex for your problem -- but don't use it.

Cœur
Joshua Fox