-2

I'd like to identify the characters within a string that are located relative to a string I search for.

In other words, if I search for 'Example Text' in the string below, I'd like to identify the characters that come immediately before and after 'Example Text' and are enclosed in '<' and '>'.

For example, if I searched the below string for 'Example Text', I'd like the function to return <h3> and </h3>, since those are the characters that come immediately before and after it.

String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
Chris
  • 2
    How do you define "immediate characters that come before and after"? `<h3>` is 4 characters, while `</h3>` is 5. What if it was ``?
    – MattDMo Jul 26 '15 at 23:44
  • 4
    This sounds like an XY problem. You should use an `xml` parser like `lxml` and do an xpath search by text. Any kind of html parsing with regex will only end in tears. – Slater Victoroff Jul 26 '15 at 23:44
  • 2
    why are you parsing html with a regex? – Padraic Cunningham Jul 26 '15 at 23:46
  • @PadraicCunningham, I didn't realize html shouldn't be parsed with regex. I'll leave regex as a tag in case anyone makes the same mistake. – Chris Jul 27 '15 at 02:53

4 Answers

1

I do not believe you are asking the right question here. I think what you're actually aiming for is:

Given a piece of text, how can I capture the HTML element that encapsulates it?

Very different problem and one that should NEVER be solved with a regex. If you want to know why, just google it.

As for that other question, capturing the relevant HTML tag, I would recommend using lxml. The docs can be found here. For your use case you could do the following:

>>> from lxml import etree
>>> from io import StringIO  # Python 3 (on Python 2: from StringIO import StringIO)

>>> your_string = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"

>>> parser = etree.HTMLParser()
>>> document = etree.parse(StringIO(your_string), parser)
>>> elements = document.xpath('//*[text()="Example Text"]')

>>> elements[0].tag
'h3'
Chris
Slater Victoroff
0

Reasons to not use regex:

  • Difficulty in defining number of characters to return before and after match.
  • If you match for tags, what do you do if the searched-for text is not immediately surrounded by tags?
  • Obligatory: Tony the Pony says so
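
The first two bullets can be made concrete with a short sketch. It reuses the tag-matching pattern suggested elsewhere on this page, `(<[^>]*>)Example Text(<[^>]*>)`. The three input strings are hypothetical, made up only to show where the approach stops working:

```python
import re

# A tag, then the literal text, then another tag.
PATTERN = r"(<[^>]*>)Example Text(<[^>]*>)"

# Hypothetical inputs (not from the question), chosen to illustrate the bullets above.
tight = "<h3>Example Text</h3>"
padded = "<h3> Example Text </h3>"           # whitespace between tag and text
nested = "<h3><em>Example Text</em></h3>"    # text wrapped in an inner tag

tight_match = re.search(PATTERN, tight)
padded_match = re.search(PATTERN, padded)
nested_match = re.search(PATTERN, nested)

print(tight_match.groups())   # ('<h3>', '</h3>')
print(padded_match)           # None: the text is not immediately surrounded by tags
print(nested_match.groups())  # ('<em>', '</em>'): the inner tag, not the <h3>
```

Whether `<em>` or `<h3>` is the "right" answer in the last case depends on what you actually want, which is exactly the ambiguity a parser lets you resolve explicitly.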

If you're parsing HTML/XML, use an HTML/XML parser. lxml is a good one. I personally prefer BeautifulSoup: it can use lxml for some of its heavy lifting, but it has other features as well and is more user-friendly, especially for quick matches.
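
If installing a third-party parser is not an option, the same idea can be roughed out with the standard library's `html.parser`. This is only a sketch, not a replacement for lxml or BeautifulSoup: the class name `EnclosingTagFinder`, its attributes, and the short `VOID_TAGS` set are assumptions made for illustration, and it only looks at text nodes that match the target exactly after stripping whitespace.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (abbreviated, illustrative list).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EnclosingTagFinder(HTMLParser):
    """Record the innermost open tag around each text node equal to `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.open_tags = []   # stack of currently open tag names
        self.found = []       # enclosing tag name per match (None at top level)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray end tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() == self.target:
            self.found.append(self.open_tags[-1] if self.open_tags else None)


html = ("</div><p></p> Random Other Text <h3>Example Text</h3>"
        "<h3>Coachella Valley Music &amp; Arts Festival</h3>"
        "<strong>Random Text</strong>:Random Date<br/>")

finder = EnclosingTagFinder("Example Text")
finder.feed(html)
finder.close()
print(finder.found)  # ['h3']
```

The stray `</div>` at the start of the fragment is simply ignored here, which is the kind of tolerance for messy input that a regex does not give you.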

Community
MattDMo
  • @SlaterTyranus as mentioned elsewhere, your code breaks with valid HTML fragments but invalid XML. – MattDMo Jul 26 '15 at 23:58
  • @heinst you are absolutely wrong in your last comment. Your answer is not robust. **That** is why I downvoted. If it had worked, I wouldn't have done that. Whining about SO sucking because it's working as intended is pointless. – MattDMo Jul 27 '15 at 00:09
  • @MattDMo I'd also like to note that this isn't really an answer... I assume you're going to delete it or add an answer to it? – Slater Victoroff Jul 27 '15 at 00:09
  • @Slater and why isn't this an answer? – MattDMo Jul 27 '15 at 00:09
0

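I believe this can be done with BeautifulSoup:

from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` package is Python 2 only

String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(String, "html.parser")

search_text = 'Example Text'  # avoid shadowing the built-in input()
for elem in soup.find_all(string=search_text):
    # Print the enclosing element with the searched-for text removed
    print(str(elem.parent).replace(search_text, ''))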
Slater Victoroff
galaxyan
-2

You can use the regex <[^>]*> to match a tag, then use groups defined with parentheses to separate your match into the blocks that you want:

import re

m = re.search("(<[^>]*>)Example Text(<[^>]*>)", String)
m.groups()
Out[7]: ('<h3>', '</h3>')
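
If you do go the regex route, a slightly more defensive version might look like the sketch below. The helper name `surrounding_tags` is made up for illustration; it escapes the search text with `re.escape` and returns `None` instead of raising when nothing matches. It still only works when the text sits directly between two tags, so the caveats raised in the comments and other answers apply unchanged.

```python
import re


def surrounding_tags(html, text):
    """Return (tag_before, tag_after) directly around `text`, or None if no match."""
    # re.escape keeps characters such as '.', '(' or '&' in `text` from being
    # read as regex syntax.
    pattern = r"(<[^>]*>)" + re.escape(text) + r"(<[^>]*>)"
    match = re.search(pattern, html)
    return match.groups() if match else None


String = ("</div><p></p> Random Other Text <h3>Example Text</h3>"
          "<h3>Coachella Valley Music &amp; Arts Festival</h3>"
          "<strong>Random Text</strong>:Random Date<br/>")

print(surrounding_tags(String, "Example Text"))       # ('<h3>', '</h3>')
print(surrounding_tags(String, "Random Other Text"))  # None: surrounded by spaces, not tags
```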
maxymoo