Identify characters in a string by their relative position to a searched string?

Question

I'd like to identify characters in a string by their position relative to a substring I search for.

In other words, if I search for 'Example Text' in the string below, I'd like to identify the characters that come immediately before and after 'Example Text' and are enclosed in '<' and '>'.

For example, if I searched the below string for 'Example Text', I'd like the function to return <h3> and </h3>, since those are the characters that come immediately before and after it.

String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"

How do you define "immediate characters that come before and after"? `<h3>` is 4 characters, while `</h3>` is 5. What if it was ``? — MattDMo, Jul 26 '15 at 23:44
This sounds like an XY problem. You should use an `xml` parser like `lxml` and do an xpath search by text. Any kind of html parsing with regex will only end in tears. — Slater Victoroff, Jul 26 '15 at 23:44
@PadraicCunningham, I didn't realize html shouldn't be parsed with regex. I'll leave regex as a tag in case anyone makes the same mistake. — Chris, Jul 27 '15 at 02:53

score 1 · Accepted Answer · edited Jul 27 '15 at 03:16

I do not believe you are asking the right question here. I think what you're actually aiming for is:

Given a piece of text, how can I capture the HTML element that encapsulates it?

That is a very different problem, and one that should NEVER be solved with a regex. If you want to know why, just google it.

As for that other question, capturing the relevant HTML tag, I would recommend using lxml. The docs can be found here. For your use case you could do the following:

>>> from lxml import etree
>>> from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO

>>> your_string = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"

>>> parser = etree.HTMLParser()
>>> document = etree.parse(StringIO(your_string), parser)
>>> elements = document.xpath('//*[text()="Example Text"]')

>>> elements[0].tag
'h3'

I'm getting the following error: NameError: global name 'StringIO' is not defined — Chris, Jul 27 '15 at 02:43

score 0 · Answer 2 · edited May 23 '17 at 12:14

Reasons to not use regex:

Difficulty in defining number of characters to return before and after match.
If you match for tags, what do you do if the searched-for text is not immediately surrounded by tags?
Obligatory: Tony the Pony says so

If you're parsing HTML/XML, use an HTML/XML parser. lxml is a good one. I personally prefer BeautifulSoup: it can use lxml for some of its heavy lifting, has other features as well, and is more user-friendly, especially for quick matches.
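
For readers who cannot install a third-party parser, here is a minimal sketch of the same idea using only the standard library's `html.parser`. The class `EnclosingTagFinder`, the function `enclosing_tags`, and the `VOID_TAGS` set are hypothetical names introduced for illustration, and the void-tag list is deliberately incomplete. It assumes the searched text is the complete text between two tags, and it reports the bare tag name without attributes.

```python
from html.parser import HTMLParser

# Assumption: a short, incomplete list of tags that never get a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class EnclosingTagFinder(HTMLParser):
    """Hypothetical helper: records the tag that directly wraps a target text."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []    # currently open tags, innermost last
        self.matches = []  # (opening, closing) pairs found so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers
        # such as the leading </div> in the question's string.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() == self.target and self.stack:
            tag = self.stack[-1]
            self.matches.append((f"<{tag}>", f"</{tag}>"))


def enclosing_tags(html, target):
    finder = EnclosingTagFinder(target)
    finder.feed(html)
    finder.close()
    return finder.matches


html = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"
print(enclosing_tags(html, "Example Text"))  # [('<h3>', '</h3>')]
```

Because the parser tracks open tags on a stack, the stray `</div>` and the self-closing `<br/>` do not confuse it, which is the kind of robustness a regex does not give you for free.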

edited May 23 '17 at 12:14 by Community · answered Jul 26 '15 at 23:52 by MattDMo

@SlaterTyranus as mentioned elsewhere, your code breaks with valid HTML fragments but invalid XML. – MattDMo Jul 26 '15 at 23:58
@heinst you are absolutely wrong in your last comment. Your answer is not robust. **That** is why I downvoted. If it had worked, I wouldn't have done that. Whining about SO sucking because it's working as intended is pointless. – MattDMo Jul 27 '15 at 00:09
@MattDMo I'd also like to note that this isn't really an answer... I assume you're going to delete it or add an answer to it? – Slater Victoroff Jul 27 '15 at 00:09
@Slater and why isn't this an answer? – MattDMo Jul 27 '15 at 00:09

score 0 · Answer 3 · edited Jul 27 '15 at 00:12

I believe it can be done with BeautifulSoup:

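from bs4 import BeautifulSoup  # BeautifulSoup 4 (the old `BeautifulSoup` module is version 3)

String = "</div><p></p> Random Other Text <h3>Example Text</h3><h3>Coachella Valley Music &amp; Arts Festival</h3><strong>Random Text</strong>:Random Date<br/>"

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(String, "html.parser")

search_text = 'Example Text'  # avoid the name `input`, which shadows a built-in
# Find each text node equal to search_text and print its parent tag without the text
for elem in soup.find_all(string=search_text):
    print(str(elem.parent).replace(search_text, ''))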

edited Jul 27 '15 at 00:12 by Slater Victoroff · answered Jul 27 '15 at 00:09 by galaxyan

score -2 · Answer 4 · answered Jul 26 '15 at 23:46

You can use the regex <[^>]*> to match a tag, then use groups defined with parentheses to separate your match into the blocks that you want:

import re

# String is the HTML snippet from the question
m = re.search(r"(<[^>]*>)Example Text(<[^>]*>)", String)
m.groups()
# ('<h3>', '</h3>')
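
The caveats raised elsewhere in this thread can be seen by running the same pattern against a few variations. This is only an illustrative sketch: `surrounding_tags` is a hypothetical wrapper around the regex above, and the sample inputs are made up for the demonstration.

```python
import re


def surrounding_tags(html, text):
    """Hypothetical wrapper: return the tags directly adjacent to `text`, or None."""
    pattern = r"(<[^>]*>)" + re.escape(text) + r"(<[^>]*>)"
    m = re.search(pattern, html)
    return m.groups() if m else None


# Works when the text sits directly between two tags.
print(surrounding_tags("<h3>Example Text</h3>", "Example Text"))
# ('<h3>', '</h3>')

# Whitespace (or any other text) between the tag and the match defeats it.
print(surrounding_tags("<h3> Example Text </h3>", "Example Text"))
# None

# The two captured tags need not belong to the same element.
print(surrounding_tags("<h3>Example Text<br/>more</h3>", "Example Text"))
# ('<h3>', '<br/>')
```

The last case is the subtle one: the pattern returns a result, but the "closing" tag is just whatever tag happens to follow the text.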

answered Jul 26 '15 at 23:46 by maxymoo
