Extract original string position from beautifulsoup element

Question

When parsing long, complicated HTML documents with BeautifulSoup, it's sometimes useful to get the exact position in the original string where an element was matched. I can't simply search for the string, as there may be multiple matching elements and I would lose bs4's ability to parse the DOM. Given this minimal working example:

import bs4

html = "<div><b>Hello</b>  <i>World</i></div>"
soup = bs4.BeautifulSoup(html,'lxml')

# Returns 22
print(html.find("World"))

# How to get this to return 22?
print(soup.find("i", text="World"))

How can I get the element extracted by bs4 to return 22?

Of possible interest: [Get position/line number - Implemented?](https://groups.google.com/forum/#!topic/beautifulsoup/rF1gnwsd2e8), and SO Q&A [Obtaining position info when parsing HTML in Python](https://stackoverflow.com/q/28728498/2823755) — wwii, Jan 12 '18 at 17:14
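
On the "implemented?" thread linked above: as far as I know, Beautiful Soup 4.8.1 and later record where each tag was found when the built-in html.parser is used, via `Tag.sourceline` (1-based line) and `Tag.sourcepos` (0-based column of the start tag's `<`). The sketch below converts that pair into an absolute index. `absolute_offset` is a hypothetical helper name, not part of bs4; the docs describe these attributes only for html.parser and html5lib (which reports the closing `>` instead), and the result points at the start tag, so the tag's length is added to reach the text.

```python
import bs4

def absolute_offset(source, tag):
    """Hypothetical helper: turn (sourceline, sourcepos) into an index into source."""
    # Assumes bs4 >= 4.8.1 and the "html.parser" builder: sourceline is 1-based,
    # sourcepos is the 0-based column of the start tag's "<".
    preceding = source.split("\n")[: tag.sourceline - 1]
    return sum(len(line) + 1 for line in preceding) + tag.sourcepos

html = "<div><b>Hello</b>  <i>World</i></div>"
soup = bs4.BeautifulSoup(html, "html.parser")

tag = soup.find("i", string="World")
tag_start = absolute_offset(html, tag)    # index of "<i>"
text_start = tag_start + len("<i>")       # skip the (attribute-free) start tag
print(tag_start, text_start)              # 19 22
```

For a tag with attributes the start tag is longer than `len("<i>")`, so the text offset would have to be found by scanning forward from `tag_start` to the first `>`.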

score 1 · Answer 1 · answered Aug 17 '18 at 13:14

I understand your problem is that "World" might appear many times, but you want the position of a specific occurrence (one that you, somehow, know how to identify).

You can use this workaround. I bet there are more elegant solutions, but this should do it:

Given this html:

import bs4

html = """<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>Foo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>"""

soup = bs4.BeautifulSoup(html,'html.parser')

If we want to obtain the position of the Foo World occurrence, we can:

1. Get the tag
2. Insert a unique string that we know is not present in the rest of the html
3. Get the position of the string we added

import bs4

html = """<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>Foo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>"""

soup = bs4.BeautifulSoup(html,'html.parser')

# 1. Get the tag
desired_tag = soup.find("i", text="Foo World")
# 2. Insert the unique string at the start of the tag's contents
desired_tag.insert(0, "some_unique_string")

print(str(soup))
"""
Will show (html.parser keeps the original whitespace):
<div><b>Hello</b>  <i>World</i></div>
          <div><b>Hello</b>  <i>some_unique_stringFoo World</i></div>
          <div><b>Hello</b>  <i>Bar World</i></div>
"""

# 3. Get the position of the string we added
print(str(soup).find("some_unique_string"))
"""
70
"""
