I've written a recursive function that returns the XPaths of all the text nodes inside a tag, as a dictionary in the following format:
{'xpath1': {'text': 'text1'}, 'xpath2': {'text': 'text2'}, ...}
Code:
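from bs4 import BeautifulSoup, NavigableString

def get_xpaths_dict(soup, xpaths=None, curr_path=''):
    # Use None as the default so each top-level call starts with a fresh dict
    # (a mutable default like {} would be shared between calls).
    if xpaths is None:
        xpaths = {}
    curr_path += '/{}'.format(soup.name)
    for item in soup.contents:
        if isinstance(item, NavigableString):
            text = item.strip()
            if text:  # skip whitespace-only strings
                try:
                    # Path already seen: bump its counter and index this occurrence.
                    xpaths[curr_path]['count'] += 1
                    count = xpaths[curr_path]['count']
                    curr_path += '[{}]'.format(count)
                    xpaths[curr_path] = {'text': text}
                except KeyError:
                    # First text found at this path.
                    xpaths[curr_path] = {'text': text, 'count': 1}
        else:
            # A child tag: recurse, extending the current path.
            xpaths = get_xpaths_dict(item, xpaths, curr_path)
    return xpaths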
html = '''<div>
    text of div 1
    <span>
        text of span 1.1
        <span>
            text of span 2.1
        </span>
        <span>
            text of span 2.2
            <span>
                text of span 3
            </span>
        </span>
    </span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
xpaths = get_xpaths_dict(soup.div)
print(xpaths)
Output:
{'/div': {'text': 'text of div 1', 'count': 1}, '/div/span': {'text': 'text of span 1.1', 'count': 1}, '/div/span/span': {'text': 'text of span 2.1', 'count': 2}, '/div/span/span[2]': {'text': 'text of span 2.2'}, '/div/span/span[2]/span': {'text': 'text of span 3', 'count': 1}}
I know this isn't the output format you were expecting, but you can convert it into any format you want. For example, to get your expected output, simply do the following:
expected_output = [(v['text'], k) for k, v in xpaths.items()]
print(expected_output)
Output:
[('text of div 1', '/div'), ('text of span 1.1', '/div/span'), ('text of span 2.1', '/div/span/span'), ('text of span 2.2', '/div/span/span[2]'), ('text of span 3', '/div/span/span[2]/span')]
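Other target formats follow the same pattern. As a small sketch (the names text_to_xpath and as_json are made up for illustration, and the sample dictionary is just two entries copied from the output above), here is a reverse lookup from text to XPath and a JSON dump without the bookkeeping key:

```python
import json

# Two entries in the same shape as the dictionary printed above.
xpaths = {
    '/div': {'text': 'text of div 1', 'count': 1},
    '/div/span/span[2]': {'text': 'text of span 2.2'},
}

# Reverse lookup: text -> XPath (assumes every text is unique).
text_to_xpath = {v['text']: k for k, v in xpaths.items()}

# Keep only the text for each XPath, dropping 'count', then serialize.
as_json = json.dumps({k: v['text'] for k, v in xpaths.items()}, indent=2)

print(text_to_xpath['text of span 2.2'])
print(as_json)
```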
Some explanation:
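The extra key count in the dictionary records how many times text has been found at the same path so far (in this example, that is the number of same-named tags seen under the same parent). From the second occurrence on, the count is appended to the path as an index such as [2]. Storing the results in a dictionary keeps each lookup and update constant time on average, and each tag is visited only once.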
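Note that this counter only advances when a tag actually contains text, so a same-named sibling without text would not be counted. If that matters for your HTML, an alternative is to compute the index from the tag's position among its same-named siblings. The following is only a rough sketch of that different technique, not the function above; xpath_of and text_xpaths are hypothetical helper names, and unlike the output above it writes span[1] for the first of several same-named siblings:

```python
from bs4 import BeautifulSoup, NavigableString

def xpath_of(tag):
    """Hypothetical helper: XPath of `tag`, indexed by sibling position."""
    parts = []
    node = tag
    # Stop at the BeautifulSoup document object, which has no parent.
    while node.parent is not None:
        before = len(node.find_previous_siblings(node.name))
        has_twin = before > 0 or node.find_next_sibling(node.name) is not None
        # XPath positions are 1-based; the index is omitted for unique names.
        parts.append('{}[{}]'.format(node.name, before + 1) if has_twin else node.name)
        node = node.parent
    return '/' + '/'.join(reversed(parts))

def text_xpaths(root):
    """Hypothetical helper: (text, xpath) pairs for non-empty strings under `root`."""
    return [(s.strip(), xpath_of(s.parent))
            for s in root.descendants
            if isinstance(s, NavigableString) and s.strip()]

html = '<div>a<span>b<span>c</span><span>d<span>e</span></span></span></div>'
soup = BeautifulSoup(html, 'html.parser')
pairs = text_xpaths(soup.div)
print(pairs)
```

This walks up the parents for every string, so it does more work per text node than the single-pass dictionary version; it trades speed for indices that do not depend on which tags happen to contain text.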
Bonus:
Since the function returns a dictionary keyed by XPath, you can look up any text by its XPath. For example:
xpaths = get_xpaths_dict(soup.div)
print(xpaths['/div/span/span[2]/span']['text'])
# text of span 3
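A note on calling the function more than once, as in the snippet above: the accumulator dictionary has to start empty on every top-level call. Here is a minimal, library-free sketch of what happens when a mutable default such as {} is shared between calls, next to the usual None sentinel pattern that avoids it (collect_shared and collect_fresh are made-up names for illustration):

```python
def collect_shared(item, seen={}):
    # The default dict is created once, when the function is defined,
    # so every call that omits `seen` mutates the same object.
    seen[item] = seen.get(item, 0) + 1
    return seen

def collect_fresh(item, seen=None):
    # A None sentinel gives each top-level call its own new dict.
    if seen is None:
        seen = {}
    seen[item] = seen.get(item, 0) + 1
    return seen

first = dict(collect_shared('/div'))   # copy to snapshot the state after call 1
second = dict(collect_shared('/div'))  # the count from call 1 has leaked in
print(first, second)

fresh_first = collect_fresh('/div')
fresh_second = collect_fresh('/div')
print(fresh_first, fresh_second)
```

With a shared default, a second run over the same HTML would find every path already present and start appending indices like [2] to paths that occur only once.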