How to find multiple strings between tag/sub-strings?

Question

I have a string which has defined tags around specific words or sub-strings. For example:

text = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

How can I get the strings <xxx>ibis and the</xxx>, <ccc>NW</ccc>, <sss>Jan</sss> and <hhh>10</hhh>? The tag names can be anything, but the opening and closing tags around a word or a few words will match. Also, if a start or end tag is missing, I don't want that string to be returned. For example:

text = 'Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

In this case, only <sss>Jan</sss> and <hhh>10</hhh> should be returned.

Why is this tagged [tag:nsregularexpression]? Are you running Python on iOS or something? — Mast, Jul 29 '19 at 12:01

Andrej Kesely · Accepted Answer · 2019-07-31T08:23:31.957


Generally, you don't want to use regex to parse (X)HTML (more info in this answer). A better option is to use a parser. This example uses BeautifulSoup:

data = '''Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag.get_text(strip=True))

Prints:

ibis and the
NW
Jan
10

EDIT: To get the whole tag string:

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag)

Prints:

<xxx>ibis and the</xxx>
<ccc>NW</ccc>
<sss>Jan</sss>
<hhh>10</hhh>

EDIT II: If you have a list of tags to find:

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    print(tag)

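EDIT III: In case of malformed HTML, it's necessary to change the parser (here to lxml). The unclosed tag then ends up wrapping the tags that follow it, so the loop skips any tag that contains another listed tag: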

data = '''Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    if tag.find_all(list_of_tags):
        continue
    print(tag)

Prints:

<sss>Jan</sss>
<hhh>10</hhh>
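
As a side note, for the flat case shown in the question (no nesting, no attributes) a standard-library sketch with a regex backreference may also be enough. This is not a general HTML parser, and the names `TAG_PAIR` and `find_complete_tags` are made up for illustration:

```python
import re

# Assumption: tags are flat (not nested) and carry no attributes.
# \1 is a backreference: the closing tag name must equal the opening one.
TAG_PAIR = re.compile(r'<(\w+)>([^<]*)</\1>')

def find_complete_tags(text):
    """Return each '<name>...</name>' substring whose open and close names match."""
    return [m.group(0) for m in TAG_PAIR.finditer(text)]

good = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'
bad = 'Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

print(find_complete_tags(good))
print(find_complete_tags(bad))
```

Because `[^<]*` cannot cross another `<`, the unclosed <xxx> in the second string never finds a matching </xxx> and is dropped, as is the stray </ccc>. For nested or attribute-carrying markup, the parser approach above is the safer choice.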

edited Jul 31 '19 at 08:23

answered Jul 26 '19 at 11:48

Andrej Kesely


Your answer is correct. But I am sorry I should have asked my question in another way. I edited it now, could you please check it? – idkman Jul 26 '19 at 11:50
How do I give the names of the tags? When I pass them as a list I get `TypeError: unhashable type: 'list'`. – idkman Jul 26 '19 at 12:12

@Dennis.M Use the `find_all()` method; see my answer. The documentation for BeautifulSoup can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – Andrej Kesely Jul 26 '19 at 12:27
How to avoid parsing the string if one of the tags is missing? – idkman Jul 31 '19 at 08:11
@Dennis.M What do you mean? Do you want to return all tags or nothing if just one is missing? – Andrej Kesely Jul 31 '19 at 08:13
