I have a string which has defined tags around specific words or sub-strings. For example:

text = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

How can I get the strings <xxx>ibis and the</xxx>, <ccc>NW</ccc>, <sss>Jan</sss> and <hhh>10</hhh>? The tag names can be anything, but the opening and closing tags around a word or phrase will always match. Also, if a start or end tag is missing, I don't want that string to be returned. For example:

text = 'Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

In this case, only <sss>Jan</sss> and <hhh>10</hhh> should be returned.

idkman

1 Answer

Generally, you don't want to use regex to parse (X)HTML (more info in this answer). A better option is to use a parser. This example uses BeautifulSoup:

data = '''Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag.get_text(strip=True))

Prints:

ibis and the
NW
Jan
10

EDIT: To get the whole tag as a string:

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag)

Prints:

<xxx>ibis and the</xxx>
<ccc>NW</ccc>
<sss>Jan</sss>
<hhh>10</hhh>

EDIT II: If you have a list of tags to find:

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    print(tag)

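EDIT III: With malformed HTML it is necessary to change the parser (the `lxml` parser requires the lxml package to be installed). The loop below then skips any tag that itself contains another listed tag, which is how the unclosed <xxx> is dropped: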

data = '''Bring me to <xxx>ibis and the in NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    if tag.find_all(list_of_tags):
        continue
    print(tag)

Prints:

<sss>Jan</sss>
<hhh>10</hhh>
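
When the tag names are not known in advance, a similar result can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: `PairedTagCollector` and `paired_tags` are hypothetical names introduced here, and it assumes flat (non-nested) tags without attributes, as in the question. A span is kept only if the end tag matches the most recent start tag.

```python
from html.parser import HTMLParser


class PairedTagCollector(HTMLParser):
    """Collect <tag>text</tag> spans whose start and end tags match.

    Assumption: tags are flat (not nested) and carry no attributes.
    """

    def __init__(self):
        super().__init__()
        self.open_tag = None  # name of the most recent unclosed start tag
        self.buffer = []      # text seen since that start tag
        self.found = []       # reconstructed, properly paired spans

    def handle_starttag(self, tag, attrs):
        # A new start tag discards any earlier tag that was never closed.
        self.open_tag = tag
        self.buffer = []

    def handle_data(self, data):
        # Text outside of any open tag is ignored.
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Keep the span only if this end tag closes the pending start tag.
        if tag == self.open_tag:
            inner = ''.join(self.buffer)
            self.found.append(f'<{tag}>{inner}</{tag}>')
        self.open_tag = None
        self.buffer = []


def paired_tags(text):
    """Return the properly paired spans found in text, in document order."""
    parser = PairedTagCollector()
    parser.feed(text)
    parser.close()
    return parser.found


malformed = ('Bring me to <xxx>ibis and the in NW</ccc> and the '
             '<sss>Jan</sss>\n<hhh>10</hhh>')
print(paired_tags(malformed))
```

For the malformed string from the question this prints ['<sss>Jan</sss>', '<hhh>10</hhh>']. Note that `HTMLParser` lower-cases tag names, so the reconstructed spans use lower-case tags.
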
Andrej Kesely
  • Your answer is correct, but I'm sorry, I should have asked my question differently. I have edited it now; could you please check it? – idkman Jul 26 '19 at 11:50
  • How do I pass the names of the tags? When I give them as a list I get `TypeError: unhashable type: 'list'`. – idkman Jul 26 '19 at 12:12
  • @Dennis.M Use the `find_all()` method; see my answer. The documentation for BeautifulSoup can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – Andrej Kesely Jul 26 '19 at 12:27
  • How can I avoid parsing the string if one of the tags is missing? – idkman Jul 31 '19 at 08:11
  • @Dennis.M What do you mean? Do you want to return all tags or nothing if just one is missing? – Andrej Kesely Jul 31 '19 at 08:13
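
For the all-or-nothing reading raised in the last comment, one possible sketch (standard library only) checks that every start tag is closed by a matching end tag before any extraction is attempted. `BalanceChecker` and `tags_balanced` are hypothetical names, not part of BeautifulSoup, and the sketch assumes every tag is meant to be closed explicitly, so void HTML elements such as <br> would count as unbalanced.

```python
from html.parser import HTMLParser


class BalanceChecker(HTMLParser):
    """Track whether every start tag is closed by a matching end tag."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tag names
        self.balanced = True  # flips to False on the first mismatch

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # An end tag must close the most recently opened tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False


def tags_balanced(text):
    """Return True only if all tags in text are properly paired."""
    checker = BalanceChecker()
    checker.feed(text)
    checker.close()
    # Leftover entries on the stack are start tags that were never closed.
    return checker.balanced and not checker.stack


good = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc>'
bad = 'Bring me to <xxx>ibis and the in NW</ccc>'
print(tags_balanced(good), tags_balanced(bad))  # True False
```

A caller could run this check first and only pass the string to the parser when it returns True.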