I need to extract the text between the <b> and <strong> tags using regex in Python.

Example: Customizable:<strong>Features Windows 10 Pro</strong> and legacy ports <b>including VGA,</b> HDMI, RJ-45, USB Type A connections.

For this I am doing:

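import re

# Non-greedy matches for the text inside <b>...</b> or <strong>...</strong>
pattern = re.compile(r"(<b>(.*?)</b>)|(<strong>(.*?)</strong>)")
for label in labels:
    print(label)
    # 'Window' in label is also true for 'Windows', so one check is enough
    if 'Window' in label and ('<b>' in label or '<strong>' in label):
        text = pattern.findall(label)
        print(text)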

where labels is a list of HTML strings containing such tags. The expected output is ['Features Windows 10','including VGA,']

Instead I am getting the output: [('', 'Features Windows 10 Pro'), ('including VGA,', '')]

Please help. Thanks in advance.

DeepSpace

2 Answers

How about using BeautifulSoup?

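from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not warn about guessing one
data = BeautifulSoup("""Customizable:<strong>Features Windows 10 Pro</strong> and legacy ports <b>including VGA,</b> HDMI, RJ-45, USB Type A connections""", "html.parser")

# Text of the first <strong> element, then of the first <b> element
data.find_all('strong')[0].text
data.find_all('b')[0].text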

Output

'Features Windows 10 Pro'
'including VGA,'
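
If installing bs4 is not an option, the same idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not a replacement for BeautifulSoup: the class name TagTextCollector and its attributes are made up for this example, and it assumes the fragment is reasonably well formed.

```python
from html.parser import HTMLParser

label = ("Customizable:<strong>Features Windows 10 Pro</strong> and legacy "
         "ports <b>including VGA,</b> HDMI, RJ-45, USB Type A connections.")


class TagTextCollector(HTMLParser):
    """Hypothetical helper: collect the text inside <b> and <strong>."""

    WANTED = {"b", "strong"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of wanted tags currently open
        self.texts = []   # one entry per outermost wanted element

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            if self.depth == 0:
                self.texts.append("")  # start a new result entry
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.WANTED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign
        if self.depth > 0:
            self.texts[-1] += data


collector = TagTextCollector()
collector.feed(label)
collector.close()
print(collector.texts)
```

Because the parser tracks open and close tags itself, nested or attribute-carrying tags such as <b class="x"> do not need special handling in the pattern.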
iamklaus

First, you should not use regexes to parse markup text.

That being said, the result is by design. The documentation for re.findall is explicit about it (emphasis mine):

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
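
To make the quoted rule concrete, here is a small sketch that runs findall on the sample string from the question with zero, one, and two capturing groups. The variable names are only for illustration, and the one-group pattern is an assumption of this example, not something from the question.

```python
import re

label = ("Customizable:<strong>Features Windows 10 Pro</strong> and legacy "
         "ports <b>including VGA,</b> HDMI, RJ-45, USB Type A connections.")

# No capturing groups: a list of the full matches, tags included
no_groups = re.findall(r"<b>.*?</b>|<strong>.*?</strong>", label)

# One capturing group: a flat list with that group's text per match.
# (?:...) groups the alternatives without capturing them.
one_group = re.findall(r"<(?:b|strong)>(.*?)</(?:b|strong)>", label)

# Two capturing groups: a list of 2-tuples; the group that did not
# take part in a match shows up as an empty string
two_groups = re.findall(r"<b>(.*?)</b>|<strong>(.*?)</strong>", label)

print(no_groups)
print(one_group)
print(two_groups)
```

Note that the one-group pattern does not check that the opening and closing tag names agree; it is enough for this sample but not a general solution.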

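Your pattern contains more than one group, covering the <b> and <strong> alternatives. Each match is therefore returned as a tuple with one entry per group, and the entries for groups that did not take part in the match are empty strings, so that you can tell which group was matched.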

If you do not like that, you could use finditer instead, which yields match objects. group(0) on a match object is the whole part of the string that was matched (tags included):

text = [m.group() for m in pattern.finditer(label)]
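
If only the text between the tags is wanted, one possible variation is to pick whichever capturing group took part in each match. The sketch below assumes a two-group pattern, one group per tag; the names label and inner are illustrative.

```python
import re

label = ("Customizable:<strong>Features Windows 10 Pro</strong> and legacy "
         "ports <b>including VGA,</b> HDMI, RJ-45, USB Type A connections.")

# Group 1 captures the text inside <b>, group 2 the text inside <strong>
pattern = re.compile(r"<b>(.*?)</b>|<strong>(.*?)</strong>")

# On a match object the group that did not participate is None
# (findall would show it as an empty string instead)
inner = [m.group(1) if m.group(1) is not None else m.group(2)
         for m in pattern.finditer(label)]

print(inner)
```

The results come back in the order the tags appear in the string, which here matches the flat list the question asks for.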
Serge Ballesta