
I have been trying to use the regex pattern >(\S.*?)<|#{1}\s+?(\w.*) with re.findall on the following input:

<h1 id="section">First Section</h1><a name="first section">
# Section_2

My expected result is two lists

["First Section"]
["Section_2"]

However, I get

["First Section",""]
["","Section_2"] 

Does anyone know what I am doing wrong?

Thanks,

ℂybernetician
  • You should use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to play with HTML files – Plopp Oct 23 '18 at 08:37
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Cid Oct 23 '18 at 08:39
  • Your results are fine: you ask for two capture groups, so you get a tuple with two elements for each match. Apart from parsing HTML with regex, you did nothing wrong; you just need one more step to process the `re.findall()` results ;) – Plopp Oct 23 '18 at 08:43
  • *"...what I am doing wrong?"* You're using regex to parse HTML. This seems to be your main mistake :) – Andersson Oct 23 '18 at 08:44
  • Why do you expect your regular expression to match both groups at once? The OR in the RE only needs one of the groups to match. If you want to match both at the same time, remove the OR and restructure your RE. Also, for HTML, use an HTML parser. – Corion Oct 23 '18 at 08:44
  • [Don't parse HTML with regex!](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Nordle Oct 23 '18 at 08:48

1 Answer


This works for your particular case. I tried to keep more or less the same structure as your regular expression, with some minor changes.

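import re

a = '<h1 id="section">First Section</h1><a name="first section">'
b = '# Section_2'

# One capturing group wraps both alternatives; the lookarounds keep the
# delimiters ('>', '<' and '# ') out of the captured text.
r = re.compile(r'((?<=>)\S.*?(?=<)|(?<=#{1}\s)\w.*)')

print(r.findall(a))  # ['First Section']
print(r.findall(b))  # ['Section_2']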

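The reason you get two values per match is that your pattern has two capturing groups, (\S.*?) and (\w.*). When a pattern has more than one group, re.findall returns one tuple per match with one element per group. Because the groups sit on opposite sides of the |, only one of them can take part in any given match, and the other is reported as an empty string.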
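To make the tuple behaviour concrete, here is a minimal sketch that runs the original pattern from the question over both lines at once and then collapses each tuple to its non-empty element. The names `text`, `matches` and `flat` are only illustrative, and joining the two lines with a newline is an assumption about how the input is laid out.

```python
import re

# Both sample lines from the question, joined by a newline (an assumption).
text = '<h1 id="section">First Section</h1><a name="first section">\n# Section_2'

# The original pattern: two capturing groups separated by an alternation.
pattern = re.compile(r'>(\S.*?)<|#{1}\s+?(\w.*)')

# With two groups, findall returns one 2-tuple per match.
matches = pattern.findall(text)
print(matches)  # [('First Section', ''), ('', 'Section_2')]

# Only one group takes part in each match, so keep the non-empty element.
flat = [first or second for first, second in matches]
print(flat)  # ['First Section', 'Section_2']
```

This is the "one more step" mentioned in the comments: the pattern stays as it is and the empty placeholders are dropped afterwards.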

In the regular expression for the answer I use only one capturing group, with the OR condition inside it, so re.findall returns a plain list of strings instead of tuples.
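If you would rather keep the two groups and still end up with two separate lists, one possible approach is to iterate with re.finditer and check which group took part in each match. This is only a sketch: the list names `html_titles` and `markdown_titles` are hypothetical, and the input is assumed to be the two lines from the question joined by a newline.

```python
import re

# Both sample lines from the question, joined by a newline (an assumption).
text = '<h1 id="section">First Section</h1><a name="first section">\n# Section_2'

# The original two-group pattern from the question, unchanged.
pattern = re.compile(r'>(\S.*?)<|#{1}\s+?(\w.*)')

html_titles = []      # hypothetical name: text captured by group 1
markdown_titles = []  # hypothetical name: text captured by group 2

for match in pattern.finditer(text):
    # lastindex is the index of the last capturing group that matched.
    if match.lastindex == 1:
        html_titles.append(match.group(1))
    else:
        markdown_titles.append(match.group(2))

print(html_titles)      # ['First Section']
print(markdown_titles)  # ['Section_2']
```

Compared with the single-group pattern, this keeps the information about which alternative produced each title.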

Carlos Azevedo