0

Hello Python experts!

I have strings like:

1. <li class="sli">First Session </li>
2. <li class="sli">Used <a class="xref" href="GUID-EEEEEE123-9ADD-E992-A982-CJHKL15414C-RTYBFDG.html">to initiate python session </li>

that I want to convert to:

1. First Session 
2. Used to initiate python session

Could you please help me with a regex? I am trying to write a regex that matches "<", ">", and whatever comes between them.

Please help.

roippi
  • 1
    see: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – kaveman Dec 11 '13 at 17:39
  • Please take a look at this answer: http://stackoverflow.com/a/1732454/257501 – Saša Šijak Dec 11 '13 at 17:55
  • The knee-jerk "never use regex for anything related to HTML" responses are frankly incorrect. If he has a list of strings that he wants to strip out all tags from, `re` is the correct tool for the job. – roippi Dec 11 '13 at 18:02
  • I'm in the anti-regex camp, BeautifulSoup should do it simply with `.string`. You should clarify if the `1. ` and `2. ` are part of the strings though, as I simply presumed they weren't (BeautifulSoup would handle it either way). – Prashant Kumar Dec 11 '13 at 20:49

2 Answers

1

This is a well-bounded problem, so using a regex for this operation is actually fine. You cannot reasonably parse arbitrary HTML with a regex, but you can easily strip out all of the tags in those strings.

Given that every tag is closed (each "<" has a matching ">"), this should work:

import re
pat = re.compile(r'<.*?>')

s = '2. <li class="sli">Used <a class="xref" href="GUID-EEEEEE123-9ADD-E992-A982-CJHKL15414C-RTYBFDG.html">to initiate python session </li>'

pat.sub('', s)
Out[15]: '2. Used to initiate python session '

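The key is making the part in between the angle brackets match lazily: the ? in .*? makes each match stop at the first ">" instead of running on to the last one in the string.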
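To make the lazy-versus-greedy difference concrete, here is a minimal sketch that runs both forms over a shortened version of the second string. The href value "x.html" and the variable names are placeholders chosen for illustration, not taken from the question.

```python
import re

# Shortened sample string; "x.html" is a placeholder href.
s = '<li class="sli">Used <a class="xref" href="x.html">to initiate python session </li>'

greedy = re.compile(r'<.*>')      # .* runs to the LAST '>' in the string
lazy = re.compile(r'<.*?>')       # .*? stops at the FIRST '>' after each '<'
negated = re.compile(r'<[^>]*>')  # alternative: anything except '>' between the brackets

greedy_result = greedy.sub('', s)    # one match swallows the whole string
lazy_result = lazy.sub('', s)        # each tag is removed separately
negated_result = negated.sub('', s)  # same result as the lazy form here

print(repr(greedy_result))
print(repr(lazy_result))
print(repr(negated_result))
```

The negated character class is a common alternative to the lazy quantifier. Unlike ".", it also matches newlines, so it would still work if a tag were split across lines.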

roippi
0

You can use re.sub if you really need to do it via a regular expression.

In your case it will be:

result = re.sub(r'<.*?>', '', data)

where data is the string that contains the tags you want removed and result is the output string. The capture groups around "<" and ">" are not needed, and re must be imported first.

However, as Max Noel said, it is normally better to use an HTML parser.
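For comparison, the parser route could look roughly like the sketch below. It assumes Python 3 and uses the standard library's html.parser rather than a third-party package. The names TextExtractor and strip_tags are made up for this example, and "x.html" is a placeholder href.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text outside of tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return markup with all tags dropped (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


sample = '2. <li class="sli">Used <a class="xref" href="x.html">to initiate python session </li>'
print(repr(strip_tags(sample)))
```

A parser also copes with a ">" inside an attribute value, which would end the regex match too early.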