I have been trying to use regex to parse an XML-style string that looks like this:

Input
"Joe Doe got a <span class="procedure">X ray</span> <- in April blah blah <span <- class="disease">lacerations</span> blah <span <- class="anatomy">kidney</span>."

For each span I want to match three groups:
"<span class="blah">blah</span>" , class, textual content

For example:
<span class="procedure">X ray</span>
the matches are:
<span class="procedure">X ray</span>, procedure, X ray

So far I have been able to use re.search('<.+?>', xml) to find <span class="procedure">

Even with re.search('<.+?>+', xml), I have had no luck finding the other strings. Instead it gave <span class="procedure">X ray</span> <- in April>, which isn't the desired result either.

psn1997
lagn91
  • Related: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – David Zemens Sep 25 '19 at 13:52
  • I did research this topic well before posting this question & am well aware of the multitude of tools that would perform the required tasks much easier. However, it was posed to me to perform this task using regex which is clearly a bit of a challenge due to the limitations of regex and XML/HTML. – lagn91 Sep 25 '19 at 17:04
  • Cheers, glad someone was able to provide an answer before the topic was closed as a duplicate. For future advice: if you lay out what research you've done, and explain "Yes, I know I generally shouldn't do this, but I'm required to as part of a test/homework/etc." it will be less likely to be closed for any reason! – David Zemens Sep 25 '19 at 18:09
  • Ah I see, thanks for the info! Will be sure to apply next time! – lagn91 Sep 25 '19 at 19:35

1 Answer

Regex is not the best tool for nested formats like XML. But if your input is really that simple, re.findall(r'<span.+?<\/span>', yourstring) will return each complete span element as a string. To also get the class and the textual content, the pattern needs capture groups.
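To get the three values the question asks for (whole element, class, textual content), one option is to wrap the parts of the pattern in capture groups. The following is only a minimal sketch under a few assumptions: the stray "<-" markers in the sample input are not part of the real data, every span carries exactly one double-quoted class attribute, and spans are never nested. The names text and SPAN_RE are illustrative, not from the original post.

```python
import re

# Sample input modelled on the question, with the "<-" markers left out (assumption).
text = (
    'Joe Doe got a <span class="procedure">X ray</span> in April blah blah '
    '<span class="disease">lacerations</span> blah '
    '<span class="anatomy">kidney</span>.'
)

# Group 1: the whole element, group 2: the class value, group 3: the text content.
# The lazy .*? stops at the first closing tag, so this breaks on nested spans.
SPAN_RE = re.compile(r'(<span\s+class="([^"]*)">(.*?)</span>)')

# With several groups, findall returns one tuple per match.
matches = SPAN_RE.findall(text)

for whole, cls, content in matches:
    print(whole, cls, content, sep=", ")
```

With more than one group in the pattern, re.findall returns a list of tuples rather than a list of strings, which is why each match unpacks into three values. For anything beyond flat, predictable markup like this, a real HTML/XML parser is the safer choice.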

Robert Kearns