0

How can I extract the content (how are you) from the string:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. 

Can I use regex for the purpose? if possible whats suitable regex for it.

Note: I dont want to use split function for extract the result. Also can you suggest some links to learn regex for a beginner.

I am using python2.7.2

johnsyweb
  • 136,902
  • 23
  • 188
  • 247
Jisson
  • 3,566
  • 8
  • 38
  • 71
  • 1
    Can the string contain any XML escapes such as '&' or even a CDATA section? If so then you should extract the XML-like bit from the start of the string and use an XML parser. – Duncan Jan 27 '12 at 10:49

4 Answers4

2

You could use a regular expression for this (as Joey demonstrates).

However if your XML document is any bigger than this one-liner you could not since XML is not a regular language.

Use BeautifulSoup (or another XML parser) instead:

>>> from BeautifulSoup import BeautifulSoup
>>> xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. '
>>> soup = BeautifulSoup(xml_as_str)
>>> print soup.text
how are you.

Or...

>>> for string_tag in soup.findAll('string'):
...     print string_tag.text
... 
how are you
Community
  • 1
  • 1
johnsyweb
  • 136,902
  • 23
  • 188
  • 247
  • 2
    Forgive me to correct you here, but a single XML element is definitely regular. XML only becomes non-regular if you introduce element nesting (but the name of the `string` element doesn't really imply arbitrary nesting, so this might be perfectly feasible). Also, isn't BeautifulSoup for parsing malformed HTML? Better to use an actual XML parser, I guess. – Joey Jan 27 '12 at 10:25
  • Thanks for the feedback: I've updated my wording. This string is not well-formed XML, either (the full-stop at the end). – johnsyweb Jan 27 '12 at 10:36
0
(?<=<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">)[^<]+(?=</string>)

would match what you want, as a trivial example.

(?<=<)[^<]+

would, too. It all depends a bit on how your input is formatted exactly.

Joey
  • 344,408
  • 85
  • 689
  • 683
0

Try with following regex:

/<[^>]*>(.*?)</
hsz
  • 148,279
  • 62
  • 259
  • 315
0

This will match a generic HTML tag (Replace "string" with the tag you want to match):

/<string[^<]*>(.*?)<\/string>/i

(i=case insensitive)

Kols
  • 83
  • 6