Extract string using regex

Question

How can I extract the content ("how are you") from this string:

<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.

Can I use a regex for this? If so, what would be a suitable regex?

Note: I don't want to use the split function to extract the result. Also, can you suggest some links for a beginner to learn regex?

I am using Python 2.7.2.

Can the string contain any XML escapes such as '&' or even a CDATA section? If so then you should extract the XML-like bit from the start of the string and use an XML parser. — Duncan, Jan 27 '12 at 10:49

score 2 · Accepted Answer · edited May 23 '17 at 11:56


You could use a regular expression for this (as Joey demonstrates).

However, if your XML document is any bigger than this one-liner, a regular expression will no longer be enough, since XML is not a regular language.

Use BeautifulSoup (or another XML parser) instead:

>>> from BeautifulSoup import BeautifulSoup
>>> xml_as_str = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>. '
>>> soup = BeautifulSoup(xml_as_str)
>>> print soup.text
how are you.

Or...

>>> for string_tag in soup.findAll('string'):
...     print string_tag.text
... 
how are you
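As a rough sketch of the "another XML parser" route, the standard library's xml.etree.ElementTree can read the element once the trailing full stop is cut off (the input as given is not well-formed XML). The variable names and the trimming step (keep everything up to the last `>`) are assumptions made for this example, not part of the answer above, and the snippet is written for Python 3.

```python
import xml.etree.ElementTree as ET

raw = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'

# Assumption: anything after the last '>' is stray text (here, the full stop),
# so keep only the XML-like part before handing it to the parser.
xml_part = raw[:raw.rfind('>') + 1]

root = ET.fromstring(xml_part)

# The parser decodes XML escapes such as &amp; while reading the text.
content = root.text
print(content)
```

Because of the xmlns attribute, root.tag carries the namespace in braces in front of the name string, which matters if you later look the element up by name.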

edited May 23 '17 at 11:56

Community


answered Jan 27 '12 at 10:23

johnsyweb


Forgive me to correct you here, but a single XML element is definitely regular. XML only becomes non-regular if you introduce element nesting (but the name of the `string` element doesn't really imply arbitrary nesting, so this might be perfectly feasible). Also, isn't BeautifulSoup for parsing malformed HTML? Better to use an actual XML parser, I guess. – Joey Jan 27 '12 at 10:25
Thanks for the feedback: I've updated my wording. This string is not well-formed XML, either (the full-stop at the end). – johnsyweb Jan 27 '12 at 10:36

Joey · Answer 2 · 2012-01-27T10:30:21.933

0

(?<=<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">)[^<]+(?=</string>)

would match what you want, as a trivial example.

(?<=>)[^<]+

would, too. It all depends a bit on how your input is formatted exactly.
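For reference, here is one way the first, fully spelled-out pattern might be used from Python's re module. This is only a sketch: the variable names are made up for the example, the dots in the URL are escaped so they match literally, and it is written for Python 3 (the single-argument print call behaves the same on 2.7).

```python
import re

text = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'

# Python's re only accepts fixed-width lookbehinds; a literal opening tag qualifies.
pattern = re.compile(
    r'(?<=<string xmlns="http://schemas\.microsoft\.com/2003/10/Serialization/">)'
    r'[^<]+'           # the element text: everything up to the next '<'
    r'(?=</string>)'   # must be followed by the closing tag
)

match = pattern.search(text)
content = match.group(0) if match else None
print(content)
```

Since the tags sit inside lookarounds, group(0) is already just the text between them, so no capture group is needed.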

edited Jan 27 '12 at 10:30

answered Jan 27 '12 at 10:24

Joey


score 0 · Answer 3 · answered Jan 27 '12 at 10:24


Try the following regex:

/<[^>]*>(.*?)</
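The slashes in that pattern are regex delimiters from other languages; Python's re module takes the pattern without them. A minimal sketch of how it might be applied, with variable names invented for the example and Python 3 assumed:

```python
import re

text = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'

# Same pattern without the surrounding slashes:
#   <[^>]*>  skips the opening tag
#   (.*?)    lazily captures the text up to the next '<'
match = re.search(r'<[^>]*>(.*?)<', text)

# Group 1 holds the captured text; group 0 would include the tag as well.
content = match.group(1) if match else None
print(content)
```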

answered Jan 27 '12 at 10:24

hsz


score 0 · Answer 4 · answered Jan 27 '12 at 10:35


This will match a generic HTML tag (Replace "string" with the tag you want to match):

/<string[^<]*>(.*?)<\/string>/i

(i=case insensitive)
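In Python the trailing i flag corresponds to re.IGNORECASE, and the tag name can be substituted in rather than edited by hand. The helper below, extract_tag_text, is a hypothetical name used only for this sketch; it is not a library function. Python 3 is assumed.

```python
import re

def extract_tag_text(tag, text):
    """Return the text of every <tag ...>...</tag> pair found in text.

    Hypothetical helper: it builds the pattern from the answer above,
    with the tag name escaped so it is matched literally.
    """
    name = re.escape(tag)
    pattern = r'<%s[^<]*>(.*?)</%s>' % (name, name)
    # With a single capture group, findall returns the captured strings.
    return re.findall(pattern, text, flags=re.IGNORECASE)

text = '<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">how are you</string>.'
results = extract_tag_text('string', text)
print(results)
```

As with any regex approach here, this assumes the element has no nested tags of the same name and that its text does not span lines (the dot does not match newlines without re.DOTALL).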

answered Jan 27 '12 at 10:35

Kols

