
Is it possible to write a Python regex that matches any valid English sentence, which may contain alphanumeric and special characters?
I want to extract some specific elements from an XML file. These elements have the following form:

<p o=<Any Number>> <Any English sentence> </p>  

For example:

<p o ="1"> The quick brown fox jumps over the lazy dog </p>

or

<p o ="2">  And This is a number 12.90! </p>

It is easy to write a regex for the opening tag

<p o=<Any Number>>

and for the closing </p> tag, but I am interested in extracting the sentences between these tags with a regex group.

Can anyone suggest a regex for the problem above?

A workaround that avoids regex would also be really helpful.

Sergiu Dumitriu
swap310
    [Here is a good explanation](http://stackoverflow.com/a/1732454/458723) of why you should use something like BeautifulSoup or lxml to parse XML rather than a regex. – Kirill May 25 '12 at 11:06

2 Answers


Use an XML parser such as lxml; regex is not suitable for this task. Example:

import lxml.etree

# First we parse the XML
doc = lxml.etree.fromstring('<p o ="2">  And This is a number 12.90! </p>')
# Then we use XPath to extract the text we need (xpath() returns a list of matches)
doc.xpath('/p/text()')

You can read more about XPath at: XPath tutorial.

Kien Truong

You really should use an XML parser. There is an example here: http://www.travisglines.com/web-coding/python-xml-parser-tutorial.
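
In case the link goes stale, here is a minimal sketch of the same idea using only the standard library's xml.etree.ElementTree. The <doc> wrapper element, the sample data, and the helper name extract_sentences are assumptions made for illustration; the question does not show what the surrounding file looks like.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the <doc> root is an assumption,
# since the question only shows the individual <p> elements.
SAMPLE = """<doc>
  <p o ="1"> The quick brown fox jumps over the lazy dog </p>
  <p o ="2">  And This is a number 12.90! </p>
  <p>No o attribute, so this one is skipped</p>
  <q o="3">Not a p element</q>
</doc>"""


def extract_sentences(xml_text):
    """Return (o, sentence) pairs for every <p> whose o attribute is a number."""
    root = ET.fromstring(xml_text)
    results = []
    # iter("p") walks the whole tree and yields only <p> elements
    for p in root.iter("p"):
        o = p.get("o")
        if o is not None and o.isdigit():
            # p.text is the text directly inside the element (None if empty)
            results.append((int(o), (p.text or "").strip()))
    return results


sentences = extract_sentences(SAMPLE)
print(sentences)
```

The parser takes care of the tag syntax (including the space in `o ="1"`), so the sentence itself never has to be described by a pattern. For a real file, ET.parse(path).getroot() could replace ET.fromstring.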

Pete