Python regex to match any valid English sentence

Question

Is it possible to write a Python regex that matches any valid English sentence, which may contain alphanumeric and special characters?
I want to extract specific elements from an XML file. These elements have the following form:

<p o=<Any Number>> <Any English sentence> </p>

For example:

<p o ="1"> The quick brown fox jumps over the lazy dog </p>

or

<p o ="2">  And This is a number 12.90! </p>

Writing a regex for the opening tag

<p o=<Any Number>>

and the closing </p> tag is easy. What I want is to extract the sentence between these tags with a regex group.

Can anyone suggest a regex for this problem?

If you can suggest an alternative approach, that would be helpful as well.

[Here is a good explanation](http://stackoverflow.com/a/1732454/458723) of why you should use something like BeautifulSoup or lxml to parse XML rather than a regex. - Kirill, May 25 '12 at 11:06

Kien Truong · Accepted Answer · 2012-05-25T11:15:16.830

9

Use an XML parser like lxml; regex is not suitable for this task. Example:

import lxml.etree

# First, parse the XML
doc = lxml.etree.fromstring('<p o ="2">  And This is a number 12.90! </p>')
# Then use XPath to extract the text of the element we need
# (xpath() returns a list of the matching text nodes)
doc.xpath('/p/text()')

You can read more about XPath at: XPath tutorial.
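If lxml is not installed, the same idea can be sketched with the standard library's `xml.etree.ElementTree`. The sample document below, its `<root>` wrapper, and the helper name `extract_sentences` are assumptions made for illustration; they are not part of the question. The sketch keeps only `<p>` elements whose `o` attribute is a number, as the question describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the <root> wrapper is an assumption.
SAMPLE = """<root>
  <p o ="1"> The quick brown fox jumps over the lazy dog </p>
  <p o ="2">  And This is a number 12.90! </p>
  <p o ="x"> Not selected: o is not a number </p>
  <q o ="3"> Not selected: wrong tag </q>
</root>"""


def extract_sentences(xml_text):
    """Return (o, sentence) pairs for every <p> whose o attribute is numeric."""
    root = ET.fromstring(xml_text)
    results = []
    for p in root.iter("p"):
        o = p.get("o", "")
        # Keep only elements whose o attribute consists of decimal digits.
        if o.isdecimal():
            # p.text is None for an empty element, hence the fallback.
            results.append((int(o), (p.text or "").strip()))
    return results


print(extract_sentences(SAMPLE))
```

The parser deals with details such as the optional whitespace around `=` in the attribute, so no pattern for "any English sentence" is needed at all.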

edited May 25 '12 at 11:15

answered May 25 '12 at 11:06

Kien Truong


score 1 · Answer 2 · answered May 25 '12 at 11:08


You should really use an XML parser. There is an example here: http://www.travisglines.com/web-coding/python-xml-parser-tutorial.
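One case where a parser may help more than a regex is a sentence that contains inline markup. The following is a minimal sketch using the standard library; the `<b>` tag in the sample and the helper name `sentence_text` are hypothetical and do not come from the question.

```python
import xml.etree.ElementTree as ET


def sentence_text(p_xml):
    """Return the full text of a <p> element, including text inside child tags."""
    p = ET.fromstring(p_xml)
    # itertext() yields the element's text and the text of all descendants
    # in document order; split/join collapses runs of whitespace.
    return " ".join("".join(p.itertext()).split())


print(sentence_text('<p o ="3"> The <b>quick</b> brown fox, 12.90! </p>'))
```

A regex group between the opening and closing tags would return the `<b>` markup along with the words, whereas the parser yields only the text.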

answered May 25 '12 at 11:08

Pete

