Parsing XML in Python with regex

Question

I'm trying to use regex to parse an XML file (in my case this seems the simplest way).

For example a line might be:

line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

To access the text for the tag City_State, I'm using:

attr = re.match('>.*<', line)

but nothing is being returned.

Can someone point out what I'm doing wrong?

I am compelled to link [this answer](http://stackoverflow.com/a/1732454/78845). — johnsyweb, Aug 11 '13 at 04:21
Using a proper XML library isn't hard once you find a library you like. I found [ElementTree](http://docs.python.org/2/library/xml.etree.elementtree.html) the nicest one to use in the standard library, and [untangle](https://github.com/stchris/untangle) the easiest (it converts XML into regular dictionaries/lists etc.) — dbr, Aug 11 '13 at 04:32
>Can someone point out what I'm doing wrong? A: you are trying to parse XML using regular expressions. — Michael Kay, Aug 11 '13 at 12:10
I have tried ElementTree before and I am getting memory issues. The file size is 250Mb. Since the XML file I am parsing is very simple, I figured it is easier to use regex. — user2671656, Aug 11 '13 at 12:38
Try `etree.iterparse` (e.g. lxml.etree), or SAX events. Both have very small memory requirements. — xmedeko, Jul 03 '18 at 08:54
Reopening this question - it's a perfectly valid question as to why `re.match` isn't working. Linking the ubiquitous [X]HTML-regex question adds nothing — TerryA, Sep 01 '19 at 15:45
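The streaming approach suggested in the comments can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. This is only a sketch under assumptions: the `<Records>`/`<Record>` wrapper elements and the second sample value are hypothetical, since the question shows a single `<City_State>` line and not the file's overall layout.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the large file. The <Records>/<Record>
# wrappers and the second value are assumptions; only <City_State> and the
# first value come from the question.
SAMPLE = b"""<Records>
  <Record><City_State>PLAINSBORO, NJ 08536-1906</City_State></Record>
  <Record><City_State>PRINCETON, NJ 08540</City_State></Record>
</Records>"""


def iter_city_states(source):
    """Yield the text of each <City_State> element as it finishes parsing."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "City_State":
            yield elem.text
        elif elem.tag == "Record":
            # Discard the finished record's children to keep memory use low.
            elem.clear()


city_states = list(iter_city_states(io.BytesIO(SAMPLE)))
print(city_states)
```

For a real file, a filename or an open binary file object can be passed in place of the `io.BytesIO` wrapper.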

TerryA · Accepted Answer · 2013-08-11T04:25:09.137

You normally don't want to use re.match. Quoting from the docs:

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Note:

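>>> import re
>>> line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> print(re.match('>.*<', line))
None
>>> print(re.search('>.*<', line))
<re.Match object; span=(11, 38), match='>PLAINSBORO, NJ 08536-1906<'>
>>> print(re.search('>.*<', line).group(0))
>PLAINSBORO, NJ 08536-1906<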
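Note that group(0) still includes the surrounding `>` and `<`. If regex is used anyway, a capturing group is one way to get just the text. The sketch below assumes the tag sits on a single line with no attributes and no nested markup, which holds for the example line but may not hold for the whole file.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# Capture only what sits between the opening and closing City_State tags.
# The non-greedy (.*?) stops at the first closing tag rather than the last.
match = re.search(r'<City_State>(.*?)</City_State>', line)

# search() returns None when nothing matches, so guard before calling group().
city_state = match.group(1) if match else None
print(city_state)
```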

Also, why parse XML with regex when you can use something like BeautifulSoup? :)

>>> from bs4 import BeautifulSoup as BS
>>> line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> soup = BS(line, 'html.parser')  # name the parser explicitly; it lowercases tag names
>>> print(soup.find('city_state').text)
PLAINSBORO, NJ 08536-1906

score 9 · Answer 2 · answered Aug 11 '13 at 09:43

Please, just use an XML parser like ElementTree:

>>> from xml.etree import ElementTree as ET
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'
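One practical difference shows up with character entities and CDATA sections: a parser decodes them, while a regex hands back the raw markup. The two sample lines below are hypothetical, made up to illustrate the point, and the regex is just one plausible pattern for this tag.

```python
import re
from xml.etree import ElementTree as ET

# Hypothetical lines, invented for illustration: one with a character entity,
# one with a CDATA section.
with_entity = '<City_State>BRYN MAWR &amp; ROSEMONT, PA</City_State>'
with_cdata = '<City_State><![CDATA[PLAINSBORO, NJ 08536-1906]]></City_State>'

pattern = re.compile(r'<City_State>(.*?)</City_State>')

# The regex returns the markup exactly as written in the file.
regex_entity = pattern.search(with_entity).group(1)
regex_cdata = pattern.search(with_cdata).group(1)

# The parser decodes the entity and unwraps the CDATA section.
parsed_entity = ET.fromstring(with_entity).text
parsed_cdata = ET.fromstring(with_cdata).text

print(regex_entity, '|', parsed_entity)
print(regex_cdata, '|', parsed_cdata)
```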

Viktor Kerkez


score 0 · Answer 3 · answered Aug 11 '13 at 04:26

re.match returns a match only if the pattern matches the entire string. To find substrings matching the pattern, use re.search.

And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.

Kyle


It would only be "a simple way to parse XML" if it actually _did_ parse XML. Which it doesn't. (See: lack of support for detecting comment or CDATA blocks; for handling character entities; etc etc etc). – Charles Duffy Aug 11 '13 at 05:04
Minor point: `re.match` is left side anchored but does not have to consume the entire string. Very loosely, given regexp `X`, `re.match` is like `re.search` using `^X` (but not `^X$`). There are other differences, particularly with strings containing newlines; see documentation link in [Haidro's answer](http://stackoverflow.com/a/18168699/1256452). – torek Aug 11 '13 at 05:04
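The anchoring behaviour described in the comment above can be checked directly. This sketch uses the line from the question and `re.fullmatch` (available in Python 3.4 and later), which is the call that does require the whole string to match.

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# match() only tries at position 0, where the string starts with '<', not '>'.
anchored = re.match(r'>.*<', line)

# match() succeeds on a prefix; it need not consume the entire string.
prefix = re.match(r'<City_State>', line)

# fullmatch() fails here because the pattern does not cover the whole string.
whole = re.fullmatch(r'<City_State>', line)

# search() scans forward and reports the first position where the pattern fits.
found = re.search(r'>.*<', line).group(0)

print(anchored, prefix.group(0), whole, found)
```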
