16

I'm trying to use regex to parse an XML file (in my case this seems the simplest way).

For example a line might be:

line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

To access the text for the tag City_State, I'm using:

attr = re.match('>.*<', line)

but nothing is being returned.

Can someone point out what I'm doing wrong?

user2671656
  • I am compelled to link [this answer](http://stackoverflow.com/a/1732454/78845). – johnsyweb Aug 11 '13 at 04:21
  • Using a proper XML library isn't hard once you find a library you like. I found [ElementTree](http://docs.python.org/2/library/xml.etree.elementtree.html) the nicest one to use in the standard library, and [untangle](https://github.com/stchris/untangle) the easiest (it converts XML into regular dictionaries/lists etc.) – dbr Aug 11 '13 at 04:32
  • Dang, @Johnsyweb beat me to it! – torek Aug 11 '13 at 04:58
  • >Can someone point out what I'm doing wrong? A: you are trying to parse XML using regular expressions. – Michael Kay Aug 11 '13 at 12:10
  • I have tried ElementTree before and ran into memory issues. The file size is 250 MB. Since the XML file I am parsing is very simple, I figured it would be easier to use regex. – user2671656 Aug 11 '13 at 12:38
  • Try `etree.iterparse` (e.g. lxml.etree), or SAX events. Both have very small memory requirements. – xmedeko Jul 03 '18 at 08:54
  • Reopening this question - it's a perfectly valid question as to why `re.match` isn't working. Linking the ubiquitous [X]HTML-regex question adds nothing – TerryA Sep 01 '19 at 15:45

3 Answers

22

You normally don't want to use re.match. Quoting from the docs:

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).

Note:

>>> print(re.match('>.*<', line))
None
>>> print(re.search('>.*<', line))
<_sre.SRE_Match object at 0x10f666238>
>>> print(re.search('>.*<', line).group(0))
>PLAINSBORO, NJ 08536-1906<
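
Note that group(0) still includes the surrounding `>` and `<`. To get only the text, a capture group can be used. This is a minimal sketch, assuming each line holds a single simple element as in the question; the pattern and the `text` name are illustrative and not part of the original answer:

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# [^<]* stops at the first '<', so group(1) holds only the element text.
m = re.search(r'>([^<]*)<', line)
text = m.group(1) if m else None  # None if the line has no '>...<' span
print(text)
```

This still does not handle entities, comments or CDATA, so it only suits input as regular as the example line.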

Also, why parse XML with regex when you can use something like BeautifulSoup? :)

>>> from bs4 import BeautifulSoup as BS
>>> line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> soup = BS(line, 'html.parser')  # name the parser explicitly; it lowercases tag names
>>> print(soup.find('city_state').text)
PLAINSBORO, NJ 08536-1906
TerryA
9

Please, just use an XML parser like ElementTree:

>>> from xml.etree import ElementTree as ET
>>> line='<City_State>PLAINSBORO, NJ 08536-1906</City_State>'
>>> ET.fromstring(line).text
'PLAINSBORO, NJ 08536-1906'
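
For a file as large as the 250 MB one mentioned in the comments, `ET.iterparse` may be a better fit than loading everything at once, since it yields elements as they are read. The following is only a sketch: the `<Records>`/`<Record>` wrapper and the second sample value are made up for illustration, because the question does not show the real file layout.

```python
import io
from xml.etree import ElementTree as ET

# Stand-in for the real file. The Records/Record wrapper and the second
# value are hypothetical; only the first City_State comes from the question.
xml_data = io.BytesIO(
    b'<Records>'
    b'<Record><City_State>PLAINSBORO, NJ 08536-1906</City_State></Record>'
    b'<Record><City_State>EXAMPLE CITY, ZZ 00000</City_State></Record>'
    b'</Records>'
)

cities = []
# 'end' events fire once an element has been read completely.
for event, elem in ET.iterparse(xml_data, events=('end',)):
    if elem.tag == 'City_State':
        cities.append(elem.text)
    elif elem.tag == 'Record':
        elem.clear()  # release the finished record's children
print(cities)
```

With a real file, a path or an open file object would be passed to `ET.iterparse` in place of `xml_data`.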
Viktor Kerkez
0

re.match returns a match only if the pattern matches at the beginning of the string; it does not need to consume the entire string. To find a match anywhere in the string, use re.search.
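
The anchoring behaviour can be checked directly on the line from the question. This is a small sketch; the variable names are illustrative, and `re.fullmatch` (Python 3.4+) is included only for contrast:

```python
import re

line = '<City_State>PLAINSBORO, NJ 08536-1906</City_State>'

# match() only tries position 0, and the line starts with '<', not '>'.
at_start = re.match('>.*<', line)
# search() scans forward until the pattern fits somewhere.
anywhere = re.search('>.*<', line)
# match() succeeds when the pattern fits at the start; the rest is ignored.
prefix = re.match('<City_State>', line)
# fullmatch() is the variant that must consume the entire string.
whole = re.fullmatch('<City_State>', line)

print(at_start, anywhere.group(0), prefix.group(0), whole)
```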

And yes, this is a simple way to parse XML, but I would highly encourage you to use a library specifically designed for the task.

Kyle
  • It would only be "a simple way to parse XML" if it actually _did_ parse XML. Which it doesn't. (See: lack of support for detecting comment or CDATA blocks; for handling character entities; etc etc etc). – Charles Duffy Aug 11 '13 at 05:04
  • Minor point: `re.match` is left side anchored but does not have to consume the entire string. Very loosely, given regexp `X`, `re.match` is like `re.search` using `^X` (but not `^X$`). There are other differences, particularly with strings containing newlines; see documentation link in [Haidro's answer](http://stackoverflow.com/a/18168699/1256452). – torek Aug 11 '13 at 05:04