Python regex issue

Question

I'm trying to extract ALL phone screen resolutions from the WURFL XML file with the below Python script. The problem is that I only get the first match, though. Why? How could I get all matches?

The WURFL XML file can be found at http://sourceforge.net/projects/wurfl/files/WURFL/latest/wurfl-latest.zip/download?use_mirror=freefr

def read_file(file_name):
    f = open(file_name, 'rb')
    data = f.read()
    f.close()
    return data

text = read_file('wurfl.xml')

import re
pattern = '<device id="(.*?)".*actual_device_root="true">.*<capability name="resolution_width" value="(\d+)"/>.*<capability name="resolution_height" value="(\d+)"/>.*</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print(m)

Tim Pietzcker · Answer 1 · 2011-05-11T12:12:24.370

1

First, use an XML parser instead of regular expressions. You'll be happier in the long run.

Second, if you insist on using regexes, use finditer() instead of findall().

Third, your regex matches from the first entry to the last one (the .* is greedy, and you have set DOTALL mode), so either see the first paragraph or at least change your regex to

pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'

Also, always use raw strings with regexes. \d happens to work, \b will behave unexpectedly in a "normal" string, though.

edited May 11 '11 at 12:12

answered May 11 '11 at 11:29

Tim Pietzcker

328,213
58
503
561

Thanks for your input but I still only get one match – AOO May 11 '11 at 11:37
Oops, I overlooked one greedy quantifier. Please try again with the edited regex. – Tim Pietzcker May 11 '11 at 12:12

score 0 · Answer 2 · edited May 23 '17 at 12:04

0

This is an oddness in the behaviour of findall, specifically findall only returns the first matching group from each pattern match. See this question.

edited May 23 '17 at 12:04

Community

1
1

answered May 11 '11 at 11:10

verdesmarald

11,646
2
44
60

Clarification: I get the groups I'm interested in but findall only returns ONE match (the first match) for some reason. – AOO May 11 '11 at 11:23

score 0 · Answer 3 · answered May 11 '11 at 11:38

You are using "greedy" matches: .* will match as much text as it can grab, which means the .* before <capabilities> matches most of the file.

text = open('wurfl.xml').read()
pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print m

score 0 · Answer 4 · answered May 11 '11 at 12:39

I'm certainly not averse to handling xml with a regexp if the requirements are simple, but perhaps in this case using a real xml parser would be better. Using the stdlib etree module and a sprinkling of (imho) hideous xpaths:

import xml.etree.ElementTree as ET

def capability_value(cap_elem):
    if cap_elem is None:
        return None
    return int(cap_elem.attrib.get('value'))

def devices(wurfl_doc):
    for el in wurfl_doc.findall("/devices/device[@actual_device_root='true']"):
        width = el.find("./group[@id='display']/capability[@name='resolution_width']")
        width = capability_value(width)
        height = el.find("./group[@id='display']/capability[@name='resolution_height']")
        height = capability_value(height)
        device = {
            'id' : el.attrib.get('id'), 
            'resolution' : {'width': width, 'height': height}
        }
        yield device

doc = ET.ElementTree(file='wurfl.xml')
for device in devices(doc):
    print device

Python regex issue

4 Answers4