How to get the corresponding chars with a pattern in Python

Question

I am now working on a function that can convert a bracket pattern like '[a-c]' to 'a', 'b' and 'c'.

i do not mean to do pattern matching in Python. I mean something that i can use '[a-c]' as input, and output the corresponding 'a', 'b' and 'c' which is valid matching chars for '[a-c]' in python regular expression. I want the matching chars.

we only have to consider [a-zA-Z0-9_-] as the valid chars in bracket.
No more modifiers like '*' or '+' or '?' considered.

However, it is very hard to write a robust one because we have so many situations to be considered. So, i want to know if there exists some tools to do this in Python?

Note: this one has some bug as noted by @swenzel. I have write a function to do that work. You can check it out in this Gist

I recommend the way @swenzel do in his second proposal. For more info about re.findall, please have a look at doc

You mean Regular Expressions? https://docs.python.org/2/library/re.html — Hitesh Dharamdasani, May 21 '15 at 10:44
@HiteshDharamdasani, no, i do not mean to do pattern matching. I mean something that i can use '[a-c]' as input, and output the corresponding 'a', 'b' and 'c' which is valid matching chars for '[a-c]' in python regular expression. — andy, May 21 '15 at 10:49
@andy So why is your question tagged with 'pattern-matching'? Describe what exactly you're trying to achieve. What should your function be able to expand? Only the `-` construct? — geckon, May 21 '15 at 10:50
@geckon, if that make the question no clearly, i have deleted the tag. — andy, May 21 '15 at 10:51
You are going to run into polynomial complexity very very quickly. Since you are going to have to either brute force(intelligently or not), The only way you can get all possible matches is to enumerate the problem space, And that can be very long. — Hitesh Dharamdasani, May 21 '15 at 10:53
@HiteshDharamdasani, so there is no available tools can do something like this? — andy, May 21 '15 at 10:55
Not that I am aware of What you are trying to do is close to something that was asked here http://stackoverflow.com/questions/2465719/generating-a-list-of-values-a-regex-could-match-in-python — Hitesh Dharamdasani, May 21 '15 at 10:59
@HiteshDharamdasani, yes, something like that. However, mine is more simple. **We only have to consider [a-zA-Z0-9_-] as the valid chars in bracket.** — andy, May 21 '15 at 11:01
Yup. That's all you need for things to go crazy. The only way i see this to be remotely possible is if you are not going to allow the '+' and '*' operator. and just parsing 'a' and 'c' out of [a-c] and looking up the alphabet table — Hitesh Dharamdasani, May 21 '15 at 11:04
@HiteshDharamdasani, yes, i just need the matching chars and no more modifiers like '+' or '*' will be considered. — andy, May 21 '15 at 11:05
@geckon, underscore character. we consider it is valid to be used in bracket pattern. — andy, May 21 '15 at 11:06
WARNING: This is just a demo. Donald Knuth will probably hit himself looking at this attempt https://gist.github.com/hiteshd/1482fe9ea22753f99744#file-madeness-py — Hitesh Dharamdasani, May 21 '15 at 11:25

swenzel · Accepted Answer · 2015-05-21T13:17:06.253

This sounds like homework... but so be it.
From what I understood, you need a parser for your range definition.
There you go:

def parseRange(rangeStr, i=0):
    # Recursion anchor, return empty set if we're out of bounds
    if i >= len(rangeStr):
        return set()

    # charSet will tell us later if we actually have a range here
    charSet = None

    # There can only be a range if we have more than 2 characters left in the
    # string and if the next character is a dash
    if i+2 < len(rangeStr) and rangeStr[i+1] == '-':

        # We might have a range. Valid ranges are between the following pairs of
        # characters
        pairs = [('a', 'z'), ('A', 'Z'), ('0', '9')]

        for lo, hi in pairs:
            # We now make use of the fact that characters are comparable.
            # Also the second character should come after the first, or be
            # the same which means e.g. 'a-a' -> 'a'
            if (lo <= rangeStr[i] <= hi) and \
               (rangeStr[i] <= rangeStr[i+2] <= hi):
                   # Retreive the set with all chars from the substring
                   charSet = parseRange(rangeStr, i+3)

                   # Extend the chars from the substring with the ones in this
                   # range.
                   # `range` needs integers, so we transform the chars to ints
                   # using ord and make use of the fact that their ASCII
                   # representation is ascending
                   charSet.update(chr(k) for k in
                           range(ord(rangeStr[i]), 1+ord(rangeStr[i+2])))
                   break

    # If charSet is not yet defined this means that at the current position
    # there is not a valid range definition. So we just get all chars for the
    # following subset and add the current char
    if charSet is None:
        charSet = parseRange(rangeStr, i+1)
        charSet.add(rangeStr[i])

    # Return the char set with all characters defined within rangeStr[i:]
    return charSet

It might not be the most elegant parser but it works. Also you have to strip the square brackets when calling it, but you can do that easily e.g. with slicing [1:-1].

Another very short, dump and easy solution using the parser from re is this:

def parseRangeRe(rangeStr):
    master_pattern = "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_-"
    matcher = re.compile(rangeStr)
    return set(matcher.findall(master_pattern))

i have write a similar one with your first function in [Gist](https://gist.github.com/hiteshd/1482fe9ea22753f99744#comment-1458140). I have not thought about the second one you proposed. It is robust but i think it will be slow with each regular expression we will match (26 + 26 + 10 + 2) times. — andy, May 21 '15 at 13:02
@andy checking 64 times for a match is far from being slow, especially since you have only one char to match. And none of your test cases contains more than two range definitions. The only problem here is that you cannot influence what is valid and what is not. E. g. `'[3cN-_]'` is perfectly valid regex but you want it to throw an exception which is not going to happen using `re` — swenzel, May 21 '15 at 13:10
you are right about the `[3cN-_]`. As i have said it is hard to write a robust one myself. Thanks for the tip and i think i will adopt the second one you have proposed which is always do its good job and robust. At last, correct comes first before speed. :) — andy, May 22 '15 at 02:23

geckon · Answer 2 · 2015-05-21T12:32:27.510

This is a simple solution that might work for you:

import re
import string

def expand(pattern):
    """
    Returns a list of characters that can be matched by the given pattern.
    """
    pattern = pattern[1:-1] # ignore the leading '[' and trailing ']'
    result = []
    lower_range_re = re.compile('[a-z]-[a-z]')
    upper_range_re = re.compile('[A-Z]-[A-Z]')
    digit_range_re = re.compile('[0-9]-[0-9]')

    for match in lower_range_re.findall(pattern):
        result.extend(string.ascii_lowercase[string.ascii_lowercase.index(match[0]):string.ascii_lowercase.index(match[2]) + 1])
    for match in upper_range_re.findall(pattern):
        result.extend(string.ascii_uppercase[string.ascii_uppercase.index(match[0]):string.ascii_uppercase.index(match[2]) + 1])
    for match in digit_range_re.findall(pattern):
        result.extend(string.digits[string.digits.index(match[0]):string.digits.index(match[2]) + 1])
    return result

It should work for patterns like [b-g], [0-3], [G-N], [b-gG-N1-3], etc. It won't work for patterns like [abc], [0123], etc.

score 0 · Answer 3 · edited May 23 '17 at 11:51

0

This solution does not require regex so it might be wrong, but it is possible to:

pattern = '[a-c]'
excludes = '[-]' # Or use includes if that is easier
result = []
for char in pattern:
    if char not in excludes: # if char in includes:
        result.append(char)
        print char

or take a look here: range over character in python

edited May 23 '17 at 11:51

Community

1
1

answered May 21 '15 at 11:33

Rockbot

935
1
6
24

I don't think it does what the OP wants. This just returns the boundaries, doesn't it? – geckon May 21 '15 at 12:33

How to get the corresponding chars with a pattern in Python

3 Answers3