5

I assume there's some beautiful Pythonic way to do this, but I haven't quite figured it out yet. Basically, I'm creating a testing module and would like a nice, simple way for users to define a character set to pull from. I could concatenate the various character sets defined in the `string` module, but that strikes me as a very unclean solution. Is there any way to get the set of characters that a regex character class represents?

Example:

def foo(regex_set):
    re.something(re.compile(regex_set))

foo("[a-z]")
>>> abcdefghijklmnopqrstuvwxyz

The compile is of course optional, but in my mind that's what this function would look like.

Slater Victoroff
  • Is the regex guaranteed to match one code-point or do you want the minimal alphabet that covers all symbols in the language specified by the regex? – Mike Samuel Jul 08 '13 at 19:33
  • I'm pretty sure you can't do that, at least not cleanly. If it's just one char you could brute-force it, but that's gross. Why not just use `string.ascii_lowercase`, etc.? – Joran Beasley Jul 08 '13 at 19:34
  • You'd need to create your own parser, and you'd probably only want to support a subset of regex syntax. I assume `[a-z](?<![a-hj-z])` isn't something you'd want to support. (That's an obfuscated way of saying `[i]`, in case you don't recognize the syntax.) – JDB Jul 08 '13 at 19:34
  • @Cyborgx37 the thought was that this would be exclusively for single character sets. – Slater Victoroff Jul 08 '13 at 19:35
  • You might want to look at [this](http://stackoverflow.com/a/22133/1578604) or [this](http://code.google.com/p/xeger/). – Jerry Jul 08 '13 at 19:36
  • 2
    Then just create your own syntax: `az` would mean "a to z". `aa` would mean "just a". That's not hard to do in any language. – JDB Jul 08 '13 at 19:37
  • @JoranBeasley if this was solely internal I would, but this is user-facing and I would prefer to have users make use of a very simple way to define character sets rather than have to learn Python's specific flavor of this. – Slater Victoroff Jul 08 '13 at 19:38
  • @Cyborgx37 And then, if you have need of other special character sets, just add other characters to your syntax, like `!` could mean consonants, and `@` could mean vowels, `A-B` could mean everything in `A` and not in `B`, `A|B` everything in `A` or `B`, `A&B` everything in `A` and `B`, `A^B` everything in `A` or `B` but not both, etc. – AJMansfield Jul 08 '13 at 19:39
  • Have you seen [Random string that matches a regexp](http://stackoverflow.com/q/205411) and [How to generate random strings that match a given regexp?](http://stackoverflow.com/q/748253) – jscs Jul 08 '13 at 19:39
  • @Cyborgx37 for simple example like that it's a pretty trivial problem and I certainly could build my own implementation, but I would actually like this to be extensible and function like regex character sets for the purpose of having a usable ux, and there are enough special cases that I would rather not reinvent the wheel on this one. – Slater Victoroff Jul 08 '13 at 19:43
  • @SlaterTyranus `but this is user-facing and I would prefer to have users make use of a very simple way to define character sets...` and you think that regex patterns are the simple way for users to define this? this sounds like you are asking for trouble – Joran Beasley Jul 08 '13 at 19:48
  • @JoranBeasley users being technically competent people, I personally would rather use something that is extremely prevalent and well-documented than something I made up for the occasion. – Slater Victoroff Jul 08 '13 at 19:49
  • 2
    @SlaterTyranus Have a list of letters, each with a check box next to it. Simple, prevalent, well documented functionality. – AJMansfield Jul 08 '13 at 19:52
  • @AJMansfield Brilliant! Straight up UI innovation there. – Slater Victoroff Jul 08 '13 at 19:52
  • I agree with AJ's solution, or really anything other than a regex. What happens when a user enters `"[a-z][0-9]?."`? Or maybe you don't actually want to invert the regex. I'm not sure why you want the charset from the regex; that might be the part that is wrong. Regardless, this sounds like a bad idea to me. – Joran Beasley Jul 08 '13 at 19:53

4 Answers

9

Paul McGuire, author of Pyparsing, has written an inverse regex parser, with which you could do this:

import invRegex
print(''.join(invRegex.invert('[a-z]')))
# abcdefghijklmnopqrstuvwxyz

If you do not want to install Pyparsing, there is also a regex inverter that uses only modules from the standard library with which you could write:

import inverse_regex
print(''.join(inverse_regex.ipermute('[a-z]')))
# abcdefghijklmnopqrstuvwxyz

Note: neither module can invert all regex patterns.


And there are differences between the two modules:

import invRegex
import inverse_regex
print(repr(''.join(invRegex.invert('.'))))
print(repr(''.join(inverse_regex.ipermute('.'))))

yields

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

Here is another difference; this time the Pyparsing-based invRegex enumerates a larger set of matches:

x = list(invRegex.invert('[a-z][0-9]?.'))
y = list(inverse_regex.ipermute('[a-z][0-9]?.'))
print(len(x))
# 26884
print(len(y))
# 1100

unutbu
2

A regex is not needed here. If you want users to select a character set, let them just pick characters. As I said in my comment, simply listing all the characters and putting checkboxes by them would be sufficient. If you want something that is more compact, or just looks cooler, you could do something like one of these:

[Image: one way of displaying the letter selection (green = selected)]
[Image: another way of displaying the letter selection (no x = selected)]
[Image: yet another way of displaying the letter selection (black bg = selected)]

Of course, if you actually use this, what you come up with will undoubtedly look better than these (and will actually have all the letters in it, not just "A").

If you need, you could include a button to invert the selection, select all, clear selection, save selection, or anything else you need to do.

AJMansfield
1

If it's just simple ranges, you could parse them manually:

def range_parse(rng):
    # Split a spec such as "a-z" into its two endpoint characters.
    start, end = rng.split("-")
    return "".join(chr(i) for i in range(ord(start), ord(end) + 1))

print(range_parse("a-z") + range_parse("A-Z"))

But it's gross...

Joran Beasley
0

Another solution I thought of to simplify the problem:

Stick your own [ and ] on the line as part of the prompt, and disallow those characters in the input. After you scan the input and verify that it contains nothing matching [\[\]], you can prepend [ and append ] to the string and use it as a regex against a string of all the characters needed ("abcdefghijklmnopqrstuvwxyz", for instance).
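A minimal sketch of that idea, using only the standard library. The function name `charset_from_class` and the choice of `string.printable` as the default candidate alphabet are my own assumptions, not something from the question; swap in whatever alphabet your tests need.

```python
import re
import string


def charset_from_class(body, alphabet=string.printable):
    """Return the characters of `alphabet` matched by the class [body].

    `body` is what the user types between the brackets, e.g. "a-z".
    (Hypothetical helper; the name and defaults are illustrative.)
    """
    # Reject brackets so the user cannot close the class early.
    if not body or "[" in body or "]" in body:
        raise ValueError("class body must be non-empty and contain no '[' or ']'")
    # Wrap the input in our own brackets; re.error propagates if the
    # body is still not a valid class (e.g. a lone "^").
    pattern = re.compile("[" + body + "]")
    # Test every candidate character individually against the class.
    return "".join(ch for ch in alphabet if pattern.fullmatch(ch))


print(charset_from_class("a-z"))
# abcdefghijklmnopqrstuvwxyz
```

The result follows the order of the candidate alphabet rather than the order the user typed, and it can only ever contain characters that are in that alphabet, so this is a brute-force filter and not a true regex inverter.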

AJMansfield