11

I'm currently building a tool that will have to match filenames against a pattern. For convenience, I intend to provide both lazy matching (in a glob-like fashion) and regexp matching. For example, the following two snippets would eventually have the same effect:

@mylib.rule('static/*.html')
def myfunc():
    pass

@mylib.rule(r'^static/([^/]+)\.html')
def myfunc():
    pass

AFAIK r'' is only useful to the Python parser and it actually creates a standard str instance after parsing (the only difference being that it keeps the \).

Is anybody aware of a way to tell one from another?

I would hate to have to provide two alternate decorators for the same purpose or, worse, resort to manually parsing the string to determine if it's a regexp or not.

martineau
saalaa
  • Needless to say, I have inspected fields of both strings and raw strings, read the docs and Googled before posting. – saalaa May 06 '11 at 19:15
  • 4
    Another option would be to pass a compiled RE, e.g. `@mylib.rule(re.compile(r'^stat...'))` – intuited May 06 '11 at 19:16
  • "manually parsing the string to determine if it's a regexp"? How would you do that? Would you just assume it's a regexp and, if it didn't compile, assume it was an "fnmatch" pattern? How could I override both of these choices to make it clear that my filename really had a `*` in it? What if I had `\\ ` in my filenames and was forced to use `r"` strings to provide an "fnmatch" pattern? What would I do? – S.Lott May 06 '11 at 19:22
  • @intuited Incidentally, this is comparable to the method that [Sinatra](http://www.sinatrarb.com/intro) uses. – Josh Lee May 06 '11 at 19:32
  • @jleedev: Cool! I guess it would be a bit nicer because of ruby having regexp literals. – intuited May 06 '11 at 20:27
  • @S.Lott using heuristics to tell one from another along with a way of letting the user explicitly decide is not terribly difficult. I didn't think too much about it but any string with characters in `[]()` would _almost_ certainly be a regexp under normal circumstances. As long as it's documented, it would be just fine. But as I said, I will use two decorators for greater clarity. – saalaa May 07 '11 at 09:18
  • What I was after was actually some property in the resulting `str` that would help me tell one from another. – saalaa May 07 '11 at 09:21

3 Answers

17

You can't tell them apart. Every raw string literal could also be written as a standard string literal (possibly requiring more quoting) and vice versa. Apart from this, I'd definitely give different names to the two decorators: they do different things, so they shouldn't share a name.

Example (CPython):

>>> a = r'^static/([^/]+)\.html'; b = '^static/([^/]+)\.html'
>>> a is b
True

So in this particular example, the raw string literal and the standard string literal even result in the same string object.

Sven Marnach
  • 4
    Consequently, the entire design idea is flawed. The user *must* explicitly declare if it's fnmatch, regexp or neither. No assumptions, no guessing. – S.Lott May 06 '11 at 19:25
  • Your code example is irrelevant as it doesn't highlight what r'' is actually doing. Also, comparing strings with 'is' might give some surprises, although it seems to work in this simple case. Trying with a = r'^static/([^/]+)\\.html'; b = '^static/([^/]+)\\.html' would be more insightful. Otherwise, I provide two decorators for now. – saalaa May 07 '11 at 09:03
  • 4
    @saalaa: The example is meant to show that raw string literals and standard string literals might be indistinguishable. Using `==` here instead of `is` wouldn't show anything, since for example `"a" == u"a"` yields `True`, while it is perfectly possible to tell `str` objects from `unicode` objects (in Python 2.x). To give a valid counter example here, I had to give an example where the standard string literal and the raw string literal happen to end up being *the same object*. – Sven Marnach May 07 '11 at 11:23
  • 1
    Although I understand your point, there is more behind this than checking for object identity. It's the compiler which recognizes that the two strings end up with the same content and thus -- as an optimization more than a _normal_, documented or expected behaviour (different implementation might not work in the same way) -- ends up using the _same_ object. There's a very [interesting post](http://stackoverflow.com/questions/2858603/python-why-is-keyword-has-different-behavior-when-there-is-dot-in-the-string#answer-2858669) on SO about that subject. – saalaa May 07 '11 at 19:30
  • 2
    @saalaa: I'm perfectly aware of this -- that's why I explicitly noted that this is a CPython example. I'm also aware that the example will give a different result if you simply put the two assignments on two separate lines. But all this does not matter here -- showing a single example in which a standard string literal and a raw string literal end up being the same object proves that you won't be able to tell them apart in general, no matter how this single example comes about. – Sven Marnach May 07 '11 at 21:05
12

You can't tell whether a string was defined as a raw string after the fact. Personally, I would in fact use a separate decorator, but if you don't want to, you could use a named parameter (e.g. @rule(glob="*.txt") for globs and @rule(re=r".+\.txt") for regex).
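
As a rough illustration of the named-parameter idea, here is a minimal sketch. The `rule` function, the `pattern` attribute it sets, and the handler names are all hypothetical, and the keyword is spelled `regex=` rather than `re=` only so that it doesn't shadow the `re` module inside the function. It assumes globs can be handled by converting them with `fnmatch.translate`.

```python
import fnmatch
import re


def rule(glob=None, regex=None):
    """Hypothetical decorator factory: pass exactly one of glob= or regex=."""
    if (glob is None) == (regex is None):
        raise TypeError("pass exactly one of glob= or regex=")
    # fnmatch.translate converts a glob into regular expression source text,
    # so both kinds of rule end up as a compiled pattern.
    source = fnmatch.translate(glob) if glob is not None else regex
    compiled = re.compile(source)

    def decorator(func):
        func.pattern = compiled  # hypothetical attribute, for illustration only
        return func

    return decorator


@rule(glob="*.txt")
def handle_text():
    pass


@rule(regex=r".+\.txt")
def handle_text_re():
    pass
```

With this shape the caller states the intent explicitly, so nothing has to be guessed from the string's content.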

Alternatively, require users to provide a compiled regular expression object if they want to use a regex, e.g. @rule(re.compile(r".+\.txt")) -- this is easy to detect because its type is different.
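
A minimal sketch of the type-based variant might look like the following. Again, `rule`, the `pattern` attribute, and the handler names are made up for illustration; the only assumption is that a plain `str` means a glob and a compiled pattern object means a regex.

```python
import fnmatch
import re

# The compiled-pattern type, obtained in a way that works across Python versions.
PatternType = type(re.compile(""))


def rule(pattern):
    """Hypothetical decorator factory: str means glob, compiled pattern means regex."""
    if isinstance(pattern, PatternType):
        compiled = pattern
    elif isinstance(pattern, str):
        compiled = re.compile(fnmatch.translate(pattern))
    else:
        raise TypeError("expected a glob string or a compiled regular expression")

    def decorator(func):
        func.pattern = compiled  # hypothetical attribute, for illustration only
        return func

    return decorator


@rule("static/*.html")
def from_glob():
    pass


@rule(re.compile(r"^static/([^/]+)\.html"))
def from_regex():
    pass
```

Note that an `fnmatch`-style `*` also matches `/`, so the glob above is looser than the regex with `[^/]+`; whether that matters depends on the tool.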

kindall
3

The term "raw string" is confusing because it sounds like it is a special type of string - when in fact, it is just a special syntax for literals that tells the compiler to do no interpretation of '\' characters in the string. Unfortunately, the term was coined to describe this compile-time behavior, but many beginners assume it carries some special runtime characteristics.

I prefer to call them "raw string literals", to emphasize that what makes them "raw" is the literal syntax, which leaves backslashes uninterpreted, and not anything about the resulting object. Both raw string literals and normal string literals create strings (str instances), and the resulting values are strings like any other. The string created by a raw string literal is equivalent in every way to the same string written as a normal literal with escaped backslashes.
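
A short check, using the pattern from the question, shows the equivalence. The variable names are just for illustration.

```python
raw = r"^static/([^/]+)\.html"       # raw literal: the backslash is kept as typed
escaped = "^static/([^/]+)\\.html"   # normal literal: the backslash is escaped

# Both literals produce equal values of the same type.
assert raw == escaped
assert type(raw) is str
assert type(escaped) is str

# The prefix only changes how the source text is read:
assert len(r"\n") == 2   # a backslash followed by the letter n
assert len("\n") == 1    # a single newline character
```

Nothing on the resulting `str` records which spelling was used in the source.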

PaulMcG