Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags
), and one function which takes a string and processes it into a valid tag (sanitizeTag
). The annotated code is as follows:
# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
# First, we split the string on commas.
rawTags = str.split(',')
# Then, we sanitize each of the tags. If sanitizing gives us back None,
# then the tag was invalid, so we leave those cases out of our final
# list of tags. We can use None as the predicate because sanitizeTag
# will never return '', which is the only falsy string.
return filter(None, map(sanitizeTag, rawTags))
# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it. It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
# First, we turn non-alphanumeric characters into whitespace. You could
# also use a regular expression here; see below.
str = ''.join(c if c.isalnum() else ' ' for c in str)
# Next, we split the string on spaces, ignoring leading and trailing
# whitespace.
words = str.split()
# There are now three possibilities: there are no words, there was one
# word, or there were multiple words.
numWords = len(words)
if numWords == 0:
# If there were no words, the string contained only spaces (and/or
# punctuation). This can't be made into a valid tag, so we return
# None.
return None
elif numWords == 1:
# If there was only one word, that word is the tag, no
# post-processing required.
return words[0]
else:
# Finally, if there were multiple words, we camel-case the string:
# we lowercase the first word, capitalize the first letter of all
# the other words and lowercase the rest, and finally stick all
# these words together without spaces.
return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
And indeed, if we run this code, we get:
>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']
There are two points in this code that it's worth clarifying. First is the use of str.split()
in sanitizeTags
. This will turn a b c
into ['a','b','c']
, whereas str.split(' ')
would produce ['','a','b','c','']
. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$
. The $
gets turned into a space, and is stripped out by the split; thus, this gets turned into tAG
instead of tag
. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str)
, which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags
to use rawTags = re.split(r'\s*,\s*', str)
. You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c']
, which is not the behavior you want, whereas r'\s*,\s*'
deletes the space around the commas too. If you ignore leading and trailing white space, the difference is immaterial; but if you don't, then you need to be careful.
Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str)
. You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with
str = re.sub(r'[^A-Za-z0-9]', ' ', str)
This uses [^...]
to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is
str = re.sub(r'\W', ' ', str, flags=re.UNICODE)
Here, \W
matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE
specified (not available before Python 2.7; you can instead use r'(?u)\W'
for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_
to the regex to match underscores as well, replacing them with spaces too:
str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)
This last one, I believe, matches the behavior of my non-regex-using code exactly.
Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.
def parseTags(str):
return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
words = ''.join(c if c.isalnum() else ' ' for c in str).split()
numWords = len(words)
if numWords == 0:
return None
elif numWords == 1:
return words[0]
else:
return words[0].lower() + ''.join(w.capitalize() for w in words[1:])
To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us
def parseTags(str):
return filter(None, map(sanitizeTag, str.split(',')))
def sanitizeTag(str):
words = filter(lambda c: c.isalnum() or c.isspace(), str).split()
numWords = len(words)
if numWords == 0:
return None
elif numWords == 1:
return words[0]
else:
words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
return words0 + ''.join(w.capitalize() for w in words[1:])
Running this code gives us the following output
>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se@%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']