Python regular expression to remove space and capitalize letters where the space was?

Question

I want to create a list of tags from a single user-supplied input box, separated by commas, and I'm looking for some expression(s) that can help automate this.

What I want is to supply the input field and:

remove all double+ whitespace, tabs, and newlines (leaving just single spaces)
remove ALL (single and double+) quotation marks, except for commas, of which there can be only one
in between each comma, I want Something Like Title Case, but excluding the first word and not at all for single words, so that when the last spaces are removed, the tag comes out as 'somethingLikeTitleCase' or just 'something' or 'twoWords'
and finally, remove all remaining spaces

Here's what I have gathered around SO so far:

import re

def no_whitespace(s):
    """Remove all whitespace & newlines."""
    return re.sub(r"(?m)\s+", "", s)


# remove spaces, newlines, all whitespace
# http://stackoverflow.com/a/42597/523051

tag_list = ''.join(no_whitespace(tags_input))

# split into a list at commas

tag_list = tag_list.split(',')

# remove any empty strings (since I currently don't know how to remove double commas)
# http://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings

tag_list = filter(None, tag_list)

I'm lost, though, when it comes to modifying that regex to remove all the punctuation except commas, and I don't even know where to begin with the capitalizing.

Any thoughts to get me going in the right direction?

As suggested, here are some sample inputs = desired_outputs

form: 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps' should come out as ['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

Could you give some example inputs and outputs that we can use to help understand the problem and to test our solutions? — Mark Byers, Aug 22 '12 at 21:33
i have added desired input/output to the bottom of the question. — chrickso, Aug 22 '12 at 21:59
can you explain why `'secondcomment'` isn't `'seCondcomment'` or `'no!punc$$'` isn't `nopunch`? — Burhan Khalid, Aug 22 '12 at 22:04
secondcomment has no spaces/punctuation so is left unchanged. 'no!punc$$' should come out nopunc according to my original description, but on thinking of outputs I decided i'd like them treated like spaces if surrounded by words. — chrickso, Aug 22 '12 at 22:05
are you treating all non-letters as spaces, for instance should "no!pUnc$$ " be noPunc or noPUnc? — MWB, Aug 22 '12 at 22:06
ya your right, if punc's are spaces, caps should not be preserved. "no!pUnc$$ " = "noPunc" — chrickso, Aug 23 '12 at 00:13

Antal Spector-Zabusky · Accepted Answer · 2012-09-08T22:52:33.127

Here's an approach to the problem (that doesn't use any regular expressions, although there's one place where it could). We split up the problem into two functions: one function which splits a string into comma-separated pieces and handles each piece (parseTags), and one function which takes a string and processes it into a valid tag (sanitizeTag). The annotated code is as follows:

# This function takes a string with commas separating raw user input, and
# returns a list of valid tags made by sanitizing the strings between the
# commas.
def parseTags(str):
    # First, we split the string on commas.
    rawTags = str.split(',')

    # Then, we sanitize each of the tags.  If sanitizing gives us back None,
    # then the tag was invalid, so we leave those cases out of our final
    # list of tags.  We can use None as the predicate because sanitizeTag
    # will never return '', which is the only falsy string.
    return list(filter(None, map(sanitizeTag, rawTags)))

# This function takes a single proto-tag---the string in between the commas
# that will be turned into a valid tag---and sanitizes it.  It either
# returns an alphanumeric string (if the argument can be made into a valid
# tag) or None (if the argument cannot be made into a valid tag; i.e., if
# the argument contains only whitespace and/or punctuation).
def sanitizeTag(str):
    # First, we turn non-alphanumeric characters into whitespace.  You could
    # also use a regular expression here; see below.
    str = ''.join(c if c.isalnum() else ' ' for c in str)

    # Next, we split the string on spaces, ignoring leading and trailing
    # whitespace.
    words = str.split()

    # There are now three possibilities: there are no words, there was one
    # word, or there were multiple words.
    numWords = len(words)
    if numWords == 0:
        # If there were no words, the string contained only spaces (and/or
        # punctuation).  This can't be made into a valid tag, so we return
        # None.
        return None
    elif numWords == 1:
        # If there was only one word, that word is the tag, no
        # post-processing required.
        return words[0]
    else:
        # Finally, if there were multiple words, we camel-case the string:
        # we lowercase the first word, capitalize the first letter of all
        # the other words and lowercase the rest, and finally stick all
        # these words together without spaces.
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

And indeed, if we run this code, we get:

>>> parseTags("tHiS iS a tAg, \t\n!&#^ , secondcomment , no!punc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'secondcomment', 'noPunc', 'ifNOSPACESthenPRESERVEcaps']

There are two points in this code that are worth clarifying. First is the use of str.split() in sanitizeTag. This will turn ' a b c ' into ['a','b','c'], whereas str.split(' ') would produce ['','a','b','c','']. This is almost certainly the behavior you want, but there's one corner case. Consider the string tAG$. The $ gets turned into a space, which is then stripped out by the split; thus, this gets turned into tAG instead of tag. This is probably what you want, but if it isn't, you have to be careful. What I would do is change that line to words = re.split(r'\s+', str), which will split the string on whitespace but leave in the leading and trailing empty strings; however, I would also change parseTags to use rawTags = re.split(r'\s*,\s*', str). You must make both these changes; 'a , b , c'.split(',') becomes ['a ', ' b ', ' c'], which is not the behavior you want, whereas r'\s*,\s*' deletes the space around the commas too. If you ignore leading and trailing whitespace, the difference is immaterial; but if you don't, then you need to be careful.
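To make the difference between these splitting strategies concrete, here is a small sketch. The sample strings and variable names are only illustrations and are not part of the answer's code.

```python
import re

padded = ' a b c '

# split() with no argument drops leading/trailing whitespace and
# collapses runs of whitespace.
no_arg = padded.split()
print(no_arg)        # ['a', 'b', 'c']

# split(' ') keeps an empty string for the leading and trailing space.
on_space = padded.split(' ')
print(on_space)      # ['', 'a', 'b', 'c', '']

# re.split keeps the trailing empty string, so 'tAG ' counts as two "words".
with_regex = re.split(r'\s+', 'tAG ')
print(with_regex)    # ['tAG', '']

# Splitting on commas together with any whitespace around them.
on_commas = re.split(r'\s*,\s*', 'a , b , c')
print(on_commas)     # ['a', 'b', 'c']
```

Because 'tAG ' yields two pieces under re.split, it would take the multi-word branch of sanitizeTag and come out lowercased, which is the corner case described above.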

Finally, there's the non-use of regular expressions, and instead the use of str = ''.join(c if c.isalnum() else ' ' for c in str). You can, if you want, replace this with a regular expression. (Edit: I removed some inaccuracies about Unicode and regular expressions here.) Ignoring Unicode, you could replace this line with

str = re.sub(r'[^A-Za-z0-9]', ' ', str)

This uses [^...] to match everything but the listed characters: ASCII letters and numbers. However, it's better to support Unicode, and it's easy, too. The simplest such approach is

str = re.sub(r'\W', ' ', str, flags=re.UNICODE)

Here, \W matches non-word characters; a word character is a letter, a number, or the underscore. With flags=re.UNICODE specified (not available before Python 2.7; you can instead use r'(?u)\W' for earlier versions and 2.7), letters and numbers are both any appropriate Unicode characters; without it, they're just ASCII. If you don't want the underscore, you can add |_ to the regex to match underscores as well, replacing them with spaces too:

str = re.sub(r'\W|_', ' ', str, flags=re.UNICODE)

This last one, I believe, matches the behavior of my non-regex-using code exactly.
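A quick way to check that claim on a few inputs is sketched below. The helper names spaces_by_isalnum and spaces_by_regex and the sample strings are hypothetical, chosen only for this comparison; agreeing on these samples does not prove the two approaches agree on every possible character.

```python
import re

def spaces_by_isalnum(s):
    # Non-regex version: keep alphanumerics, turn everything else into a space.
    return ''.join(c if c.isalnum() else ' ' for c in s)

def spaces_by_regex(s):
    # Regex version: \W matches non-word characters, |_ adds the underscore.
    return re.sub(r'\W|_', ' ', s, flags=re.UNICODE)

samples = ['no!punc$$', 'snake_case tag', 'caf\u00e9 cr\u00e8me!']
for s in samples:
    # Print both results side by side so any difference is visible.
    print(repr(spaces_by_isalnum(s)), repr(spaces_by_regex(s)))
```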

Also, here's how I'd write the same code without those comments; this also allows me to eliminate some temporary variables. You might prefer the code with the variables present; it's just a matter of taste.

def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    words    = ''.join(c if c.isalnum() else ' ' for c in str).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        return words[0].lower() + ''.join(w.capitalize() for w in words[1:])

To handle the newly-desired behavior, there are two things we have to do. First, we need a way to fix the capitalization of the first word: lowercase the whole thing if the first letter's lowercase, and lowercase everything but the first letter if the first letter's upper case. That's easy: we can just check directly. Secondly, we want to treat punctuation as completely invisible: it shouldn't uppercase the following words. Again, that's easy—I even discuss how to handle something similar above. We just filter out all the non-alphanumeric, non-whitespace characters rather than turning them into spaces. Incorporating those changes gives us

def parseTags(str):
    return list(filter(None, map(sanitizeTag, str.split(','))))

def sanitizeTag(str):
    words    = ''.join(c for c in str if c.isalnum() or c.isspace()).split()
    numWords = len(words)
    if numWords == 0:
        return None
    elif numWords == 1:
        return words[0]
    else:
        words0 = words[0].lower() if words[0][0].islower() else words[0].capitalize()
        return words0 + ''.join(w.capitalize() for w in words[1:])

Running this code gives us the following output

>>> parseTags("tHiS iS a tAg, AnD tHIs, \t\n!&#^ , se@%condcomment$ , No!pUnc$$, ifNOSPACESthenPRESERVEcaps")
['thisIsATag', 'AndThis', 'secondcomment', 'NopUnc', 'ifNOSPACESthenPRESERVEcaps']

this is working great so far. note: i have modified the elif numWords == 1: to lowercase the first letter only to stay consistant with multi-word tags: return words[0][0].lower()+words[0][1:] — chrickso, Aug 23 '12 at 00:55
hey, i've been using this for awhile now and it's working great. i have decided though that i do not want to change the first letter to lowercase anymore. if it was lower, keep it that way and if Upper, keep it that way. Also, having punctuation capitalize the next word is turning out to be undersirable. I would like if only spaces make a capital. Are these quick/modifications you could make to the above code? thanks! — chrickso, Sep 07 '12 at 19:36
@chrickso: Sure, those are both quick; see my edit. In fact, they're sufficiently quick, I bet you could have figured them out yourself! :-) There's one caveat: your short description of your problems was sufficiently informal that I might have misunderstood something, particularly in an edge case. But I'm confident that you can modify the code I've produced to handle anything I happened to miss in your problem description. Nevertheless, if you've found my answer helpful, why not upvote/accept it? — Antal Spector-Zabusky, Sep 08 '12 at 22:52
looks good! was there an edge case you have already found that would not function properly? you have certainly earned your upvote and you have my gratitude :) — chrickso, Sep 09 '12 at 05:58
There weren't edge cases that didn't work, but there are some underspecified cases where you need to decide which behavior you want (what should `xYZ$` do?). At any rate, I'm glad I could help out! — Antal Spector-Zabusky, Sep 09 '12 at 09:47

score 1 · Answer 2 · answered Aug 22 '12 at 22:44

You could use a white list of characters allowed to be in a word, everything else is ignored:

import re

def camelCase(tag_str):
    words = re.findall(r'\w+', tag_str)
    nwords = len(words)
    if nwords == 1:
        return words[0] # leave unchanged
    elif nwords > 1: # make it camelCaseTag
        return words[0].lower() + ''.join(map(str.title, words[1:]))
    return '' # no word characters

This example uses \w word characters.

Example

tags_str = """ 'tHiS iS a tAg, 'whitespace' !&#^ , secondcomment , no!punc$$, 
ifNOSPACESthenPRESERVEcaps' """
print("\n".join(filter(None, map(camelCase, tags_str.split(',')))))

Output

thisIsATag
whitespace
secondcomment
noPunc
ifNOSPACESthenPRESERVEcaps
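
One thing to keep in mind with \w is that it also matches digits and the underscore, so a piece like snake_case stays together as a single word. The sketch below shows the difference, using [^\W_] as one possible way to exclude the underscore; the sample string is only an illustration.

```python
import re

raw = 'snake_case tag 2'

# \w+ treats the underscore as part of a word.
with_underscore = re.findall(r'\w+', raw)
print(with_underscore)      # ['snake_case', 'tag', '2']

# [^\W_]+ matches word characters except the underscore.
without_underscore = re.findall(r'[^\W_]+', raw)
print(without_underscore)   # ['snake', 'case', 'tag', '2']
```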

score 0 · MWB · Answer 3 · 2012-08-22T22:28:42.300

I think this should work

import re

def toCamelCase(s):
  # remove all punctuation
  # modify to include other characters you may want to keep
  s = re.sub(r"[^a-zA-Z0-9\s]", "", s)

  # remove leading spaces
  s = re.sub(r"^\s+", "", s)

  # camel case
  s = re.sub(r"\s[a-z]", lambda m: m.group(0)[1].upper(), s)

  # remove all punctuation and spaces
  s = re.sub(r"[^a-zA-Z0-9]", "", s)
  return s

tag_list = [s for s in (toCamelCase(s.lower()) for s in tag_list.split(',')) if s]

The key here is to use re.sub to make the replacements you want.

EDIT : Doesn't preserve caps, but does handle uppercase strings with spaces

EDIT : Moved "if s" after the toCamelCase call
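
As a minimal sketch of the re.sub idea with a replacement function: the version below handles only the whitespace-to-capital step in a single substitution. The name camel_join is hypothetical, and it assumes punctuation has already been removed and that the rest of each word should be left as typed.

```python
import re

def camel_join(s):
    # Replace each run of whitespace plus the character after it with
    # that character uppercased; strip() first so the edges don't count.
    return re.sub(r'\s+(\S)', lambda m: m.group(1).upper(), s.strip())

print(camel_join('this is a tag'))    # thisIsATag
print(camel_join('  two   words '))   # twoWords
```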

edited Aug 22 '12 at 22:28

answered Aug 22 '12 at 22:03


this is really close! note: it does not filter(None) entries – chrickso Aug 22 '12 at 22:16
It filtered out strings that were empty before calling toCamelCase... should be fixed now – MWB Aug 22 '12 at 22:33
