Partitioning Multiple special characters in python

Question

I am trying to write a program that reads a paragraph and counts the special characters and words.

My input:

list words ="'He came,"
words = list words. partition("'")

for i in words:
    list-1. extend(i.split())

print(list-1)

my output looks like this:

["'", 'He', 'came,']

but I want

["'", 'He', 'came', ',']

Can anyone help me with how to do this?

Is this your actual code? Because `list words ="'He came,"` is not valid Python syntax, and neither is `list-1.extend`. — Kevin, Oct 21 '14 at 14:12
possible duplicate of [Splitting a string into words and punctuation](http://stackoverflow.com/questions/367155/splitting-a-string-into-words-and-punctuation) — Adam Smith, Oct 21 '14 at 14:34
Actually no, these two questions have the same answer but are not the same question. I retracted my close vote. — Adam Smith, Oct 21 '14 at 14:43
Please update your question to elaborate on what you mean by "special characters". From your example it seems that you mean punctuation characters. Anyway, [*"all* characters are special"](http://stackoverflow.com/questions/9727097/how-to-match-with-regex-all-special-chars-except-in-php#comment12368164_9727097). — tripleee, Oct 21 '14 at 14:53

score 0 · Answer 1 · answered Oct 21 '14 at 15:08

I am trying to write a program that reads a paragraph and counts the special characters and words.

Let's focus on the goal, then, rather than your approach. Your approach is probably possible, but it may take a bunch of splits, so let's ignore it for now. Using `re.findall` with a regex that matches either a word or a single punctuation character should work much better.

import re

lst = re.findall(r"\w+|[^\w\s]", some_sentence)

would make sense. Broken down, it does the following:

pat = re.compile(r"""
    \w+        # one or more word characters
    |          #   OR
    [^\w\s]    # exactly one character that's neither a word character nor whitespace
    """, re.X)

results = pat.findall('"Why, hello there, Martha!"')
# ['"', 'Why', ',', 'hello', 'there', ',', 'Martha', '!', '"']
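Applied to the input from the question, the same pattern gives the output that was asked for. This is a minimal sketch; the variable names `sentence` and `tokens` are chosen here for illustration, and the question's sample is rewritten as a valid string literal:

```python
import re

# The asker's sample input, written as a valid Python string literal.
sentence = "'He came,"

# Match either a run of word characters or one non-word, non-space character.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)  # ["'", 'He', 'came', ',']
```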

However, you would then have to iterate over the list a second time to count the special characters. Let's separate them instead. Luckily this is easy: just add capturing groups (parentheses).

new_pat = re.compile(r"""
    (          # begin capture group
        \w+        # one or more word characters
    )          # end capturing group
    |          #   OR
    (          # begin capture group
        [^\w\s]    # exactly one character that's neither a word character nor whitespace
    )          # end capturing group
    """, re.X)

results = new_pat.findall('"Why, hello there, Martha!"')
# [('', '"'), ('Why', ''), ('', ','), ('hello', ''), ('there', ''), ('', ','), ('Martha', ''), ('', '!'), ('', '"')]

grouped_results = {"words":[], "punctuations":[]}

for word,punctuation in results:
    if word:
        grouped_results['words'].append(word)
    if punctuation:
        grouped_results['punctuations'].append(punctuation)
# grouped_results = {'punctuations': ['"', ',', ',', '!', '"'],
#                    'words': ['Why', 'hello', 'there', 'Martha']}

Then just count the items stored under each dict key.

>>> for key in grouped_results:
...     print("There are {} items in {}".format(
...         len(grouped_results[key]),
...         key))
...

There are 5 items in punctuations
There are 4 items in words
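
If only the counts are needed, the intermediate lists can probably be skipped. The following is a minimal sketch, not part of the code above: it assumes named groups `word` and `punct` (names chosen here for illustration) and uses `Match.lastgroup` to tell which alternative matched, tallying the results with `collections.Counter`.

```python
import re
from collections import Counter

# Named groups (hypothetical names) label which alternative matched.
pat = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")

text = '"Why, hello there, Martha!"'

# lastgroup is the name of the group that matched: "word" or "punct".
counts = Counter(m.lastgroup for m in pat.finditer(text))

for kind in ("word", "punct"):
    print("There are {} items of kind {}".format(counts[kind], kind))
```

This trades the grouped lists for a single pass over the matches, so it only fits when the tokens themselves are not needed afterwards.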

I know there must be a fancy zip trick I could do to get `results` to turn into a list of `filter(None`'d lists, one with punctuation one without, but I can't think of it at the moment. — Adam Smith, Oct 21 '14 at 15:10
Since you asked so nicely: `words, punctuation = (list(filter(bool, v)) for v in zip(*results))`. — poke, Oct 21 '14 at 15:14
@poke thanks! I was going to use `zip(*results)` to swap the dimensions of the list, but wanted to filter in one step so decided to go with the loop. TBH the for loop is more readable anyhow :) — Adam Smith, Oct 21 '14 at 15:16
You can’t really filter first before transposing since you can’t really pair up the filtered elements again (e.g. this result would result a list of pairs like this: `[('Why', '"'), ('hello', ','), ('there', ','), ('Martha', '!'), ('????', '"')]` – but the last element has no word) — poke, Oct 21 '14 at 15:21
@poke That's true. I was hoping to filter WHILE transposing but that doesn't make much more sense. Something like `[filter(None,[word]), filter(None,[punctuation]) for word,punctuation in results]`, with my resulting list being `[['these','are','words'], ['this','is','punctuation']]` — Adam Smith, Oct 21 '14 at 23:16
