How to reduce the duplicated characters in a string using Python

Question

Is there a way to reduce duplicated characters to a specific number? For example, suppose we have this string:

"I liiiiked it, thaaaaaaank you"

Expected output: "I liiiiked it, thaaaank you"

So if a character is repeated more than 4 times in a row, for example, the run should be reduced to only four characters, and if it is repeated 4 times or fewer, the word should stay the same.

Have you tried writing some code to solve this yet? If so, you should edit it into your question. — Marius, Jul 17 '13 at 01:51
Thanks for your help, I didn't really write the code. I just thought about splitting the string into words, then each word into a list of characters, and then iterating through these characters! But that's manual and will probably take a long time, especially since my data is really large. So is there any other, easier way? Or a pattern? — user2490790, Jul 17 '13 at 01:58
You could do it either by a loop or by a regex. This is your homework, right? Maybe you need to put in some work here yourself, if you hope to learn. — Thomas W, Jul 17 '13 at 02:00
Thank you, I'm working on a project and this might help me. I just wanted to make sure that there is another way of doing it rather than the loop. Is it not possible to do it with a regex? — user2490790, Jul 17 '13 at 02:07
Maybe not a single regular expression, but single characters are certainly workable using a regular expression: `re.sub(r'a{4,}', 'a', "I liiiiked it, thaaaaaaank you")` will produce `'I liiiiked it, thank you'`. — ChrisP, Jul 17 '13 at 02:21
@ChrisP, see the second part of my answer for the extra trick needed to get that to work. — John La Rooy, Jul 17 '13 at 03:06
Thank you so much for your help, Chris and gnibbler, that's really amazing — user2490790, Jul 17 '13 at 03:36

John La Rooy · Accepted Answer · 2013-07-17T12:29:57.133

12

>>> import re
>>> s="I liiiiked it, thaaaaaaank you"
>>> re.sub(r"(.)(\1{3})(\1+)", r"\1\2", s)
'I liiiiked it, thaaaank you'

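This regular expression looks for 3 groups.

The first is any character. The second is 3 more of that same character, and the third is one or more further copies of the first character. Together they match a run of 5 or more identical characters.

Those 3 groups are then replaced by just group 1 and group 2, which leaves the first four characters of the run.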

Here is an even simpler method

>>> re.sub(r"(.)\1{4,}", r"\1"*4, s)
'I liiiiked it, thaaaank you'

This time there is just one group `(.)`, which is the first letter of the match. This must be followed by the same letter 4 or more times: `\1{4,}`. So it matches 5 or more of the same letter. The replacement is just that letter repeated 4 times.
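If the limit needs to vary, the same idea can be wrapped in a function. This is a minimal sketch, assuming a hypothetical helper name `limit_repeats` and a `max_reps` parameter that do not appear in the original answer.

```python
import re

def limit_repeats(text, max_reps=4):
    """Shorten every run of one repeated character to at most max_reps."""
    if max_reps < 1:
        raise ValueError("max_reps must be at least 1")
    # (.) captures one character; \1{n,} requires at least n more copies,
    # so the pattern only matches runs of max_reps + 1 or more.
    pattern = r"(.)\1{%d,}" % max_reps
    return re.sub(pattern, lambda m: m.group(1) * max_reps, text)

print(limit_repeats("I liiiiked it, thaaaaaaank you"))
# I liiiiked it, thaaaank you
```

Note that `.` does not match a newline by default, so runs of blank lines are left alone unless `re.DOTALL` is passed as well.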

edited Jul 17 '13 at 12:29

answered Jul 17 '13 at 02:42

John La Rooy


WOW!! thaaaank you SO much, that covers everything in a single line!... highly appreciated.. – user2490790 Jul 17 '13 at 02:56
Awesome - I knew there had to be a single line regex. Can you please explain the 2nd regex a bit like you did the first? – NG. Jul 17 '13 at 12:24

score 2 · Answer 2 · edited May 23 '17 at 12:05

You can do this with a single scan through the input string: keep a count of how many times the current character has repeated, and don't add it to the output once you've got too many repeats:

input_string = "I liiiiked it, thaaaaaaank you"

max_reps = 4
prev_char = None
rep_count = 0
output = ""

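for char in input_string:
    if char != prev_char:
        # A new run starts: reset the counter and keep the character.
        rep_count = 1
        prev_char = char
        output += char
    elif rep_count < max_reps:
        # Still within the allowed number of repeats.
        rep_count += 1
        output += char
    # Otherwise the run is already max_reps long, so the character is dropped.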

A version that's possibly faster by avoiding string concatenation (see this question):

input_string = "I liiiiked it, thaaaaaaank you"

max_reps = 4
prev_char = None
rep_count = 0
output_list = []

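for char in input_string:
    if char != prev_char:
        # A new run starts: reset the counter and keep the character.
        rep_count = 1
        prev_char = char
        output_list.append(char)
    elif rep_count < max_reps:
        # Still within the allowed number of repeats.
        rep_count += 1
        output_list.append(char)
    # Otherwise the run is already max_reps long, so the character is dropped.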

output = ''.join(output_list)
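The same single-pass idea can also be written with `itertools.groupby`, which splits the string into runs of identical characters. This is an alternative sketch rather than the method above; the function name `limit_repeats_groupby` is hypothetical, and no claim is made here about which version is faster.

```python
from itertools import groupby, islice

def limit_repeats_groupby(text, max_reps=4):
    """Keep at most max_reps characters from each run of identical characters."""
    pieces = []
    for _char, run in groupby(text):
        # islice takes the first max_reps items of the run; groupby
        # skips whatever is left when it moves on to the next run.
        pieces.extend(islice(run, max_reps))
    return "".join(pieces)

print(limit_repeats_groupby("I liiiiked it, thaaaaaaank you"))
# I liiiiked it, thaaaank you
```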

Thank you, that's almost the same as what I'm trying, but the problem is that this way takes a really long time... I highly appreciate your help — user2490790, Jul 17 '13 at 03:00
@user2490790: The speed might be to do with the way Python handles strings, as discussed in the question I've linked. You could give the new list-based version a try, but I can't make any guarantees that it'll actually be faster. — Marius, Jul 17 '13 at 03:40

score 1 · Answer 3 · answered Jul 17 '13 at 02:31

Not the best solution - my regex needs to be fixed... I think

import re

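def rep(o):
    # o.group(0) is the whole run of repeated characters.
    g = o.group(0)
    if len(g) > 4:
        # Keep the first four characters of an over-long run.
        return g[:4]
    return g

foo = 'iiiiiiii liiiiiiikkkkkkkkkeeeee fooooooddd'
foo1 = re.sub(r'(\w)\1+', rep, foo)
print(foo1)

# iiii liiiikkkkeeee fooooddd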

You can probably start tinkering with this if you are so inclined.
