How to determine if all substrings in a string contain duplicates?

Question

I'm facing this issue:

I need to remove the duplication from the beginning of each word of a text, but only if every word in the text is duplicated this way. The result should then be capitalized.

Examples:

text = str("Thethe cacar isis momoving vvery fasfast")

So this text should be treated and printed as:

    output: "The car is moving very fast"

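I have these two ways to treat the text:

import re

phrase = str("Thethe cacar isis momoving vvery fasfast")
# Upper-case first so the case-sensitive backreference matches "The" against "the"
phrase_up = phrase.upper()
# Collapse any chunk that is immediately repeated into a single copy
text = re.sub(r'(.+?)\1+', r'\1', phrase_up)
text_cap = text.capitalize()
print(text_cap)  # The car is moving very fast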

Or:

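def remove_duplicates(word):
    # Keep the first occurrence of every letter, in original order.
    # Note: this drops all repeated letters, not only a duplicated prefix.
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)
    return ''.join(sorted_letters)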

words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)

What I can't work out is how to determine whether a text needs this treatment. For example, we might get a text such as:

"This meme is funny, said Barbara"

Here, even though "meme" and "Barbara" (ar - ar) contain repeating substrings, not all words do, so this text shouldn't be treated.

Any pointers here?

the answer is you can't without external knowledge, such as a dictionary. you need something that can tell you "ok this is a legal word, do not check the regex" — juuso, Dec 16 '21 at 14:01

juuso · Answer 1 · 2021-12-16T14:11:35.750

I would suggest adopting a solution that checks whether a word is legal, using something like what is described in this post's best answer. If the word is not an English word, then you should use the regex.

For example, a word like "meme" should be in the English dictionary, so you should not check it for repetitions.

So I would first split the string on spaces to get the tokens, then check whether each token is an English word. If it is, skip the regex check. Otherwise, check for repetitions.
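The steps above could be sketched roughly as follows. This is only a minimal illustration: `KNOWN_WORDS` and `is_english_word` are hypothetical stand-ins for a real dictionary lookup (a word list file or a spell-checking library), and `fix_token` / `fix_text` are names made up for this example.

```python
import re

# Hypothetical stand-in for a real English dictionary (assumption for this sketch).
KNOWN_WORDS = {"this", "meme", "is", "funny", "said", "barbara",
               "the", "car", "moving", "very", "fast"}

# A chunk at the start of a token that is immediately repeated, e.g. "Thethe" or "vvery".
# IGNORECASE lets the backreference match "The" against "the".
LEADING_REPEAT = re.compile(r"^(.+?)\1+", re.IGNORECASE)


def is_english_word(token):
    """Assumed dictionary check: ignore surrounding punctuation and case."""
    return token.strip(".,;:!?\"'").lower() in KNOWN_WORDS


def fix_token(token):
    """Leave legal words alone; otherwise collapse the repeated leading chunk."""
    if is_english_word(token):
        return token
    return LEADING_REPEAT.sub(r"\1", token)


def fix_text(text):
    """Split on whitespace, fix each token, and join the tokens back together."""
    return " ".join(fix_token(token) for token in text.split())


print(fix_text("Thethe cacar isis momoving vvery fasfast"))
print(fix_text("This meme is funny, said Barbara"))
```

With this toy word list, the first call should print "The car is moving very fast" and the second should leave the sentence unchanged, because "meme" and "Barbara" are found in the dictionary and never reach the regex. The quality of the result depends entirely on the dictionary used.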
