How to efficiently remove consecutive duplicate words or phrases in a string

Question

I have a string which has reoccurring phrases or it might even be a single word which is occurring multiple times continuously.

Tried various methods but unable to find a better approach that is time and space-efficient.

Here are the approaches I tried

groupby()
re

String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
s1 = " ".join([k for k,v in groupby(String.replace("&lt;/Sent&gt;","").split())])
s2 = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', String)

both of them doesn't seem to work in my case

My expected result:

what type of people were most likely to be able to be 1.35 ?

These are some posts that I referred to

Is there a way to remove duplicate and continuous words/phrases in a string? - Doesn't work
How can I remove duplicate words in a string with Python? - Works partially but need an optimal way for large strings also

Please don't flag my question as a duplicate with the posts above as I tried most of the implementations and didn't find an efficient solution.

this may be related: https://stackoverflow.com/questions/11090289/find-longest-repetitive-sequence-in-a-string — hiro protagonist, Aug 09 '19 at 06:44
Thanks @hiroprotagonist it helps, But it doesn't solve my question. — , Aug 09 '19 at 06:48
I believe this is quite a hard problem, especially since the string is not limited any number of characters, and you have to to check pretty much every combination to repeated occurences. — WiseDev, Aug 09 '19 at 06:50
@BlueRineS Yes i agree, Do you have an approach to solve this that can help me ? — , Aug 09 '19 at 06:53
I'm not sure, maybe we need to use better search key words on the net. I would guess there is some kind of algorithm that solves this, but we simply don't know its name. — WiseDev, Aug 09 '19 at 06:55
Btw, what is your output if you simply use the regex method as suggested in your first link? Why doesn't it work? — WiseDev, Aug 09 '19 at 06:56
Yeah I see. that problem is that you have repeated phrases, not just words. — WiseDev, Aug 09 '19 at 07:03
"*Please dont flag my question as a duplicate with the posts above.*" Why? The problem in the first link seems to be exactly the same as yours. You commented *"Doesn't work"* but this is not helpful at all. There are many solutions there. Have you tried them all? How are they failing in your case? Please, provide more details. — Georgy, Aug 09 '19 at 08:38
@Georgy tried everything but nothing seems to be much effective as the accepted solution but it lacks the time complexity which is O(N²). but the already provided solutions here are much better when compared to what i found there and Harm Campmans has a brilliant solution which makes the solution to the problem even better. — , Aug 09 '19 at 09:52
And no one was talking about the example which @BlueRineS pointed out earlier. — , Aug 09 '19 at 09:53

Harm Campmans · Accepted Answer · 2019-08-09T08:19:50.437

I'd go for this creative method of looking for duplicates of growing length:

input = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
def combine_words(input,length):
    combined_inputs = []
    if len(splitted_input)>1:
        for i in range(len(input)-1):
            combined_inputs.append(input[i]+" "+last_word_of(splitted_input[i+1],length)) #add the last word of the right-neighbour (overlapping) sequence (before it has expanded), which is the next word in the original sentence
    return combined_inputs, length+1

def remove_duplicates(input, length):
    bool_broke=False #this means we didn't find any duplicates here
    for i in range(len(input) - length):
        if input[i]==input[i + length]: #found a duplicate piece of sentence!
            for j in range(0,length): #remove the overlapping sequences in reverse order
                del input[i + length - j]
            bool_broke = True
            break #break the for loop as the loop length does not matches the length of splitted_input anymore as we removed elements
    if bool_broke:
        return remove_duplicates(input, length) #if we found a duplicate, look for another duplicate of the same length
    return input

def last_word_of(input,length):
    splitted = input.split(" ")
    if len(splitted)==0:
        return input
    else:
        return splitted[length-1]

#make a list of strings which represent every sequence of word_length adjacent words
splitted_input = input.split(" ")
word_length = 1
splitted_input,word_length = combine_words(splitted_input,word_length)

intermediate_output = False

while len(splitted_input)>1:
    splitted_input = remove_duplicates(splitted_input,word_length) #look whether two sequences of length n (with distance n apart) are equal. If so, remove the n overlapping sequences
    splitted_input, word_length = combine_words(splitted_input,word_length) #make even bigger sequences
    if intermediate_output:
        print(splitted_input)
        print(word_length)
output = splitted_input[0] #In the end you have a list of length 1, with all possible lengths of repetitive words removed

which outputs a fluent

what type of people were most likely to be able to be 1.35 ?

Even though it is not the desired ouput, I don't see how it would recognize to remove "to be" (of length 2) which occured 3 places earlier.

Agree with your answer.As described by @ComplicatedPhenomenon. — , Aug 09 '19 at 07:47
I don't think it gets any nicer than this without going into grammar. I'd say pick your answer. Also note that if the questionmark would be connected to the last word of a sentence, this might give some trouble as the last word has changed by this and hence not be found as a duplicate word. — Harm Campmans, Aug 09 '19 at 08:01
I feel like you should be able to time it yourself with the `timeit` package mentioned by @SayandipDutta. It runs slightly slower than the other answer. — Harm Campmans, Aug 09 '19 at 09:55

score 1 · Answer 2 · answered Aug 09 '19 at 06:56

1

I am pretty sure in this approach the order is maintained in Python 3.7, I am not exactly sure of older versions.

String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
unique_words = dict.fromkeys(String.split())
print(' '.join(unique_words))
>>> what type of people were most likely to be able 1.35 ?

answered Aug 09 '19 at 06:56

Sayandip Dutta

15,602
4
23
52

This works, any info on the time complexity ? – Aug 09 '19 at 07:08
1

Right, if put grammar mistake into consideration, like`String = "what type of people were most likely to be able to be able to work?"` will become `what type of people were most likely to be able work?` the problem become much diffcult. – ComplicatedPhenomenon Aug 09 '19 at 07:09
1

@ComplicatedPhenomenon Yeah exactly a good example where it looses the `to` at the end – Aug 09 '19 at 07:13
Yeah, well if you incorporate grammar that's a whole different ballgame, I just tried to match the expected output by OP. Btw, @Dev `timeit` for 100000 iterations gives `0.25722896799925365 s` – Sayandip Dutta Aug 09 '19 at 07:14
@SayandipDutta Well i don't want it to be incorporated with grammar anyways so let me use this for now. Thanks! ,Will be waiting for any different approach that could fix the grammar as well if not ill accept the answer. – Aug 09 '19 at 07:22
Yeah, I am also waiting for an approach with grammar fixed, I will see if I can do this without NLTK. But I feel you should add a note to your post saying 'if possible maintain grammar' or something, otherwise people may not try. – Sayandip Dutta Aug 09 '19 at 07:30

How to efficiently remove consecutive duplicate words or phrases in a string

2 Answers2