
I want to join multiple sets of characters iteratively in a string. Example:

mystr = 'T h i s _ i s _ a _ s e n t e n c e'
joins = [('e', 'n'), ('en', 't'), ('i', 's'), ('h', 'is')]

# do multiple replace
for bigram in joins:
  mystr = mystr.replace(' '.join(bigram), ''.join(bigram))
print(mystr)
# prints: T his _ is _ a _ s ent en c e

In the first iteration it joins e n into en, then en t into ent, and so on. It's important that the joins are done in order, since the join ('en', 't') doesn't work unless ('e', 'n') has already been joined.
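The order dependence can be checked directly. This is a minimal sketch; `apply_joins` is a hypothetical helper name introduced here for illustration, and it wraps the same loop as above.

```python
def apply_joins(text, joins):
    """Apply each join in order: 'a b' becomes 'ab' everywhere in text."""
    for pair in joins:
        text = text.replace(' '.join(pair), ''.join(pair))
    return text


mystr = 'T h i s _ i s _ a _ s e n t e n c e'

# Original order: ('e', 'n') runs first, so ('en', 't') can find 'en t'.
in_order = apply_joins(mystr, [('e', 'n'), ('en', 't'), ('i', 's'), ('h', 'is')])

# Swapped order: ('en', 't') runs before any 'en' exists, so it matches nothing.
swapped = apply_joins(mystr, [('en', 't'), ('e', 'n'), ('i', 's'), ('h', 'is')])

print(in_order)  # T his _ is _ a _ s ent en c e
print(swapped)   # T his _ is _ a _ s en t en c e
```

With the first two joins swapped, 'en t' is never merged into 'ent', which is why the list cannot simply be reordered or applied as one unordered set.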

With a string of 20MB and 10k joins, this takes a while. I'm looking to optimize this, but I don't know how. Some of the things I've discarded:

  • I didn't use regex like in this question because I don't know how to do re.sub where the substitution is the match itself but joined together
  • I didn't use str.translate like this question either because as far as I know, translate can only translate single characters, and in my joins there are multiple
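On the first point, one possible way to make the re.sub substitution "the match itself but joined together" is to pass a function as the replacement. The sketch below is only an illustration: `join_pair` is a hypothetical helper name, and the lookarounds `(?<!\S)` and `(?!\S)` are an added assumption that a join should only apply to whole space-separated tokens. Plain str.replace does not check this, so for example 'he n' would become 'hen' when joining ('e', 'n').

```python
import re


def join_pair(text, pair):
    """Join one pair of adjacent space-separated tokens using re.sub."""
    # re.escape guards against tokens that contain regex metacharacters.
    # (?<!\S) and (?!\S) require whitespace or the string edge on both sides,
    # so only whole tokens match (an assumption about the intended behaviour).
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    # The replacement is the matched text with its single space removed.
    return pattern.sub(lambda m: m.group(0).replace(' ', ''), text)


mystr = 'T h i s _ i s _ a _ s e n t e n c e'
joins = [('e', 'n'), ('en', 't'), ('i', 's'), ('h', 'is')]

# The joins are still applied one at a time, in order.
for pair in joins:
    mystr = join_pair(mystr, pair)
print(mystr)  # T his _ is _ a _ s ent en c e
```

This addresses how the substitution can be written, not speed: it still makes one pass over the string per join, so whether it is any faster than str.replace would need to be measured.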

Is there any algorithm, string or regex or any other function that would allow me to do this? Thank you!

Ane

1 Answer


The straightforward way would be:

mystr = 'T h i s _ i s _ a _ s e n t e n c e'

bigrams = [('e', 'n'), ('en', 't'), ('i', 's'), ('h', 'is')]
for first_part, second_part in bigrams:
    mystr = mystr.replace(first_part + ' ' + second_part, first_part + second_part)
print(mystr)

Prints:

T his _ is _ a _ s ent en c e

A second way:

mystr = 'T h i s _ i s _ a _ s e n t e n c e'

bigrams = [('e', 'n'), ('en', 't'), ('i', 's'), ('h', 'is')]
for bigram in bigrams:
    mystr = mystr.replace(' '.join(bigram), ''.join(bigram))
print(mystr)

Both versions produce the same result; you would have to benchmark the two approaches on your own data to see which one is faster.
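A benchmark along these lines could be set up with the standard timeit module. This is a rough sketch: the function names `with_concat` and `with_join`, the synthetic input, and the repeat count are all assumptions chosen for illustration, and real timings will depend on the actual string and list of joins.

```python
import timeit

# Small synthetic input; the size and repeat count are arbitrary assumptions.
text = 'T h i s _ i s _ a _ s e n t e n c e _ ' * 10_000
bigrams = [('e', 'n'), ('en', 't'), ('i', 's'), ('h', 'is')]


def with_concat(s):
    """First approach: build the search and replacement strings with +."""
    for first_part, second_part in bigrams:
        s = s.replace(first_part + ' ' + second_part, first_part + second_part)
    return s


def with_join(s):
    """Second approach: build them with str.join."""
    for bigram in bigrams:
        s = s.replace(' '.join(bigram), ''.join(bigram))
    return s


# Both variants must produce the same result before timing means anything.
assert with_concat(text) == with_join(text)

for func in (with_concat, with_join):
    seconds = timeit.timeit(lambda: func(text), number=20)
    print(f'{func.__name__}: {seconds:.3f} s for 20 runs')
```

Both variants call str.replace the same number of times on the same string, so any difference comes only from how the short search strings are built, and it is likely to be small compared with the cost of the replacements themselves.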

Booboo
  • Isn't this exactly what I've done? The only difference is whether the tuple is used whole or unpacked into its items, but it's the same thing, right? – Ane Dec 14 '20 at 10:53
  • Yes. But what you need to do is to take a subset of your bigrams and/or a smaller string and benchmark the two ways of doing the replace and see which one performs better. – Booboo Dec 14 '20 at 12:17
  • Thank you. I tried it, and your first approach is slightly slower than the one I had. My goal was to find a way to bypass the loop and do all the replacements in one place, but it seems the loop is unavoidable. – Ane Dec 14 '20 at 13:01