Replace single instances of a character that is sometimes doubled

Question

I have a string in which every character is separated by a pipe character (including the "|"s themselves), for example:

"f|u|n|n|y||b|o|y||a||c|a|t"

I would like to replace all "|"s which are not next to another "|" with nothing, to get the result:

"funny|boy|a|cat"

I tried using mytext.replace("|", ""), but that removes everything and makes one long word.

It is not true that "*each character (including "|") being separated by a "|" character*". If that was true, you would have `"f|u|n|n|y|||b|o|y|||a|||c|a|t"`. — zvone, Dec 25 '15 at 16:45
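
To illustrate the comment above, here is a minimal sketch, assuming the string was meant to be built by joining every character of the original text with "|". The helper names `encode` and `decode` are hypothetical and do not appear in the question.

```python
def encode(text, sep="|"):
    # Put the separator between every pair of adjacent characters,
    # including characters that are themselves the separator.
    return sep.join(text)


def decode(encoded):
    # In a strictly alternating string every even index holds an
    # original character, so a step-2 slice recovers the text.
    return encoded[::2]


encoded = encode("funny|boy|a|cat")
print(encoded)          # f|u|n|n|y|||b|o|y|||a|||c|a|t
print(decode(encoded))  # funny|boy|a|cat
```

With this encoding a real pipe shows up as "|||", which is the form the comment describes, rather than the "||" used in the question.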

score 29 · Answer 1 · answered Dec 25 '15 at 16:30


This can be achieved with a relatively simple regex without having to chain str.replace:

>>> import re
>>> s = "f|u|n|n|y||b|o|y||a||c|a|t"
>>> re.sub(r'\|(?!\|)', '', s)
'funny|boy|a|cat'

Explanation: \|(?!\|) will look for a | character which is not followed by another | character. (?!foo) means negative lookahead, ensuring that whatever you are matching is not followed by foo.
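
As a hedged sketch of the same idea, the pattern can be built for any single-character separator with `re.escape`. The function name `drop_single` and its `sep` parameter are assumptions made for illustration; they are not part of the answer.

```python
import re


def drop_single(text, sep="|"):
    # re.escape makes the separator safe to embed in a pattern.
    esc = re.escape(sep)
    # Match a separator that is NOT followed by another separator.
    pattern = re.compile(esc + "(?!" + esc + ")")
    return pattern.sub("", text)


print(drop_single("f|u|n|n|y||b|o|y||a||c|a|t"))  # funny|boy|a|cat
print(drop_single("a-b--c", sep="-"))             # ab-c
```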


timgeb


+1 this method won't break when the format changes to allow `-` characters, unlike the currently top-voted answer. – BlueRaja - Danny Pflughoeft Dec 25 '15 at 21:02
@BlueRaja-DannyPflughoeft True! Also look at the timing comparison in Padraic's [answer](http://stackoverflow.com/a/34472163/4099593). Regards – Bhargav Rao Dec 26 '15 at 17:06
@BhargavRao Unless you're processing millions of strings, or strings of millions of characters, performance is irrelevant. – BlueRaja - Danny Pflughoeft Dec 27 '15 at 01:42

Bhargav Rao · Answer 2 · 2015-12-26T05:43:52.220

28

Use sentinel values

Replace each || with ~ so that the double pipes are remembered. Then remove the remaining |s. Finally, turn each ~ back into |.

>>> s = "f|u|n|n|y||b|o|y||a||c|a|t"
>>> s.replace('||','~').replace('|','').replace('~','|')
'funny|boy|a|cat'

A better way is to use the fact that the characters and pipes almost alternate. Turn each || into ||| so that they alternate completely, then take every second character:

s.replace('||','|||')[::2]
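
The slicing trick only works when the string strictly alternates between a character and a pipe once each "||" has been widened to "|||". Below is a minimal sketch that checks this assumption before slicing; the function name `unpipe_by_slicing` is hypothetical.

```python
def unpipe_by_slicing(s):
    # Widen every "||" to "|||" so that characters and pipes alternate.
    widened = s.replace("||", "|||")
    # Every odd index must now hold a separator pipe; otherwise the
    # input was not in the expected format and slicing would silently
    # drop real characters.
    if any(ch != "|" for ch in widened[1::2]):
        raise ValueError("input does not alternate between characters and pipes")
    return widened[::2]


print(unpipe_by_slicing("f|u|n|n|y||b|o|y||a||c|a|t"))  # funny|boy|a|cat
```

The check costs an extra pass over the string, so it trades some of the speed of the one-liner for an explicit error on malformed input.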

edited Dec 26 '15 at 05:43

answered Dec 25 '15 at 16:30

Bhargav Rao


This is 5 times faster than the regex solutions with the regex compiled, `700 ns vs 3.9 µs` – Padraic Cunningham Dec 25 '15 at 16:46
@Padraic Yep, they are "Regex" ;) – Bhargav Rao Dec 25 '15 at 16:48
Does this solution still work if somebody uses the sentinel value in the input? – hagello Dec 26 '15 at 00:36
@hagello sentinel value is defined as that which is not present in the input. So yes, the sentinel must be chosen with care – Bhargav Rao Dec 26 '15 at 05:07
You write the code (defining the sentinel value) first. Afterwards the user creates the input. He has to create "correct input", this means without sentinel. The developer delegates the responsibility on the user. If you code defensively you have to check that the sentinel is not part of the input. – hagello Dec 26 '15 at 09:21
@hagello Yeah, there are a few unprintable ASCII characters that can't be in the input. So you can certainly use a combination of them also – Bhargav Rao Dec 26 '15 at 09:40
@bhargav "Unprintable" is the same as "impossible"? – hagello Dec 26 '15 at 11:11
@hagello Nope, but it can handle more cases. I mentioned it as an alternative. The programmer will certainly know what is best in exceptional circumstances (where we can use LBYL or EAFP.) – Bhargav Rao Dec 26 '15 at 11:15
yes. not sure why everyone is so excited by the regex solutions and there is also a faster way to do it using a re.split and str.join than any other method provided, worst case splitting on `||` and doing the replace after is faster than any regex and replace will work in most cases. – Padraic Cunningham Dec 26 '15 at 14:36
@PadraicCunningham Yep. Regex excites everyone I guess. The problem is they are not taking the performance under consideration. If it looks nice, that's enough. – Bhargav Rao Dec 26 '15 at 14:38
Yep and readability also counts, much easier read a list comp or a replace than a regex and when the non regex is also faster it does not make a great deal of sense to want a regex – Padraic Cunningham Dec 26 '15 at 14:40

poke · Answer 3 · 2015-12-25T18:20:19.980

23

You could replace the double pipe by something else first to make sure that you can still recognize them after removing the single pipes. And then you replace those back to a pipe:

>>> t = "f|u|n|n|y||b|o|y||a||c|a|t"
>>> t.replace('||', '|-|').replace('|', '').replace('-', '|')
'funny|boy|a|cat'

You should try to choose a replacement value that is a safe temporary value and does not naturally appear in your text. Otherwise you will run into conflicts where that character is replaced even though it wasn’t a double pipe originally. So don’t use a dash as above if your text may contain a dash. You can also use multiple characters at once, for example: '<THIS IS A TEMPORARY PIPE>'.
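
One way to act on this advice is to pick the temporary value at run time instead of hard-coding it. This is a minimal sketch, assuming a control character or a private-use Unicode character is an acceptable candidate; the names `pick_sentinel` and `collapse_pipes` and the candidate list are hypothetical.

```python
def pick_sentinel(text, candidates="\x00\x01\x02\ue000~-"):
    # Return the first candidate character that does not occur in text.
    for ch in candidates:
        if ch not in text:
            return ch
    raise ValueError("every candidate sentinel occurs in the input")


def collapse_pipes(text):
    sentinel = pick_sentinel(text)
    # Protect the double pipes, drop the single ones, then restore.
    return text.replace("||", sentinel).replace("|", "").replace(sentinel, "|")


print(collapse_pipes("f|u|n|n|y||b|o|y||a||c|a|t"))  # funny|boy|a|cat
print(collapse_pipes("a|-|b||c"))                     # a-b|c
```

The second call shows that a dash in the text survives, because the dash is only used as the sentinel when no earlier candidate is free.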

If you want to avoid this conflict completely, you could also solve this entirely differently. For example, you could split the string by the double pipes first and perform a replacement on each substring, ultimately joining them back together:

>>> '|'.join([s.replace('|', '') for s in t.split('||')])
'funny|boy|a|cat'

And of course, you could also use regular expressions to replace those pipes that are not followed by another pipe:

>>> import re
>>> re.sub(r'\|(?!\|)', '', t)
'funny|boy|a|cat'

edited Dec 25 '15 at 18:20

answered Dec 25 '15 at 16:30

poke


is `'-'` -> `'|'` expected behaviour? – Caridorc Dec 25 '15 at 18:05
@Caridorc I explained that behavior in the second paragraph. Of course you could (and probably should if you know your input) use a better-suited temporary value to replace the double pipes with, one that has fewer if any conflicts. – poke Dec 25 '15 at 18:11
Correct, I suggest using an unprintable character, very unlikely to be in the original string – Caridorc Dec 25 '15 at 18:14
@Caridorc I’ve expanded the answer a bit more to highlight that issue a bit better. Hope that makes it clearer :) – poke Dec 25 '15 at 18:21
It is very clear now :) thanks for taking the time to explain in detail. – Caridorc Dec 25 '15 at 18:23
@WorldSEnder Joining a list comprehension [is more efficient](http://stackoverflow.com/a/9061024/216074) than joining on a generator expression. – poke Dec 25 '15 at 20:50
@poke, didn't know that, thanks for the enlightenment – WorldSEnder Dec 25 '15 at 20:52
@caridorc Do not ever make any assumptions on your input! A correct solution works on any input, even on the program itself. So please do not use the first solution and do not rely on any unprintable characters! – hagello Dec 26 '15 at 00:21
@hagello The answer very clearly lists out the problem with the simple (and *very efficient and fast*) solution and not only suggests using a safe replacement value but also shows two other solutions that do not have that problem. I don’t see how this makes the answer still so bad *in total* that a downvote is even remotely appropriate. A downvote says there is something wrong with the answer, but I don’t see how showing a simple and efficient solution, its downsides, and two alternatives is wrong at all. – poke Dec 26 '15 at 00:44
@poke Now I see a drawback of anonymous votes: If you add an unfavorable comment, you get blamed for any downvotes. By the way, I like your idea of splitting the input string. – hagello Dec 26 '15 at 01:02
@hagello I’m sorry for misattributing that complaint then; it appears to have been bad timing of that downvote and your comment. Still my point stands: Sometimes it makes sense to use a solution that has possible downsides (and yes, there are many situations in which you can make assumptions on the input). The fact that the simple replace solution is about three times as fast as the regular expression solution makes it valuable in some situations. – poke Dec 26 '15 at 01:12

Mazdak · Answer 4 · 2015-12-25T16:44:32.067

10

You can use a positive lookahead regex to remove the pipes that are followed by a lowercase letter or by the end of the string:

>>> import re
>>> st = "f|u|n|n|y||b|o|y||a||c|a|t" 
>>> re.sub(r'\|(?=[a-z]|$)',r'',st)
'funny|boy|a|cat'
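
Note that this pattern depends on its character class: it only removes a pipe that is followed by a lowercase ASCII letter or by the end of the string. The short sketch below compares it with a negative lookahead on input that contains an uppercase letter and digits; the sample string is made up for illustration.

```python
import re

sample = "A|1||b|2"

# Positive lookahead: only pipes before [a-z] or at the end are removed.
positive = re.sub(r'\|(?=[a-z]|$)', '', sample)
# Negative lookahead: any pipe not followed by another pipe is removed.
negative = re.sub(r'\|(?!\|)', '', sample)

print(positive)  # A|1|b|2
print(negative)  # A1|b2
```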

edited Dec 25 '15 at 16:44

answered Dec 25 '15 at 16:30

Mazdak


This would fail for `st = "f|u|n|n|y||b|o|y||a||c|a|t|"`, you would need to catch the ending pipe – Padraic Cunningham Dec 25 '15 at 16:43
@PadraicCunningham Yep, I added the anchor `$`. Thanks for the note. – Mazdak Dec 25 '15 at 16:44

Nighthacks · Answer 5 · 2015-12-25T17:02:45.033

6

Use regular expressions.

import re

line = "f|u|n|n|y||b|o|y||a||c|a|t" 
line = re.sub(r"(?!\|\|)(\|)", "", line)

print(line)

Output:

funny|boy|a|cat

edited Dec 25 '15 at 17:02

answered Dec 25 '15 at 16:50

Nighthacks


You don't need a capturing group. – Avinash Raj Dec 26 '15 at 02:31

Avinash Raj · Answer 6 · 2015-12-27T17:29:22.350

Another regex option, using a capturing group:

>>> import re
>>> re.sub(r'\|(\|?)', r'\1', "f|u|n|n|y||b|o|y||a||c|a|t")
'funny|boy|a|cat'

Explanation:

\| matches a pipe character. (\|?) captures the pipe character that follows it, if there is one. Replacing the match with \1 then substitutes the content of the first capturing group: for a single pipe that is an empty string, and for || it is the second pipe character.
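
To see what each match captures, the sketch below lists the matches with `re.finditer` on a short sample string (the sample is made up for illustration):

```python
import re

s = "a|b||c"
# Pair each whole match with the text that \1 would put in its place.
pairs = [(m.group(0), m.group(1)) for m in re.finditer(r'\|(\|?)', s)]
for whole, replacement in pairs:
    print(repr(whole), '->', repr(replacement))
# '|' -> ''
# '||' -> '|'
```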

Another trick through word and non-word boundaries...

>>> re.sub(r'\b\|\b|\b\|\B', '', "f|u|n|n|y||b|o|y||a||c|a|t|")
'funny|boy|a|cat'

Yet another one using negative lookbehind..

>>> re.sub(r'(?<!\|)\|', '', "f|u|n|n|y||b|o|y||a||c|a|t|")
'funny|boy|a|cat'

Bonus...

>>> re.sub(r'\|(\|)|\|', lambda m: m.group(1) if m.group(1) else '', "f|u|n|n|y||b|o|y||a||c|a|t")
'funny|boy|a|cat'

Oh, `re.sub` can take a callable... time to rewrite some code. +1 — timgeb, Dec 27 '15 at 17:25
ya, don't know that? Then you learned something new today :-) — Avinash Raj, Dec 27 '15 at 17:30

Padraic Cunningham · Answer 7 · 2015-12-26T14:22:43.727

If you are going to use a regex, the fastest method is to split and join:

In [18]: r = re.compile(r"\|(?!\|)")

In [19]: timeit "".join(r.split(s))
100000 loops, best of 3: 2.65 µs per loop

In [20]: "".join(r.split(s))
Out[20]: 'funny|boy|a|cat'

In [30]: r1 = re.compile(r'\|(?!\|)')

In [31]: timeit r1.sub("", s)
100000 loops, best of 3: 3.20 µs per loop

In [33]: r2 = re.compile(r"(?!\|\|)(\|)")

In [34]: timeit r2.sub("", s)
100000 loops, best of 3: 3.96 µs per loop

The str.split and str.replace methods are still faster:

In [38]: timeit '|'.join([ch.replace('|', '') for ch in s.split('||')])
The slowest run took 11.18 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 1.71 µs per loop

In [39]: timeit s.replace('||','|||')[::2]
1000000 loops, best of 3: 536 ns per loop

In [40]: timeit s.replace('||','~').replace('|','').replace('~','|')
1000000 loops, best of 3: 881 ns per loop

Which str.replace approach is usable depends on what can be in the string, but the str.split method will work no matter what characters the string contains.
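
For readers without IPython, a comparable measurement can be sketched with the standard `timeit` module. The labels in `candidates` and the loop count are arbitrary choices, and the figures will differ from those quoted above depending on machine and Python version.

```python
import re
import timeit

s = "f|u|n|n|y||b|o|y||a||c|a|t"
pattern = re.compile(r'\|(?!\|)')

candidates = {
    "regex sub": lambda: pattern.sub("", s),
    "regex split + join": lambda: "".join(pattern.split(s)),
    "str.split + join": lambda: "|".join([part.replace("|", "") for part in s.split("||")]),
    "replace + slice": lambda: s.replace("||", "|||")[::2],
    "sentinel replace": lambda: s.replace("||", "~").replace("|", "").replace("~", "|"),
}

loops = 10_000
for name, func in candidates.items():
    # Every approach must agree before its timing means anything.
    assert func() == "funny|boy|a|cat"
    seconds = timeit.timeit(func, number=loops)
    print(f"{name:20s} {seconds / loops * 1e9:8.0f} ns per call")
```

Each timing here includes the overhead of calling a lambda, so the absolute numbers are somewhat higher than a bare statement would give.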

True, we now have the results to prove also :) – Bhargav Rao Dec 26 '15 at 14:40
