5

I would like to clean some input that was logged from my keyboard with python and regex. Especially when backspace was used to fix a mistake.

Example 1:

[in]:  'Helloo<BckSp> world'
[out]: 'Hello world'

This can be done with

re.sub(r'.<BckSp>', '', 'Helloo<BckSp> world')

Example 2:
However when I have several backspaces, I don't know how to delete exactly the same number of characters before:

[in]:  'Helllo<BckSp><BckSp>o world'
[out]: 'Hello world'

(Here I want to remove 'l' and 'o' before the two backspaces).

I could simply use re.sub(r'[^>]<BckSp>', '', line) several times until there is no <BckSp> left but I would like to find a more elegant / faster solution.

Does anyone know how to do this ?

Louis M
  • 4,036
  • 4
  • 21
  • 25
  • 1
    I think you cannot count with regex and just looping through your regex as suggested is the best way – Fallenhero Dec 27 '16 at 10:32
  • 1
    Is using regex a requirement (i.e. you're learning regex) or just your proposed solution? – Thijs van Dien Dec 27 '16 at 10:49
  • Yes I try to use regex to learn because I am not familiar with it yet. – Louis M Dec 27 '16 at 10:58
  • 1
    Please keep in mind then that, although there's likely some regex only solution without a loop, regex are not a first choice and in this case you'd be much better off with a simpler, better understandable solution. – Thijs van Dien Dec 27 '16 at 11:04
  • Thanks for the advice, I'll keep that in mind and may not use regex for this case then :) – Louis M Dec 27 '16 at 11:13

5 Answers5

2

It looks like Python does not support recursive regex. If you can use another language, you could try this:

.(?R)?<BckSp>

See: https://regex101.com/r/OirPNn/1

Fallenhero
  • 1,563
  • 1
  • 8
  • 17
2

It isn't very efficient but you can do that with the re module:

(?:[^<](?=[^<]*((?=(\1?))\2<BckSp>)))+\1

demo

This way you don't have to count, the pattern only uses the repetition.

(?: 
    [^<] # a character to remove
    (?=  # lookahead to reach the corresponding <BckSp>
        [^<]* # skip characters until the first <BckSp>
        (  # capture group 1: contains the <BckSp>s
            (?=(\1?))\2 # emulate an atomic group in place of \1?+
                        # The idea is to add the <BcKSp>s already matched in the
                        # previous repetitions if any to be sure that the following
                        # <BckSp> isn't already associated with a character
            <BckSp> # corresponding <BckSp>
        )
    )
)+ # each time the group is repeated, the capture group 1 is growing with a new <BckSp>

\1 # matches all the consecutive <BckSp> and ensures that there's no more character
   # between the last character to remove and the first <BckSp>

You can do the same with the regex module, but this time you don't need to emulate the possessive quantifier:

(?:[^<](?=[^<]*(\1?+<BckSp>)))+\1

demo

But with the regex module, you can also use the recursion (as @Fallenhero noticed it):

[^<](?R)?<BckSp>

demo

Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
1

Since there is no support for recursion/subroutine calls, no atomic groups/possessive quantifiers in Python re, you may remove these chars followed with backspaces in a loop:

import re
s = "Helllo\b\bo world"
r = re.compile("^\b+|[^\b]\b")
while r.search(s): 
    s = r.sub("", s)
print(s)

See the Python demo

The "^\b+|[^\b]\b" pattern will find 1+ backspace chars at the string start (with ^\b+) and [^\b]\b will find all non-overlapping occurrences of any char other than a backspace followed with a backspace.

Same approach in case a backspace is expressed as some enitity/tag like a literal <BckSp>:

import re
s = "Helllo<BckSp><BckSp>o world"
r = re.compile("^(?:<BckSp>)+|.<BckSp>", flags=re.S)
while r.search(s): 
    s = r.sub("", s)
print(s)

See another Python demo

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
1

In case the marker is single character you could just utilize stack which would give you the result in single pass:

s = "Helllo\b\bo world"
res = []

for c in s:
    if c == '\b':
        if res:
            del res[-1]
    else:
        res.append(c)

print(''.join(res)) # Hello world

In case the marker is literally '<BckSp>' or some other string with length greater than 1 you can use replace to substitute it to '\b' and use the solution above. This only works if you know that '\b' doesn't occur in the input. If you can't designate a substitute character you could use split and process the results:

s = 'Helllo<BckSp><BckSp>o world'
res = []

for part in s.split('<BckSp>'):
    if res:
        del res[-1]
    res.extend(part)

print(''.join(res)) # Hello world
niemmi
  • 17,113
  • 7
  • 35
  • 42
  • Nice approach. Would you have a workaround if the marker is `` ? Maybe replacing it with `\b` would be the simplest ... – Louis M Dec 27 '16 at 13:32
  • 1
    @LouisM Replace would be the easiest option if you know a character that doesn't occur in input. I've added alternative solution for a case where you can't designate any single character to use as substitute. – niemmi Dec 27 '16 at 14:02
1

Slightly verbose but you can use this lambda function to count # of <BckSp> occurrence and use substring routines to get your final output.

>>> bk = '<BckSp>'

>>> s = 'Helllo<BckSp><BckSp>o world'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello world

>>> s = 'Helloo<BckSp> world'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello world

>>> s = 'Helloo<BckSp> worl<BckSp>d'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello word

>>> s = 'Helllo<BckSp><BckSp>o world<BckSp><BckSp>k'
>>> print re.sub(r'(.*?)((?:' + bk + ')+)', lambda x: x.group(1)[0:len(x.group(1)) - len(x.group(2))/len(bk)], s)
Hello work
Community
  • 1
  • 1
anubhava
  • 761,203
  • 64
  • 569
  • 643