
I am trying to remove quoted sequences from a string. For the example below, my script works fine:

import re
doc = ' Doc = "This is a quoted string: this is cool!" '
cleanr = re.compile('\".*?\"')
doc = re.sub(cleanr, '', doc)
print(doc)

Result (as expected):

' Doc =  '

However, when the quoted sentence contains an escaped string, I am not able to remove the escaped sequence using the pattern that I think should be the right one:

import re
doc = ' Doc = "This is a quoted string: \"this is cool!\" " '
cleanr = re.compile('\\".*?\\"') # new pattern
doc = re.sub(cleanr, '', doc)
print(doc)

Result

'Doc = this is cool!'

Expected:

'Doc = "This is a quoted string: " '

Does anyone know what is happening? If the pattern '\\".*?\\"' is wrong, what would be the right one?

TigerhawkT3
Montenegrodr
  • When you send the first and second expressions to the `re` module, they both end up as the same expression because of runaway escaping. Use raw strings to avoid this issue. – TigerhawkT3 Jun 27 '16 at 11:04
  • That question is very well asked and clear, I really don't see any reason for downvoting it. – Maroun Jun 27 '16 at 11:05
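To illustrate the comment about escaping, here is a small sketch (Python 3; the names `first`, `second`, and `text` are made up for the illustration) that prints what the `re` module actually receives for each of the two patterns from the question:

```python
import re

first = '\".*?\"'      # non-raw literal: \" collapses to a plain "
second = '\\".*?\\"'   # non-raw literal: \\ collapses to a single backslash

print(first)   # ".*?"
print(second)  # \".*?\"

# Inside a regex, \" simply means a literal quote,
# so both patterns match exactly the same text.
text = 'a "b" c'
same = re.findall(first, text) == re.findall(second, text)
print(same)  # True
```

So the "new pattern" in the question is, as far as the regex engine is concerned, the same pattern as the first one.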

1 Answer


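In a normal (non-raw) string literal, \" is just another way of writing ", so doc doesn't actually contain any backslashes and a regex that looks for \" has nothing to match.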

Add the r prefix to the string literals. This makes them raw strings, so backslashes are kept as literal characters instead of being processed as escape sequences.

Try this:

>>> import re
>>> doc = r' Doc = "This is a quoted string: \"this is cool!\" " '
>>> cleanr = re.compile(r'\\".*?\\"')
>>> re.sub(cleanr, '', doc)
' Doc = "This is a quoted string:  " '
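To see why the r prefix matters, here is a minimal side-by-side sketch (Python 3; `plain` and `raw` are names chosen for the illustration). It checks whether each version of the string really contains a backslash and then applies the same compiled pattern to both:

```python
import re

plain = ' Doc = "This is a quoted string: \"this is cool!\" " '
raw = r' Doc = "This is a quoted string: \"this is cool!\" " '

print('\\' in plain)  # False: the backslashes were consumed by the literal
print('\\' in raw)    # True: the raw literal keeps them

# Regex: a literal backslash, a quote, anything (non-greedy),
# then another literal backslash and quote.
cleanr = re.compile(r'\\".*?\\"')

print(repr(cleanr.sub('', plain)))  # unchanged, there is nothing to match
print(repr(cleanr.sub('', raw)))    # ' Doc = "This is a quoted string:  " '
```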
Maroun
  • Thank you guys for the prompt answer. It worked perfectly. – Montenegrodr Jun 27 '16 at 11:09
  • Note that this answer assumes that you are able to define your `doc` as a literal in your code. If you can do that, great. If you're getting it from another source, you better hope that it includes a literal backslash. – TigerhawkT3 Jun 27 '16 at 11:11
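As a rough sketch of the point in the last comment: the r prefix only affects how a literal is written in source code. Text that arrives from a file or another source is never escape-processed, so if the data contains backslashes they are already real characters. The example below assumes Python 3 and uses `io.StringIO` purely as a stand-in for such a source:

```python
import io
import re

# Stand-in for a file or network source. The doubled backslashes in this
# literal produce single, real backslash characters in the data.
source = io.StringIO('Doc = "This is a quoted string: \\"this is cool!\\" "')
doc = source.read()

# Same pattern as in the answer: backslash-quote ... backslash-quote.
cleanr = re.compile(r'\\".*?\\"')
cleaned = cleanr.sub('', doc)
print(cleaned)  # Doc = "This is a quoted string:  "
```

If the incoming text has plain quotes with no backslashes, this pattern will not match anything, and a different approach would be needed.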