2

I would like to convert the following string

"For "The" Win","Way "To" Go"

to

"For ""The"" Win","Way ""To"" Go"

The straightforward regex would be

str2 = re.sub(r'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)

i.e., Double the quotes that are

  1. Followed by a letter but not preceded by a comma or the beginning of line
  2. Preceded by a letter but not followed by a comma or the end of line

The problem is that I am using Python, and its regex engine does not allow alternatives of different widths (here ',' and '^') inside a lookbehind. I get the error

sre_constants.error: look-behind requires fixed-width pattern
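
A minimal sketch that reproduces the error, assuming only the standard `re` module. The helper name `compiles` is made up for illustration. The point it tries to show is that the alternation itself is accepted in a lookbehind as long as every alternative has the same width.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# ',' is one character wide and '^' is zero characters wide,
# so this lookbehind is not fixed-width and is rejected.
mixed_width = compiles(r'(?<!,|^)"')

# Both alternatives are one character wide, so this one is accepted.
same_width = compiles(r'(?<!,|;)"')

print(mixed_width, same_width)
```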

What I am looking for is a regex that will replace the '"' around 'The' and 'To' with '""'. I can use the following regex (an answer provided to another question)

\b\s*"(?!,|[ \t]*$)

but that consumes the space just before the 'The' and 'To' and I get the below

"For""The"" Win","Way""To"" Go"

Is there a workaround so that I can double the quotes around 'The' and 'To' without consuming the spaces just before them?
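
For reference, a short sketch of how the borrowed regex loses the space, using the sample string from the question. The `\s*` sits inside the match, so whatever whitespace it matched is replaced along with the quote.

```python
import re

str1 = '"For "The" Win","Way "To" Go"'

# \b\s*" matches the optional whitespace together with the quote,
# and the whole match is replaced by '""', so the space disappears.
result = re.sub(r'\b\s*"(?!,|[ \t]*$)', '""', str1)
print(result)
```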

SpikETidE
  • For such a string : ``"For "The" mar"vel"ous Win"``, do you want the quotation marks inside the noun to be changed or not ? – eyquem Nov 21 '13 at 13:27

5 Answers

2

Instead of saying not preceded by comma or the line start, say preceded by a non-comma character:

r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)'
perreal
  • If negation is used then I guess I have to find all possible characters that I have to negate. For Ex: r'(?<=[^,\n])"(?=\w)|(?<=\w)"(?!,|$)' – SpikETidE Nov 21 '13 at 11:27
  • @SpikETidE, does this not produce the desired output? – perreal Nov 21 '13 at 11:36
  • 1
    @SpikETidE Please, put portions of code between two characters ` ` at the left and two characters ` ` at the right of it. Click on **help** at the right of the comment window – eyquem Nov 21 '13 at 11:37
  • @perreal : It works with the modification that I have mentioned in my previous comment. – SpikETidE Nov 21 '13 at 11:45
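
A small sketch of the point raised in the comments above, assuming a two-line input invented for illustration. With `re.MULTILINE`, `[^,]` also matches the newline that precedes the opening quote of the second line, so that quote gets doubled unless `\n` is excluded from the class.

```python
import re

# Made-up two-line sample; each line is one quoted field.
two_lines = '"a "b" c"\n"d "e" f"'

# [^,] accepts the newline, so the second line's opening quote is doubled.
loose = re.sub(r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)', '""',
               two_lines, flags=re.MULTILINE)

# Adding \n to the negated class leaves that opening quote alone.
strict = re.sub(r'(?<=[^,\n])"(?=\w)|(?<=\w)"(?!,|$)', '""',
                two_lines, flags=re.MULTILINE)

print(loose)
print(strict)
```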
2

Looks to me like you don't need to bother with anchors.

  • If there is a character before the quote, you know it's not at the beginning of the string.
  • If that character is not a newline, you're not at the beginning of a line.
  • If the character is not a comma, you're not at the beginning of a field.

So you don't need to use anchors, just do a positive lookbehind/lookahead for a single character:

result = re.sub(r'(?<=[^",\r\n])"(?=[^,"\r\n])', '""', subject)

I threw in the " on the chance that there might be some quotes that are already escaped. But realistically, if that's the case you're probably screwed anyway. ;)
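
A quick sketch of this pattern on the question's sample, plus a second made-up line to check that no anchors or flags are needed across a line break.

```python
import re

pattern = r'(?<=[^",\r\n])"(?=[^,"\r\n])'

subject = '"For "The" Win","Way "To" Go"'
result = re.sub(pattern, '""', subject)

# Two lines: the quotes next to the newline are protected by the
# \r\n entries in the character classes, not by ^ or $.
two_lines = '"For "The" Win"\n"Way "To" Go"'
result_two_lines = re.sub(pattern, '""', two_lines)

print(result)
print(result_two_lines)
```

Note that this doubles every inner quote, whether or not it touches a letter, which is a looser rule than the `\w` conditions in the question.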

eyquem
Alan Moore
  • Alas, the OP is the kind of person who waits for good answers even when asking badly written questions. See my comment on Markus's answer. – eyquem Nov 21 '13 at 12:54
  • The problem of this solution is that ``"For "The" " hourrah!" Win"`` is changed into ``"For ""The"" "" hourrah!"" Win"`` while it should be into ``"For ""The"" " hourrah!" Win"`` ; in case the conditions ``(?=\w)`` and ``(?<=\w)`` in the OP's question are really what he wants, which I'm not sure of. – eyquem Nov 21 '13 at 13:23
  • I figured the OP was limiting his thinking to a specific example where the quotes he needs to escape always happen to be next to letters, and he was letting that detail get in his way. My attitude is that we're not here to answer questions so much as to help people find the right questions to ask. But, as you observed in the first comment, not everyone feels the same way. – Alan Moore Nov 21 '13 at 14:59
  • _"we're not here to answer questions so much as to help people find the right questions to ask"_ It depends on how much effort someone is willing to put in to help someone else. I confess it's disappointing when the question is this badly written. I fear I shouldn't be disappointed in the case of an OP mainly practicing PHP, which isn't a language that teaches rigor... – eyquem Nov 21 '13 at 15:30
1
re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
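
A runnable sketch of the line above on the question's sample string. The whitespace is captured in group 1 and written back with `\1`, so it is kept rather than consumed.

```python
import re

s = '"For "The" Win","Way "To" Go"'

# (\s*) captures the whitespace before the quote; r'\1""' puts it back
# and then emits the doubled quote.
result = re.sub(r'\b(\s*)"(?!,|[ \t]*$)', r'\1""', s)
print(result)
```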
Markus Jarderot
  • @SpikETidE So, ``'(?<!,|^)"(?=\w)|(?<=\w)"(?!,|$)'`` isn't the straightforward regex pattern that would be right if Python had variable-length lookbehind assertions, and the conditions ``not preceded by a comma or the beginning of line`` and ``not followed by a comma or the end of line`` are not the real conditions, and the examples you wrote are not good examples, and after 4 years of membership and 49 questions from you on SO, you still don't know how to ask a question. I'm not the only one who was led to misunderstand, for all the answerers other than Markus did the same as me. – eyquem Nov 21 '13 at 12:50
1

Most direct workaround whenever you encounter this issue: explode the look-behind into two look-behinds.

str2 = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""', str1,flags=re.MULTILINE)

(don't name your strings str)
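
A brief sketch of the split lookbehind on a made-up two-line input. `(?<!,)` is one character wide and `(?<!^)` is zero characters wide, and each is fixed-width on its own, so Python accepts them side by side.

```python
import re

# Made-up two-line sample; each line is one quoted field.
two_lines = '"For "The" Win"\n"Way "To" Go"'

# Two consecutive negative lookbehinds act as "not A and not B",
# which is the same condition as the rejected (?<!,|^).
result = re.sub(r'(?<!,)(?<!^)"(?=\w)|(?<=\w)"(?!,|$)', '""',
                two_lines, flags=re.MULTILINE)
print(result)
```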

roippi
  • @SpikETidE perhaps try it again? I just tested and it works properly. As it should - the two lookbehind assertions are logically equivalent. – roippi Nov 21 '13 at 11:23
  • @ roippi : Sorry about that. There was a difference in my test string and that made it fail. I deleted my comment just before you posted the reply. – SpikETidE Nov 21 '13 at 11:24
0
str2 = re.sub('(?<=[^,])"(?=\w)'
              '|'
              '(?<=\w)"(?!,|$)',

              '""',  ss,
              flags=re.MULTILINE)

I always wonder why people use raw strings for regex patterns when it isn't needed.

Note that I renamed your str, which is the name of a built-in class, to ss.

For "fun":

str2 = re.sub('"'
              '('
              '(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)'
              ')',

              '""', ss,
              flags=re.MULTILINE)

or also

str2 = re.sub('(?<=[^,]")(?=\w)'
              '|'
              '(?<=\w")(?!,|$)',

              '"',  ss,
              flags=re.MULTILINE)
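
A sketch checking that the three variants in this answer give the same result on the question's sample. The names `a`, `b` and `c` are made up here, and the patterns are written as raw strings only to keep the snippet free of invalid-escape warnings on recent Python versions.

```python
import re

ss = '"For "The" Win","Way "To" Go"'

# Variant 1: match the quote itself, lookarounds on both sides.
a = re.sub(r'(?<=[^,])"(?=\w)|(?<=\w)"(?!,|$)', '""', ss, flags=re.MULTILINE)

# Variant 2: consume the quote first, then look back over it.
b = re.sub(r'"((?<=[^,]")(?=\w)|(?<=\w")(?!,|$))', '""', ss, flags=re.MULTILINE)

# Variant 3: a zero-width match just after the quote; insert one more quote there.
c = re.sub(r'(?<=[^,]")(?=\w)|(?<=\w")(?!,|$)', '"', ss, flags=re.MULTILINE)

print(a == b == c)
```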
eyquem
  • 1
    Without `r`, the string will be processed as a normal string, with string escapes. Some escapes in regex have a different meaning in a non-raw string literal (`\1` and `\b` are two examples). Instead of adjusting the "rawness" of a string depending on whether or not you use those, it is easier to always put `r` on regexes. – Markus Jarderot Nov 21 '13 at 10:44
  • I know, I know. Personally, since few of the regex patterns I write contain ``\1`` or ``\b``, I prefer not to put ``r`` in front of every pattern, and to write ``\\1 \\2`` and ``\\b`` when necessary. By the way, it's strange that we have to write ``\\b`` outside a raw string while ``\d \w \s etc`` don't need the same treatment. I don't recall whether any special sequences other than the ones you cite need to be double-backslashed in a non-raw string or written in a raw string to work correctly. Do you? – eyquem Nov 21 '13 at 11:29
  • @Markus In fact, what is strange is that an escape based on the letter ``b`` was chosen both in strings, to represent a backspace, and in regex patterns, to represent a boundary. So the principle that makes ``'\r'`` and ``'\\r'`` equivalent in a non-raw-string regex pattern can't apply to escapes of ``b``: in a regex pattern, a non-raw-string ``'\b'`` means a backspace, while ``'\\b'`` in a non-raw-string pattern or ``r'\b'`` means a boundary. – eyquem Nov 21 '13 at 15:08
  • Consequently, the non-raw-string regex pattern ``'(Mar.)\b'`` is the only way to match ``Mari`` in ``'c:\Mary\yellow\Jimmy_and_Mari\bushka'`` and not ``Mary`` – eyquem Nov 21 '13 at 15:15
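
A small sketch of the `\b` difference discussed in these comments. The variable names are made up for illustration.

```python
import re

# In a normal string literal, '\b' is a single backspace character (0x08).
plain = '\bcat'
# In a raw literal the backslash is kept, so re receives the \b word-boundary escape.
raw = r'\bcat'

plain_hit = re.search(plain, 'a cat')  # looks for a backspace followed by "cat"
raw_hit = re.search(raw, 'a cat')      # looks for "cat" starting at a word boundary

# A doubled backslash in a normal literal spells the same pattern text as the raw form.
same_text = ('\\bcat' == raw)

print(len(plain), len(raw), plain_hit, raw_hit.group(), same_text)
```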