-1

I need a regex that will extract quoted strings from a larger string.

To show you what I mean, imagine that the following is the content of the string to parse:

"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" 

where "..." denotes some arbitrary data.

The regex would need to return the list:

["a string", "another \"string\"\\", "yet another \"string"] 

Edit: Note that the literal backslashes before the quotes don't stop the second match.

I've tried finditer but it won't find overlapping matches, and I tried the lookahead (?=) but I couldn't get that to work either.

Help?

Alan Moore
globby
  • What have you attempted so far? Please provide that. – hwnd Aug 31 '14 at 00:50
  • possible duplicate of [Regex for quoted string with escaping quotes](http://stackoverflow.com/questions/249791/regex-for-quoted-string-with-escaping-quotes) – simonzack Aug 31 '14 at 13:57

4 Answers

1

You could try the regex below. It matches a string that starts at a " not preceded by a \ and runs up to the next " that is also not preceded by a \:

(?<!\\)".*?(?<!\\)"

>>> import re
>>> s = r'"a string" ... "another \"string\"" ... "yet another \"string" ... "failed string\"'
>>> m = re.findall(r'".*?[^\\]"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
>>> m = re.findall(r'".*?(?<!\\)"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']
>>> m = re.findall(r'(?<!\\)".*?(?<!\\)"', s)
>>> m
['"a string"', '"another \\"string\\""', '"yet another \\"string"']

UPDATE:

>>> s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" '
>>> m = re.findall(r'(?<!\\)".*?(?<!\\)"|(?<=\\\\)".*?\\\\"', s)
>>> m
['"a string"', '"another \\"string\\"\\\\"', '"yet another \\"string"']
>>> for i in m:
...     print(i)
... 
"a string"
"another \"string\"\\"
"yet another \"string"

Avinash Raj
  • That works to an extent, but I added a second case that I had forgotten about before: a literal backslash character before the quotes. Please refer to the OP, and if you could help out that would be great. – globby Aug 31 '14 at 01:56
  • @globby Updated. It's Python's behaviour of escaping the backslash one more time. – Avinash Raj Aug 31 '14 at 03:48
  • Actually, I figured it out myself. Thanks. Used `(?<!(?<!\\)\\)".*?(?<!(?<!\\)\\)"` – globby Aug 31 '14 at 04:27
  • I don't believe that is relevant, due to the fact that I didn't need it to match that case ;p – globby Aug 31 '14 at 05:47
  • Do you have any further problems? Both regexes match the same set of words, I think. – Avinash Raj Aug 31 '14 at 05:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/60324/discussion-between-avinash-raj-and-globby). – Avinash Raj Aug 31 '14 at 05:50
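
For reference, the nested-lookbehind pattern from globby's comment above can be checked against the sample input from the question. This is only a sketch: the names `UNESCAPED_QUOTE`, `pattern` and `matches` are made up for illustration, and the pattern is taken as posted rather than as a general solution.

```python
import re

# Sample input from the question, written as a raw string so the
# backslashes are taken literally.
s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" '

# A quote counts as a delimiter unless it is preceded by a lone backslash,
# i.e. a backslash that is not itself preceded by another backslash.
# (UNESCAPED_QUOTE is a hypothetical helper name, not from the thread.)
UNESCAPED_QUOTE = r'(?<!(?<!\\)\\)"'
pattern = re.compile(UNESCAPED_QUOTE + r'.*?' + UNESCAPED_QUOTE)

matches = pattern.findall(s)
for m in matches:
    print(m)
```

On this input it should print the three strings the question asks for, quotes included. The lookbehind only inspects the two characters before a quote, so an input such as `\\\"` (an escaped backslash followed by an escaped quote) would likely be misread as an unescaped quote.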
0

You can use this regex:

"[\w\s\\"]+(?<!\\)"

Edit: I noticed you updated your input sample. For the updated input, you can use this regex:

(?:\\\\"|")[\w\s\\"]+(?:\\\\"|(?<!\\)")

Federico Piazza
0

A way to emulate an atomic group (useful for reducing backtracking when the pattern must fail):

re.findall(r'"(?=((?:[^"\\]+|\\.)*))\1"', s)

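To make the moving parts of that pattern easier to see, here is the same regex spelled out with `re.VERBOSE` and run on the sample input from the question. This is a sketch, not part of the original answer; the variable names `pattern`, `bodies` and `quoted` are assumptions chosen for illustration.

```python
import re

# Sample input from the question (raw string, backslashes are literal).
s = r'"a string" ... \\"another \"string\"\\" ... "yet another \"string" ... "failed string\" '

pattern = re.compile(r'''
    "                   # opening quote
    (?=                 # lookahead: not re-entered on backtracking once it has matched
        (               # group 1: the string body
            (?: [^"\\]+ #   a run of characters that are neither quote nor backslash
              | \\.     #   or a backslash plus the character it escapes
            )*
        )
    )
    \1                  # consume exactly what the lookahead captured
    "                   # closing quote
''', re.VERBOSE)

# findall returns only group 1, so the surrounding quotes are dropped.
bodies = pattern.findall(s)

# finditer with group(0) keeps the surrounding quotes.
quoted = [m.group(0) for m in pattern.finditer(s)]

for q in quoted:
    print(q)
```

Because every backslash is consumed together with the character after it, an escaped backslash right before the closing quote (as in the second string) does not hide that quote, and the unterminated last string produces no match.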

Casimir et Hippolyte
0
("[^...]*?")(?=\s*\.\.\.|$)

You can try this.

See the demo. It works correctly and gives the required answer:

http://regex101.com/r/bJ6rZ5/4

vks