Using ast.literal_eval(s) works great as long as there are no unescaped quotes present. We can fix that with a regex. Unfortunately, it's not that easy; however, I've got a silver bullet for you.
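To make the problem concrete, here is a minimal sketch of how literal_eval behaves on a well-formed list compared with lists that contain unescaped inner quotes. The helper name try_parse is made up for this illustration and is not part of the solution.

```python
import ast

def try_parse(s):
    """Return the parsed value, or None if s is not a valid Python literal."""
    try:
        return ast.literal_eval(s)
    except (SyntaxError, ValueError):
        return None

print(try_parse('["A", "B", \'C\']'))    # well-formed: parses fine
print(try_parse('["A", ""B"", \'C\']'))  # unescaped inner double quotes: rejected
print(try_parse("['8 o'clock']"))        # unescaped apostrophe: rejected
```

The last two inputs are the kind of strings the rest of this answer repairs before parsing.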
First, let's identify all quoted substrings that have inner quotes of the same type:
(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])
Explanation
(?<=[\[,])
front anchor using a lookbehind: the beginning of a list or the previous list element, i.e. [ or ,
\s*
zero or more whitespace characters
(['\"])
opening quote, captured in group $1
(?:(\1)|.)*?
lazily match any character; if it is the same quote as in $1, capture it in $2
\1
closing quote, matched using a backreference to $1
(?=\s*[,\]])
rear anchor using a lookahead: , or ]
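As a quick check of the breakdown above, the following sketch runs the pattern over a single sample line and keeps only the elements where $2 is set. The variable names elements and flagged are just for this illustration.

```python
import re

# Same pattern as above: one match per quoted list element; $2 is set when an
# inner quote of the same type as the opening quote was consumed.
regex1 = r"(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])"

line = '[ "A", ""B"",\'C\' , " D"]'

# Every quoted element, including its leading whitespace
elements = [m.group() for m in re.finditer(regex1, line)]

# Only the elements that contain an inner quote of the same type
flagged = [m.group() for m in re.finditer(regex1, line) if m.group(2) is not None]

print(elements)
print(flagged)
```

Note that an element such as "Ol' 8 o'clock" would not be flagged, because its inner quotes differ from the enclosing ones and therefore never end up in $2.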
Now the real fun starts. Next, search the substrings that have inner quotes of the same type for unescaped quotes.
To do so, I use the skip-what-to-avoid approach (match everything you do not want, capture only what you do) combined with an advanced regex technique, the unrolled loop:
^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*
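Before looking at how the matches are used, here is the unrolled loop on its own, as it appears in the first alternative of the pattern. This is only an illustrative sketch; the name quoted is an assumption made for the example.

```python
import re

# Unrolled loop for a double-quoted string with backslash escapes:
#   "                 opening quote
#   [^"\\]*           run of "normal" characters (no quote, no backslash)
#   (?:\\.[^"\\]*)*   an escape sequence followed by another normal run, repeated
#   "                 closing quote
quoted = re.compile(r'"[^"\\]*(?:\\.[^"\\]*)*"')

print(bool(quoted.fullmatch(r'"a \"b\" c"')))  # inner quotes are escaped
print(bool(quoted.fullmatch('"a "b" c"')))     # inner quotes are not escaped
```

Only the first string matches as a whole, which is exactly how the full pattern recognizes substrings that need no repair.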
The key idea here is to completely disregard the overall matches returned by the regex engine: $0 is the trash bin. Instead, we only need to check capture group $1, which, when set, contains what we are looking for. Basically, the uncaptured alternatives match everything that is fine as it is: properly escaped substrings enclosed in quotation marks, the opening and closing quotes, and escape sequences. We only want the unescaped quotes.
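A minimal sketch of this idea: iterate over all matches, throw away those where $1 is empty, and record where the remaining quotes sit. The function name unescaped_quote_offsets is hypothetical.

```python
import re

regex2 = r'''^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*'''

def unescaped_quote_offsets(quoted):
    """Offsets of quotes that need escaping; matches without $1 are discarded."""
    return [m.start(1) for m in re.finditer(regex2, quoted) if m.group(1) is not None]

print(unescaped_quote_offsets('""B""'))           # two inner quotes
print(unescaped_quote_offsets("'8 o'clock'"))     # one inner apostrophe
print(unescaped_quote_offsets(r'"say \"hi\""'))   # already escaped, nothing to do
```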
I decided to reassemble the original string, adding the necessary \ to properly escape those unescaped quotes. There are certainly other ways to do it.
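The reassembly can be sketched for a single quoted substring as follows. This is a simplified variant for illustration, not the full code: it works on one substring instead of tracking indexes into the whole input, and the name escape_inner_quotes is an assumption.

```python
import ast
import re

regex2 = r'''^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*'''

def escape_inner_quotes(quoted):
    """Rebuild the substring, inserting a backslash before each quote captured in $1."""
    pieces = []
    last = 0
    for m in re.finditer(regex2, quoted):
        if m.group(1) is not None:
            pieces.append(quoted[last:m.start(1)])  # text before the unescaped quote
            pieces.append("\\")                     # the missing escape character
            last = m.start(1)
    pieces.append(quoted[last:])                    # whatever follows the last insertion point
    return "".join(pieces)

fixed = escape_inner_quotes("'8 o'clock'")
print(fixed)
print(ast.literal_eval(fixed))
```

Substrings that are already valid pass through unchanged, because none of their matches set $1.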
Finally, putting it all together, here is the full sample code:
import re
import ast
test = ("[ \"A\", \"\"B\"\",'C' , \" D\"]\n"
"[ \"A\", \"'B'\",'C' , \" D\"]\n"
"[ \"A\", ''B'','C' , \" D\"]\n"
"[ \"A\", '\"B\"','C' , \" D\"]\n"
"[ \"A\", '8 o'clock','C' , \" D\"]\n"
"[ \"A\", \"Ol' 8 o'clock\",'C' , \" D\"]\n"
"[\"Some Text\"]\n"
"[\"Some more Text\"]\n"
"[\"Even more text about \\\"this text\\\"\"]\n"
"[\"Ol' 8 o'clock\"]\n"
"['8 o'clock']\n"
"[ '8 o'clock']\n"
"['Ol' 8 o'clock']\n"
"[\"\"B\"]\n"
"[\"\\\"B\"]\n"
"[\"\\\\\"B\"]\n"
"[\"\\\\\\\"B\"]\n"
"[\"\\\\\\\\\"B\"]")
result = u''
last_index = 0
regex1 = r"(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])" #nested quotes of the same type
regex2 = r'''^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*''' # unescaped quotes in $1
matches = re.finditer(regex1, test, re.MULTILINE)
for match in matches:
    if match.group(2) is not None:  # nested quotes of the same type present
        print(match.group())
        inner = re.finditer(regex2, match.group())
        for m in inner:
            if m.group(1) is not None:  # unescaped quote captured in $1
                # copy everything up to the unescaped quote, then add the backslash
                result += test[last_index:match.start() + m.start()] + '\\' + m.group()
                last_index = match.start() + m.end()
result += test[last_index:]  # append the remainder of the input
print(result)
for test_str in result.split("\n"):
    parsed = ast.literal_eval(test_str)
    print(parsed)
    print(type(parsed))
The test string contains several lists as string literals, covering all sorts of special cases: quotes of different and of the same type, escaped quotes, escaped escape sequences, etc. It is cleansed using the solution outlined above, split into lines, and each line is then individually converted to a list using literal_eval.
This is probably one of the most advanced regex solutions I have crafted so far.