Python : transform complex list-like string into list (without SyntaxError for mixing bytes and nonbytes literal)

Question

I have this string : u'''[ "A", """B""",'C' , " D"]'''

I would like to get a list like: ["A", """B""", 'C', " D"] (edited, as ""B"" was indeed not valid)

Some elements are wrapped in "" "" or ' ' so that they can contain ', ", or both.

I tried various solutions from : Convert string representation of list to list

s= u'''[ "A", """B""",'C' , " D"]'''

1)

import numpy as np
List = np.array(s).tolist()
print(List)
print(type(List))

Result :

[ "A", """B""",'C' , " D"]
<class 'str'>

So not a list, then :

2)

import ast
List = ast.literal_eval(s)
print(List)
print(type(List))

Result

SyntaxError: cannot mix bytes and nonbytes literals

Any other ideas on how to do this? (I'm using Python 3.6.)

I thought of regex or simple string replacements, but once I consider all the cases where characters like " and ' are escaped inside the list elements, it gets complicated really fast.

@bobrobbob Yeah and if it was valid like `u'''[ "A", "B",'C' , " D"]'''` you could just do `eval(s)` — U13-Forward, Jun 09 '18 at 09:47

wp78de · Answer 1 · 2018-06-10T23:20:02.213

Using ast.literal_eval(s) works great as long as there are no unescaped quotes present. We can fix those with regex. Unfortunately, it's not that easy, but I have a solution for you.
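To make the failure mode concrete, here is a minimal sketch (the helper name try_parse and both sample strings are my own, for illustration only). The well-formed list parses fine, while the ""B"" variant does not: Python reads "" as an empty string literal and B"" as a bytes literal (B is a valid bytes prefix), and adjacent str and bytes literals cannot be concatenated. The exact error text may differ between Python versions.

```python
import ast

# Hypothetical sample inputs: one well-formed, one with unescaped inner quotes.
well_formed = '''[ "A", "B",'C' , " D"]'''
malformed = '''[ "A", ""B"",'C' , " D"]'''


def try_parse(s):
    """Return the parsed list, or a short description of the SyntaxError."""
    try:
        return ast.literal_eval(s)
    except SyntaxError as exc:
        return "SyntaxError: {}".format(exc.msg)


print(try_parse(well_formed))
print(try_parse(malformed))
```

With any letter other than a string prefix in place of B, the result would still be a SyntaxError, just with a more generic message.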

First, let's identify all quoted substrings that have inner quotes of the same type:

(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])

Explanation

(?<=[\[,]) front anchor using lookbehind: beginning of a list or previous list element: [ or ,
\s* zero or more whitespace
(['\"]) opening quote in capture group
(?:(\1)|.)*? lazy match any character, if same quote as in $1 put it in $2
\1 closing quote using a back reference
(?=\s*[,\]]) rear anchor using lookahead: , or ]
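As a quick check of this first pattern, the following sketch lists only the elements in which group 2 was set, that is, elements containing an inner quote of the same type as their delimiter. The function name elements_with_nested_quotes and the sample strings are assumptions for illustration; the regex is the one shown above.

```python
import re

# First-stage pattern from the answer: group 1 is the opening quote,
# group 2 is set when the same quote character also occurs inside the element.
REGEX1 = r"(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])"


def elements_with_nested_quotes(s):
    """Return the list elements of s that contain an inner quote of the delimiter type."""
    return [
        m.group().strip()
        for m in re.finditer(REGEX1, s)
        if m.group(2) is not None
    ]


print(elements_with_nested_quotes('''[ "A", ""B"",'C' , " D"]'''))
print(elements_with_nested_quotes('''[ "A", "B",'C' , " D"]'''))
print(elements_with_nested_quotes('''['8 o'clock']'''))
```

This relies on Python's re module keeping the last value captured by a group inside a repetition, so group 2 stays set even when later iterations take the "." branch.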

Demo

Now, the real fun starts. Next, search substrings that have inner quotes of the same type for unescaped quotes.

To do so, I use the Skip_what_to_avoid approach together with some advanced regex techniques, such as an unrolled loop.

^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*

The key idea here is to completely disregard the overall matches returned by the regex engine: $0 is the trash bin. Instead, we only need to check capture group $1, which, when set, contains what we are looking for. Basically, all these uncaptured alternations match regular, properly escaped substrings enclosed in quotation marks. We only want the unescaped ones.
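The "trash bin" idea is easier to see when every match is listed next to a flag saying whether group 1 was set. This is only an inspection sketch; the function name classify and the sample element are mine, and the pattern is the second regex from above.

```python
import re

# Second-stage pattern from the answer: only matches with group 1 set
# start with an unescaped inner quote; everything else is "trash".
REGEX2 = r'''^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*'''


def classify(element):
    """Return (matched_text, starts_with_unescaped_quote) for each match in element."""
    return [
        (m.group(), m.group(1) is not None)
        for m in re.finditer(REGEX2, element)
    ]


for text, needs_escape in classify('""B""'):
    print(repr(text), needs_escape)
```

For the element ""B"", the opening and closing delimiters are consumed by the uncaptured alternatives, and only the two inner quotes end up flagged.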

Demo2

I decided to reassemble the original string with the necessary \ added to properly escape those unescaped quotes. There are certainly other ways.
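One of those other ways, sketched here as an assumption and not as the code used in this answer, is to let re.sub do the reassembly with two small callbacks. The names ELEMENT, UNESCAPED and parse_list_like are hypothetical; the two patterns are the ones shown above, and the function expects a single list per input string.

```python
import ast
import re

# Stage 1: list elements; group 2 is set if the delimiter quote recurs inside.
ELEMENT = re.compile(r"(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])")
# Stage 2: group 1 is set only for unescaped inner quotes.
UNESCAPED = re.compile(
    r'''^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*'''
)


def _escape(m):
    # Prefix a backslash only when the match starts with an unescaped quote.
    return "\\" + m.group() if m.group(1) is not None else m.group()


def _repair_element(m):
    # Leave elements without same-type inner quotes untouched.
    if m.group(2) is None:
        return m.group()
    return UNESCAPED.sub(_escape, m.group())


def parse_list_like(s):
    """Escape unescaped inner quotes in s, then hand it to ast.literal_eval."""
    return ast.literal_eval(ELEMENT.sub(_repair_element, s))


print(parse_list_like('''[ "A", ""B"",'C' , " D"]'''))
print(parse_list_like('''[ "A", '8 o'clock','C' , " D"]'''))
```

The behaviour should match the index-based reassembly, since re.sub also copies the text between matches unchanged; it just avoids tracking last_index by hand.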

Finally, putting it all together, here is the full sample code (online demo):

import re
import ast

test = ("[ \"A\", \"\"B\"\",'C' , \" D\"]\n"
    "[ \"A\", \"'B'\",'C' , \" D\"]\n"
    "[ \"A\", ''B'','C' , \" D\"]\n"
    "[ \"A\", '\"B\"','C' , \" D\"]\n"
    "[ \"A\", '8 o'clock','C' , \" D\"]\n"
    "[ \"A\", \"Ol' 8 o'clock\",'C' , \" D\"]\n"
    "[\"Some Text\"]\n"
    "[\"Some more Text\"]\n"
    "[\"Even more text about \\\"this text\\\"\"]\n"
    "[\"Ol' 8 o'clock\"]\n"
    "['8 o'clock']\n"
    "[ '8 o'clock']\n"
    "['Ol' 8 o'clock']\n"
    "[\"\"B\"]\n"
    "[\"\\\"B\"]\n"
    "[\"\\\\\"B\"]\n"
    "[\"\\\\\\\"B\"]\n"
    "[\"\\\\\\\\\"B\"]")
result = u''
last_index = 0
regex1 = r"(?<=[\[,])\s*(['\"])(?:(\1)|.)*?\1(?=\s*[,\]])" #nested quotes of the same type
regex2 = r'''^"[^"\\]*(?:\\.[^\"\\]*)*"$|^'[^'\\]*(?:\\.[^'\\]*)*'$|^\s*["'][^"'\\]*|["']\s*$|\\.|(["'])[^"'\\\n]*''' # unescaped quotes in $1
matches = re.finditer(regex1, test, re.MULTILINE)
for match in matches:
    if match.group(2) is not None:  # nested quotes of the same type present
        print(match.group())
        inner = re.finditer(regex2, match.group())
        for m in inner:
            if m.group(1) is not None:  # unescaped quote captured in $1
                # copy everything up to the unescaped quote, then insert a backslash before it
                result += test[last_index:match.start() + m.start()] + '\\' + m.group()
                last_index = match.start() + m.end()
result += test[last_index:]  # append the untouched remainder
print(result)

for test_str in result.split("\n"):
    List = ast.literal_eval(test_str)
    print(List)
    print(type(List))

We have a test string that contains several lists, one per line, written as string literals with all sorts of special cases: different and same types of quotes, escaped quotes, escaped escape sequences, etc. The test string is cleansed using the solution outlined above and split into lines, and each line is then converted to a list using literal_eval.

This is probably one of the most advanced regex solutions I have crafted so far.
