I think using a single-character split is not a great strategy when you have sub-lists that can contain the character you're splitting on.
There are three main ways you can approach this (that I've thought of) . . . well, two ways and an alternative:
Option 1: stick with split(',') and re-join sub-arrays.
This is reasonably brittle, lengthy, and inferior to the second approach. I'm putting it first because it directly answers the question, not because it's what you should do:
line="0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law'],False"
# Index of the left hand side of any found sub-arrays.
left = 0
# Iterator position, also used as the index of the right hand side of any found sub-arrays.
right = 0
array = line.split(',')
while right < len(array):
    if array[right].startswith('['):
        array[right] = array[right][1:] # Remove the bracket
        left = right
    if array[right].endswith(']'):
        array[right] = array[right][:-1] # Remove the bracket
        # Pull the stuff between brackets out into a sub-array, and then
        # replace that segment of the original array with a single element
        # which is the sub-array.
        array[left:right+1] = [array[left:right+1]]
        # Preserve the "leading search position", since we just changed
        # the size of the array.
        right = left
    right += 1
print(array)
As you can see, that code is much less legible than a comprehension. It's also complex; it probably has bugs and edge cases I did not test.
This will only work with a single level of nested sub-arrays.
Option 2: Regex
Despite what xkcd says about regex, in this case it is a much clearer and simpler solution to extracting sub-arrays. More information on how to use regex can be found in the documentation for the re module. Online regex testers are also readily available, and are a great help when debugging regular expressions.
import re
line="0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law'],False"
r = re.compile(r'(?:\[(?P<nested>.*?)\]|(?P<flat>[^,]+?)),')
array = []
# For each matched part of the line, figure out if we matched a
# sub-array (in which case, split it on comma and add the resulting
# array to the final list) or a normal item (just add it to the final
# list).
# We append a comma to the string we search so our regex always matches
# the last element.
for match in r.finditer(line + ","):
    if match.group('nested'): # It's a sub-array
        array.append(match.group('nested').split(","))
    else: # It's a normal top-level element
        array.append(match.group('flat'))
print(array)
The regex says, roughly:
- Start a non-capturing group (?:) that wraps the two sub-patterns. Just like parentheses forcing the order of operations in a math formula, this makes it explicit that the trailing comma at the end of this regex is not part of either capturing group. It's not strictly necessary, but makes things clearer.
- Match one of two groups. The first group is some characters between a pair of square brackets; commas inside the brackets are left alone at this stage and only split on later. The match should be done lazily (stop as soon as a closing bracket is seen; that's the ?), and anything in the match should be made available to the regex API with the name "nested". The name is totally optional; numeric group indexes on the match object could be used just as well, but this is more explicit for code readers.
- The second group that could be matched is some characters that do not contain a comma ([^,]). Depending on the eagerness of the regex engine, you could potentially replace this with "any character", and trust the comma outside of the outer non-capturing ?: group would prevent these matches from running away, but saying "not comma" is more explicit for readers. Anything that matches this group should be stored with the name "flat".
- Lastly, look for a comma following occurrences of either of those patterns. Since the last element in the array isn't followed by a comma, I just kludge and match against the line plus one additional comma rather than further complicate the regex.
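If the one-line pattern is hard to follow, here is a sketch of the same regex written with re.VERBOSE so each piece from the list above gets its own comment. The sample string is made up for illustration, and I've already appended the trailing comma to it.

```python
import re

# Same pattern as the one-liner, spread out so each part can be commented.
pattern = re.compile(r"""
    (?:                      # non-capturing wrapper around the two alternatives
        \[(?P<nested>.*?)\]  # lazily match everything between square brackets
        |                    # ...or...
        (?P<flat>[^,]+?)     # one or more characters that are not a comma
    )
    ,                        # the separator that follows every element
""", re.VERBOSE)

sample = "a,[b,c],d,"  # hypothetical input, trailing comma already added
for match in pattern.finditer(sample):
    # Exactly one of the two named groups is set for each match.
    print(match.groupdict())
```

For every match, one of "nested" or "flat" holds the text and the other is None, which is what the loop in the main example relies on.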
Once the regex is understood, the rest is simple: loop through each match, see if it was "flat" or "nested", and if it was nested, split it based on comma and add that as a sub-array to the result.
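One thing to be aware of: neither option strips the quotes or the space after a comma inside a sub-array, so the elements come out looking like "'Richard L. Abel'". If that matters, a small cleanup pass might look like the sketch below. The clean helper is my own name for illustration, and the parsed list is just the first few elements of the kind of result the code above produces.

```python
# Shape of the result produced by the parsing code above (shortened).
parsed = ['0520085477',
          ["'Richard L. Abel'", " 'Philip Simon Coleman Lewis'"],
          'Lawyers in Society']

def clean(item):
    """Hypothetical helper: trim whitespace and surrounding single quotes."""
    if isinstance(item, list):
        return [clean(sub) for sub in item]
    return item.strip().strip("'")

print(clean(parsed))
# ['0520085477', ['Richard L. Abel', 'Philip Simon Coleman Lewis'], 'Lawyers in Society']
```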
This will not work with more than a single level of nested sub-arrays, and will break/do unexpected things if commas end up adjacent to each other or if a sub-array isn't "closed" (malformed input, basically), which brings me to . . .
Option 3: Use a structured data format
Both of those parsers are prone to errors. Elements in your arrays could contain special characters (e.g. what if a title like this had a square bracket as part of its name?), multiple commas could appear around fields that are "empty", you could need multiple levels of nested sub-arrays (you can make either of the first two options recursive, but the code will just get that much harder to read), or, perhaps most commonly, you could be handed input that's slightly broken/not compliant with what you expect, and have to parse it anyway.
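To make a few of those failure modes concrete, here is a sketch that wraps the Option 2 loop in a function (the parse name and the three inputs are mine, purely for illustration) and feeds it some awkward lines:

```python
import re

# Same regex as in Option 2.
r = re.compile(r'(?:\[(?P<nested>.*?)\]|(?P<flat>[^,]+?)),')

def parse(line):
    """Option 2's loop wrapped in a function (illustrative name)."""
    array = []
    for match in r.finditer(line + ","):
        if match.group('nested'):
            array.append(match.group('nested').split(","))
        else:
            array.append(match.group('flat'))
    return array

print(parse("a,,b"))          # ['a', 'b'] - the empty field silently disappears
print(parse("a,[b,c"))        # ['a', '[b', 'c'] - unclosed bracket becomes plain text
print(parse("[Untitled],x"))  # [['Untitled'], 'x'] - a bracketed title turns into a list
```

None of these raise an error, which is arguably the worst part: the parser quietly returns something plausible-looking.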
Dealing with all of those issues can be accomplished with more code, but that code typically makes the parsing system less reliable, not more.
Instead, consider switching your data interchange format to something like JSON. The line you supplied is already nearly valid JSON (it would need double-quoted strings, enclosing brackets, and a lowercase false), so you might be able to just use the json Python module directly and have things "just work" without needing to write a single line of parsing code. There are many other options for structured data parsing, including YAML and TOML. Anything you choose in that area will likely be more robust than rolling parsing logic by hand.
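As a rough sketch of what that could look like, assuming you control whatever produces the data and can have it emit JSON (the JSON rendering of your line below is my own guess at how it would be written):

```python
import json

# Hypothetical JSON version of the sample line: double-quoted strings,
# enclosing brackets, and a lowercase false.
line = '["0520085477", ["Richard L. Abel", "Philip Simon Coleman Lewis"], "Lawyers in Society", ["Law"], false]'

record = json.loads(line)
print(record)
# ['0520085477', ['Richard L. Abel', 'Philip Simon Coleman Lewis'], 'Lawyers in Society', ['Law'], False]

# Writing the data back out is just as short.
print(json.dumps(record))
```

Nested lists, commas inside strings, and the boolean all come through correctly without any hand-written parsing, and malformed input raises json.JSONDecodeError instead of quietly producing the wrong result.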
Of course, if this is for fun/education and you want to make something from scratch, code away! Parsers are an excellent educational project, since there are a lot of corner cases, but each corner case tends to be discrete/interact only minimally with other weird cases.