How to prevent Python strings being split() if the delimiter is surrounded by brackets

Question

I am working on a bit of code that loops through a txt file and creates a list containing individual lines. I need specific content from each line, where a comma is used as a delimiter. However, I run into an issue when there is a comma in one of the list items. The list comprehension line separates this single item into two items. The item, author, is enclosed in brackets. Can I have the list comprehension overlook items contained in brackets perhaps?

    inventory = open("inventory.txt").readlines()
    seperated_inventory = [x.split(",") for x in inventory]
    isbn_list = [item[0] for item in seperated_inventory]
    author_list = [item[1] for item in seperated_inventory]
    title_list = [item[2] for item in seperated_inventory]
    category_list = [item[3] for item in seperated_inventory]
    active_list = [item[4] for item in seperated_inventory]

Example of a line with two authors:

0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False

Possible duplicate of [regex, extract string NOT between two brackets](https://stackoverflow.com/questions/19414193/regex-extract-string-not-between-two-brackets) — supersam654, Jan 10 '18 at 03:14
That may well be a duplicate. I interpreted this question as asking about how to *use simple delimited parsing (not regex) to exclude certain elements including parts of the delimiter*, or to *recursively parse* (at least to depth 1) sub-arrays embedded in an outer data structure. However, that interpretation may be incorrect, in which case this is indeed a dupe. — Zac B, Jan 10 '18 at 03:42

score 1 · Answer 1 · answered Jan 10 '18 at 03:40

I think using a single-character split is not a great strategy when you have sub-lists that can contain the character you're splitting on.
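To make the problem concrete, here is a small illustration using the sample line from the question. The variable name `naive_fields` is just for this demonstration:

```python
line = "0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False"

# A plain split cuts at every comma, including the one inside the author list.
naive_fields = line.split(",")

print(len(naive_fields))  # 6 pieces, although the line only has 5 fields
print(naive_fields[1])    # first half of the author list
print(naive_fields[2])    # second half, with a leading space
```

Every index after the author field is shifted by one, which is why `item[2]` no longer holds the title.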

There are three main ways you can approach this (that I've thought of) . . . well, two ways and an alternative:

Option 1: stick with `split(',')` and re-join sub-arrays.

This is fairly brittle, lengthy, and inferior to the second approach. I'm putting it first because it directly answers the question, not because it's what you should do:

    line = "0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False"

    # Index of the left hand side of any found sub-arrays.
    left = 0
    # Iterator position, also used as the index of the right hand side of any found sub-arrays.
    right = 0
    array = line.split(',')
    while right < len(array):
        if array[right].startswith('['):
            array[right] = array[right][1:] # Remove the bracket
            left = right

        if array[right].endswith(']'):
            array[right] = array[right][:-1] # Remove the bracket
            # Pull the stuff between brackets out into a sub-array, and then
            # replace that segment of the original array with a single element
            # which is the sub-array.
            array[left:right+1] = [array[left:right+1]]
            # Preserve the "leading search position", since we just changed
            # the size of the array.
            right = left
        right += 1

    print(array)

As you can see, that code is much less legible than a comprehension. It's also complex; it probably has bugs and edge cases I did not test.

This will only work with a single level of nested sub-arrays.
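If nesting depth is a concern and you still want to avoid regex, one alternative (not the re-joining method above, but a different technique) is to scan the line character by character and only split on commas seen while outside any brackets. This is a minimal sketch; the function name `split_outside_brackets` is hypothetical, and it assumes brackets never appear as ordinary characters inside a field:

```python
def split_outside_brackets(line, delimiter=","):
    """Split on delimiter, ignoring delimiters inside square brackets."""
    parts = []
    current = []
    depth = 0  # number of '[' seen that have not been closed yet
    for char in line:
        if char == "[":
            depth += 1
        elif char == "]":
            depth = max(depth - 1, 0)  # tolerate a stray ']'
        if char == delimiter and depth == 0:
            # A top-level delimiter ends the current field.
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))  # the last field has no trailing delimiter
    return parts


line = "0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False"
fields = split_outside_brackets(line)
print(len(fields))
print(fields[1])
```

Unlike the loop above, this leaves each bracketed field as a single string (brackets included), so the sub-items would still need to be split in a second step.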

Option 2: Regex

Despite what xkcd says about regex, in this case it is a much clearer and simpler solution to extracting sub-arrays. More information on how to use regex can be found in the documentation for the re module. Online regex testers are also readily available, and are a great help when debugging regular expressions.

    import re

    line = "0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False"

    r = re.compile(r'(?:\[(?P<nested>.*?)\]|(?P<flat>[^,]+?)),')
    array = []
    # For each matched part of the line, figure out if we matched a
    # sub-array (in which case, split it on comma and add the resulting
    # array to the final list) or a normal item (just add it to the final
    # list).

    # We append a comma to the string we search so our regex always matches
    # the last element.
    for match in r.finditer(line + ","):
        # Compare against None rather than relying on truthiness, so that an
        # empty pair of brackets ("[]") is still treated as a sub-array.
        if match.group('nested') is not None: # It's a sub-array
            array.append(match.group('nested').split(","))
        else: # It's a normal top-level element
            array.append(match.group('flat'))

    print(array)

The regex says, roughly:

Start a non-capturing group (?:) that wraps the two sub-patterns. Just like parentheses forcing the order of operations in a math formula, this makes it explicit that the trailing comma at the end of this regex is not part of either capturing group. It's not strictly necessary, but makes things clearer.
Match one of two groups. The first group is any run of characters between a pair of square brackets, commas included, so no splitting happens inside the brackets. The match should be done lazily (stop as soon as a closing bracket is seen; that's the ?), and anything in the match should be made available to the regex API with the name "nested". The name is totally optional; array indexes on the match object could be used just as well, but this is more explicit for code readers.
The second group that could be matched is some characters that do not contain a comma ([^,]). Depending on the eagerness of the regex engine, you could potentially replace this with "any character", and trust the comma outside of the outer non-capturing ?: group would prevent these matches from running away, but saying "not comma" is more explicit for readers. Anything that matches this group should be stored with the name "flat".
Lastly, look for a comma following occurrences of either of those patterns. Since the last element in the array isn't followed by a comma, I just kludge and match against the line plus one additional comma rather than further complicate the regex.

Once the regex is understood, the rest is simple: loop through each match, see if it was "flat" or "nested", and if it was nested, split it based on comma and add that as a sub-array to the result.
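As a rough sketch of how that loop could be plugged back into the list comprehensions from the question, the version below wraps it in a helper. The name `parse_line` and the in-memory `lines` list (standing in for the result of `readlines()`) are assumptions for illustration. It also trims the quote characters and spaces that the plain `split(",")` leaves around each sub-item:

```python
import re

FIELD_RE = re.compile(r'(?:\[(?P<nested>.*?)\]|(?P<flat>[^,]+?)),')


def parse_line(line):
    """Hypothetical helper: split one inventory line into its fields."""
    fields = []
    # strip() drops the trailing newline that readlines() leaves on each line.
    for match in FIELD_RE.finditer(line.strip() + ","):
        nested = match.group("nested")
        if nested is not None:
            # Trim spaces and straight/curly quotes around each sub-item.
            fields.append([item.strip(" '’") for item in nested.split(",")])
        else:
            fields.append(match.group("flat"))
    return fields


# Stand-in for open("inventory.txt").readlines()
lines = [
    "0520085477,['Richard L. Abel', 'Philip Simon Coleman Lewis'],Lawyers in Society,['Law’],False\n",
]
parsed = [parse_line(x) for x in lines]
isbn_list = [item[0] for item in parsed]
author_list = [item[1] for item in parsed]
print(author_list)

# Caveat: an empty field between two commas is silently skipped,
# which shifts every later index by one.
print(parse_line("a,,b"))
```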

This will not work with more than a single level of nested sub-arrays, and will break/do unexpected things if commas end up adjacent to each other or if a sub-array isn't "closed" (malformed input, basically), which brings me to . . .

Option 3: Use a structured data format

Both of those parsers are prone to errors. Elements in your arrays could contain special characters (e.g. what if a title had a square bracket as part of its name?), multiple commas could appear around fields that are "empty", you could need multiple levels of nested sub-arrays (you can make either of the first two options recursive, but the code will just get that much harder to read), or, perhaps most commonly, you could be handed input that's slightly broken/not compliant with what you expect, and have to parse it anyway.

Dealing with all of those issues can be accomplished with more code, but that code typically makes the parsing system less reliable, not more.

Instead, consider switching your data interchange format to something like JSON. The line you supplied is already close to valid JSON (it would mainly need double-quoted strings, a lowercase false, and an enclosing pair of brackets), so you might be able to use Python's json module directly and have things "just work" without writing a single line of parsing code. There are many other options for structured data parsing, including YAML and TOML. Anything you choose in that area will likely be more robust than rolling parsing logic by hand.
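For example, if each line of the file were stored as a JSON array, parsing could look roughly like this. The layout of `json_line` is an assumed re-encoding of the sample line, not the format of your current file. The ISBN is kept as a string because JSON numbers cannot have a leading zero:

```python
import json

# Assumed layout: one JSON array per line (a hypothetical re-encoding of the sample).
json_line = '["0520085477", ["Richard L. Abel", "Philip Simon Coleman Lewis"], "Lawyers in Society", ["Law"], false]'

# json.loads handles the nested lists and the boolean; no manual splitting needed.
isbn, authors, title, categories, active = json.loads(json_line)
print(authors)
print(active)

# Writing a record back out is symmetrical.
print(json.dumps([isbn, authors, title, categories, active]))
```

Commas, brackets, and quotes inside a title or author name are escaped and unescaped by the library, which is where the hand-written parsers above tend to fail.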

Of course, if this is for fun/education and you want to make something from scratch, code away! Parsers are an excellent educational project, since there are a lot of corner cases, but each corner case tends to be discrete/interact only minimally with other weird cases.

I've been pondering about this for a while and my first thought was Option 3 as well. However I wonder if it was a restriction for OP (possibly a vendor/3rd party created file) that they must parse through. — r.ook, Jan 10 '18 at 03:56
