I have a CSV string in which some items may be enclosed in {} and contain commas. I want to collect the string values in a list.

What is the most pythonic way to collect the values in a list?

Example 1: 'a,b,c', expected output ['a', 'b', 'c']

Example 2: '{aa,ab}, b, c', expected output ['{aa,ab}', 'b', 'c']

Example 3: '{aa,ab}, {bb,b}, c', expected output ['{aa,ab}', '{bb,b}', 'c']

I have tried s.split(','). It works for example 1 but breaks examples 2 and 3, because it also splits on the commas inside the braces.

I believe that this question (How to split but ignore separators in quoted strings, in python?) is very similar to my problem, but I can't figure out the proper regex syntax to use.

rph

4 Answers

In fact, the solution is very similar to the one in the linked question:

import re
PATTERN = re.compile(r'''\s*((?:[^,{]|\{[^{]*\})+)\s*''')
data = '{aa,ab}, {bb,b}, c'
print(PATTERN.split(data)[1::2])

will give:

['{aa,ab}', '{bb,b}', 'c']
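
To see why the [1::2] slice is needed, it may help to look at the full list that split() returns. Because the pattern contains a capturing group, re.split() interleaves the text between matches (here the commas and empty strings) with the captured items. The sketch below only reuses the pattern above; the variable name parts is an illustration.

```python
import re

PATTERN = re.compile(r'''\s*((?:[^,{]|\{[^{]*\})+)\s*''')
data = '{aa,ab}, {bb,b}, c'

# With a capturing group, split() returns separators and captures alternately:
# even indices hold the text between matches, odd indices hold the items.
parts = PATTERN.split(data)
print(parts)        # ['', '{aa,ab}', ',', '{bb,b}', ',', 'c', '']
print(parts[1::2])  # ['{aa,ab}', '{bb,b}', 'c']
```

Taking every second element starting at index 1 therefore keeps only the captured items.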
Marco Pantaleoni

A more readable way (at least to me) is to describe what you are looking for: either something between braces { } or a run of word characters (\w+, i.e. letters, digits and underscores):

import re 

examples = [
  'a,b,c',
  '{aa,ab}, b, c',
  '{aa,ab}, {bb,b}, c'
]

for example in examples:
  print(re.findall(r'(\{.+?\}|\w+)', example))

It prints

['a', 'b', 'c']
['{aa,ab}', 'b', 'c']
['{aa,ab}', '{bb,b}', 'c']
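
One caveat: \w+ only matches letters, digits and underscores, so an item such as foo-bar or 3.14 would be broken into pieces. If the data can contain such items, a broader pattern along these lines might work. This is a sketch under that assumption, and the name ITEM and the sample string are made up for illustration.

```python
import re

# Hypothetical broader pattern: either a {...} group, or a run of characters
# that are not commas, braces or whitespace (inner whitespace is allowed).
ITEM = re.compile(r'\{[^{}]*\}|[^,{}\s]+(?:\s+[^,{}\s]+)*')

sample = 'foo-bar, {x,y}, 3.14'
print(re.findall(r'(\{.+?\}|\w+)', sample))  # ['foo', 'bar', '{x,y}', '3', '14']
print(ITEM.findall(sample))                  # ['foo-bar', '{x,y}', '3.14']
```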
Guybrush

Note that a regex is not necessary; you can do this in plain Python:

s = '{aa,ab}, {bb,b}, c'
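# Indices of the commas that are outside any {...} group
commas = [i for i, c in enumerate(s)
          if c == ',' and s[:i].count('{') == s[:i].count('}')]
# Slice between consecutive top-level commas; a + 2 skips the ', ' separator,
# so this assumes every comma is followed by exactly one space
[s[a + 2:b] for a, b in zip([-2] + commas, commas + [None])]
# ['{aa,ab}', '{bb,b}', 'c']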
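
The slicing above assumes each separator is a comma followed by one space, and the brace counts are recomputed for every comma. A possible single-pass variant keeps a running brace depth instead and strips whitespace around each item. This is a sketch rather than part of the answer above, and the function name split_top_level is hypothetical.

```python
def split_top_level(s):
    """Split s on commas that are not inside {...}, stripping whitespace."""
    items, current, depth = [], [], 0
    for ch in s:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
        if ch == ',' and depth == 0:
            # Top-level comma: finish the current item
            items.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    items.append(''.join(current).strip())
    return items

for example in ['a,b,c', '{aa,ab}, b, c', '{aa,ab}, {bb,b}, c']:
    print(split_top_level(example))
```

Because it does not depend on the spacing after commas, this version also handles the first example, 'a,b,c'.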
Joe Iddon

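A simpler pure-Python approach, which assumes every { and } has first been replaced with a double quote ("). Note that the quote characters are consumed, so the delimiters do not appear in the results: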

def parseCSV(string):
    results = []
    current = ''
    quoted = False   # currently inside a "..." field
    quoting = False  # just saw a '"' inside a quoted field

    for currentletter in string:
        if currentletter == '"':
            if quoted:
                if quoting:
                    # Doubled quote ("") inside a quoted field: keep one '"'
                    current += currentletter
                    quoting = False
                else:
                    quoting = True
            else:
                # Opening quote of a field
                quoted = True
                quoting = False
        else:
            shouldCheck = False
            if quoted:
                if quoting:
                    # The previous '"' was the closing quote
                    quoted = False
                    quoting = False
                    shouldCheck = True
                else:
                    current += currentletter
            else:
                shouldCheck = True

            if shouldCheck:
                if currentletter == ',':
                    # Unquoted comma: end of the current field
                    results.append(current)
                    current = ''
                else:
                    current += currentletter

    results.append(current)
    return results
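
If swapping the braces for double quotes is acceptable, the standard library's csv module can do the same parsing. The sketch below assumes the items themselves never contain a double quote; the helper name parse_braced is hypothetical.

```python
import csv

def parse_braced(s):
    # Turn {...} into "..." so the csv module treats each group as a quoted field
    quoted = s.replace('{', '"').replace('}', '"')
    # skipinitialspace=True ignores the space after each comma, so a quote
    # that follows ', ' is still recognised as the start of a quoted field
    return next(csv.reader([quoted], skipinitialspace=True))

print(parse_braced('{aa,ab}, {bb,b}, c'))  # ['aa,ab', 'bb,b', 'c']
```

As with parseCSV above, the delimiters are dropped from the output, so the items would need their braces added back to match the expected output in the question exactly.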