I am new to regex and the re module in Python. I need to parse a string such as:

Hello world "Boston Red Sox", 'Pepperoni Pizza', 'Cheese Pizza's', beer

into a list such as:

['Hello', 'world', 'Boston Red Sox', 'Pepperoni Pizza', "Cheese Pizza's", 'beer']

I need to omit the outer quotes from the final list but preserve any ' and " inside the phrases, e.g. 'Cheese Pizza's' should become Cheese Pizza's.

I am aware of the post "Regex for splitting a string using space when not surrounded by single or double quotes"; however, I am having trouble translating its regex into a pattern that Python's re module can understand.

Thank you

NatB

1 Answer

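Use two regex passes: first convert the outer single quotes to double quotes, then extract the tokens:

import re
string = '''Hello world "Boston Red Sox", 'Pepperoni Pizza', 'Cheese Pizza's', beer'''
# Pass 1: turn a single quote at the start, after a comma, or before a comma into a double quote.
string = re.sub(r"(^|,\s*)'|'(?=\s*,)", r'\1"', string)
# Pass 2: match a quoted phrase or a bare word; at most one of the three groups is non-empty per match.
print([f"{x}{y}{z}" for x,y,z in re.findall(r"""'([^']*)'|"([^"]*)"|([^\s,]+)""", string)])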

See Python proof.

Results: ['Hello', 'world', 'Boston Red Sox', 'Pepperoni Pizza', "Cheese Pizza's", 'beer']

EXPLANATION (REGEX #1)

--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    ^                        the beginning of the string
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    ,                        ','
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    ,                        ','
--------------------------------------------------------------------------------
  )                        end of look-ahead
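
To see what regex #1 does on its own, the substitution can be run in isolation and the intermediate string printed. This is only a small sketch on the sample input from the question; the variable names `text` and `normalized` are chosen here for illustration.

```python
import re

# Same sample input as in the answer above.
text = '''Hello world "Boston Red Sox", 'Pepperoni Pizza', 'Cheese Pizza's', beer'''

# Regex #1 only: turn the outer single quotes into double quotes.
# When the second alternative matches, group 1 did not participate,
# so \1 expands to an empty string (Python 3.5+).
normalized = re.sub(r"(^|,\s*)'|'(?=\s*,)", r'\1"', text)
print(normalized)
# Hello world "Boston Red Sox", "Pepperoni Pizza", "Cheese Pizza's", beer
```

The apostrophe in Pizza's survives because it is neither preceded by a comma nor followed by one.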

EXPLANATION (REGEX #2)

--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^']*                    any character except: ''' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  '                        '\''
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  "                        '"'
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  "                        '"'
--------------------------------------------------------------------------------
 |                        OR
--------------------------------------------------------------------------------
  (                        group and capture to \3:
--------------------------------------------------------------------------------
    [^\s,]+                  any character except: whitespace (\n,
                             \r, \t, \f, and " "), ',' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \3
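
The three capture groups explain why the answer joins x, y and z: re.findall returns one 3-tuple per match, and only the group belonging to the alternative that matched can be non-empty. A minimal sketch of this, with both passes wrapped in a function; `split_phrases` and `TOKEN` are hypothetical names introduced here, not part of the original answer.

```python
import re

# Regex #2, compiled once: single-quoted phrase, double-quoted phrase, or bare word.
TOKEN = re.compile(r"""'([^']*)'|"([^"]*)"|([^\s,]+)""")

def split_phrases(text):
    """Hypothetical helper: the answer's two passes wrapped in one function."""
    # Pass 1 (regex #1): normalize outer single quotes to double quotes.
    normalized = re.sub(r"(^|,\s*)'|'(?=\s*,)", r'\1"', text)
    # Pass 2 (regex #2): findall yields 3-tuples in which the groups that
    # did not take part are empty strings, so joining a tuple gives the token.
    return ["".join(groups) for groups in TOKEN.findall(normalized)]

# Raw findall output for a short input, to show the tuple shape.
print(TOKEN.findall('"Boston Red Sox", beer'))
# [('', 'Boston Red Sox', ''), ('', '', 'beer')]

sample = '''Hello world "Boston Red Sox", 'Pepperoni Pizza', 'Cheese Pizza's', beer'''
print(split_phrases(sample))
# ['Hello', 'world', 'Boston Red Sox', 'Pepperoni Pizza', "Cheese Pizza's", 'beer']
```

One caveat: regex #1 only rewrites a closing single quote that is followed by a comma, so a single-quoted phrase at the very end of the input would not be normalized by this approach as written.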
Ryszard Czech