Extracting substrings between single quotes

Question

I am new to Python and am trying to extract substrings between single quotes. How can I do this with a regex?

Example input:

 text = "[(u'apple',), (u'banana',)]"

I want to extract apple and banana as list items, like ['apple', 'banana'].

Why do you want to do this? This smells like an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). — Kevin, Mar 19 '15 at 19:00
Pre-emptive note to potential answerers: if you give a solution using regex, make sure that it works on tricky strings like `"[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"` — Kevin, Mar 19 '15 at 19:01
You can try a non greedy regex, `'.*?'` but this does not work with the conditions that Kevin has mentioned. However it works fine with the sample input you have provided — Bhargav Rao, Mar 19 '15 at 19:08

Wiktor Stribiżew · Answer 1 · 2018-04-18T17:23:53.053

In the general case, to extract any characters between single quotes, the most efficient regex approach is:

re.findall(r"'([^']*)'", text) # to also extract empty values
re.findall(r"'([^']+)'", text) # to only extract non-empty values

See the regex demo.

Details

' - a single quote (no need to escape inside a double quote string literal)
([^']*) - a capturing group that captures any 0+ (or 1+ if you use + quantifier) chars other than ' (the [^...] is a negated character class that matches any chars other than those specified in the class)
' - a closing single quote.
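
One caveat with the + variant is worth checking on your own data. The sketch below uses a made-up input that contains an empty quoted value. Because + cannot match the empty pair, the regex engine moves on and pairs the closing quote of '' with the opening quote of the next value. Filtering the * results (the filtered name is only illustrative) avoids that.

```python
import re

# Hypothetical input containing an empty quoted value.
text = "['', 'apple']"

# * accepts the empty pair, so every quote is paired correctly.
with_star = re.findall(r"'([^']*)'", text)

# + skips the empty pair and then pairs quotes that do not belong together.
with_plus = re.findall(r"'([^']+)'", text)

# Keep the * pattern and drop empty strings afterwards.
filtered = [s for s in with_star if s]

print(with_star)  # ['', 'apple']
print(with_plus)  # [', ']
print(filtered)   # ['apple']
```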

Note that re.findall only returns captured substrings if capturing groups are specified in the pattern:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
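
As a small illustration of the quoted behaviour, the sketch below runs three variants of the pattern on the sample input. The two-group pattern is invented for this example only and is not something the question needs.

```python
import re

text = "[(u'apple',), (u'banana',)]"

# No capturing group: findall returns the whole match, quotes included.
no_group = re.findall(r"'[^']*'", text)

# One capturing group: findall returns only the text captured by the group.
one_group = re.findall(r"'([^']*)'", text)

# Two capturing groups: findall returns a list of tuples, one item per group.
two_groups = re.findall(r"(u)'([^']*)'", text)

print(no_group)    # ["'apple'", "'banana'"]
print(one_group)   # ['apple', 'banana']
print(two_groups)  # [('u', 'apple'), ('u', 'banana')]
```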

Python demo:

import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"'([^']*)'", text))
# => ['apple', 'banana']

Escaped quote support

If you need to support escaped quotes (so as to match abc\'def in 'abc\'def'), you will need a regex like

re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # in case the text contains only "valid" pairs of quotes
re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # if your text is too messed up and there can be "wild" single quotes out there

See regex variation 1 and regex variation 2 demos.

Pattern details

(?<!\\) - a negative lookbehind that fails the match if there is a backslash immediately to the left of the current position
(?:\\\\)* - 0 or more consecutive double backslashes (since these are not escaping the neighboring character)
' - an open '
([^'\\]*(?:\\.[^'\\]*)*) - Group 1 (what will be returned by re.findall), matching:
- [^'\\]* - 0 or more chars other than ' and \
- (?: - start of a non-capturing group that matches
  - \\. - any escaped char (a backslash and any char including line breaks due to the re.DOTALL modifier)
  - [^'\\]* - 0 or more chars other than ' and \
- )* - ... zero or more times
' - a closing '.

See another Python demo:

import re
text = r"[(u'apple',), (u'banana',)] [(u'apple',), (u'banana',), (u'abc\'def',)] \\'abc''def' \\\'abc   'abc\\\\\'def'"
print(re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text))
# => apple, banana, apple, banana, abc\'def, abc, def, abc\\\\\'def
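
For a smaller side-by-side comparison, the sketch below runs the simple pattern and variation 1 on a short made-up input. The last step, which strips the escaping backslashes, is an assumption about what you may want to do with the matches. It is not part of the patterns above, and it does not interpret sequences such as \n as a newline.

```python
import re

# Raw string literal, so the backslash before the inner quote is part of the text.
text = r"[(u'abc\'def',), (u'plain',)]"

# Simple pattern: stops at the escaped quote, then pairs the wrong quotes.
naive = re.findall(r"'([^']*)'", text)

# Escape-aware pattern (variation 1): a backslash plus the next char is one unit.
escape_aware = re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL)

# Assumed post-processing: drop the backslash from each backslash+char pair.
unescaped = [re.sub(r"\\(.)", r"\1", s, flags=re.DOTALL) for s in escape_aware]

print(naive)         # ['abc\\', ',), (u']
print(escape_aware)  # ["abc\\'def", 'plain']
print(unescaped)     # ["abc'def", 'plain']
```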

score 3 · Accepted Answer · answered Mar 19 '15 at 19:22

import re

text = "[(u'apple',), (u'banana',)]"

print(re.findall(r"\(u'(.*?)',\)", text))
# => ['apple', 'banana']

text = "[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"
print(re.findall(r"\(u'(.*?)',\)", text)[0])
# => this string contains' an escaped quote mark and\ an escaped slash

score 2 · Answer 3 · answered Mar 19 '15 at 19:03

You may alternatively use ast.literal_eval, then extract the first item of each tuple with a list comprehension:

from ast import literal_eval

text = "[(u'apple',), (u'banana',)]"

literal_eval(text)
Out[3]: [(u'apple',), (u'banana',)]

[t[0] for t in literal_eval(text)]
Out[4]: [u'apple', u'banana']
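
If the input may not always be a valid Python literal, the same idea can be wrapped in a small helper. This is only a sketch: first_items is a hypothetical name, and returning an empty list on malformed input is an assumption about the desired behaviour. Note that on Python 3 the results print without the u prefix.

```python
from ast import literal_eval


def first_items(text):
    """Hypothetical helper: parse a list of 1-tuples and return each first item."""
    try:
        rows = literal_eval(text)
    except (ValueError, SyntaxError):
        # Assumed policy: treat anything that is not a valid literal as "no items".
        return []
    return [row[0] for row in rows]


print(first_items("[(u'apple',), (u'banana',)]"))
# Raw string, so literal_eval itself sees and decodes the escape sequences.
print(first_items(r"[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"))
print(first_items("not a literal"))
```

Because literal_eval applies Python's own string-literal rules, the escaped quote and escaped backslash from Kevin's test string are decoded without any custom regex.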
