I am new to Python and am trying to extract substrings between single quotes. Do you know how to do this with regex?
E.g., given the input
text = "[(u'apple',), (u'banana',)]"
I want to extract apple and banana as list items, like ['apple', 'banana'].
In the general case, to extract any chars between single quotes, the most efficient regex approach is:
re.findall(r"'([^']*)'", text) # to also extract empty values
re.findall(r"'([^']+)'", text) # to only extract non-empty values
Details
- ' - a single quote (no need to escape it inside a double-quoted string literal)
- ([^']*) - a capturing group that captures any 0+ (or 1+ if you use the + quantifier) chars other than ' (the [^...] is a negated character class that matches any char other than those specified in the class)
- ' - a closing single quote.
Note that re.findall only returns captured substrings if capturing groups are specified in the pattern:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
import re
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"'([^']*)'", text))
# => ['apple', 'banana']
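To make the re.findall behavior quoted above concrete, here is a small sketch comparing the same idea with zero, one, and two capturing groups. The two-group pattern (u?)'([^']*)' is only an illustration and is not part of the original solution.

```python
import re

text = "[(u'apple',), (u'banana',)]"

# No capturing group: the whole match is returned, quotes included.
no_group = re.findall(r"'[^']*'", text)

# One capturing group: only the group's content is returned.
one_group = re.findall(r"'([^']*)'", text)

# Two capturing groups (illustrative only): a list of tuples, one item per group.
two_groups = re.findall(r"(u?)'([^']*)'", text)

print(no_group)    # ["'apple'", "'banana'"]
print(one_group)   # ['apple', 'banana']
print(two_groups)  # [('u', 'apple'), ('u', 'banana')]
```

This is why the parentheses around [^']* matter: without them the quotes end up in the results.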
Escaped quote support
If you need to support escaped quotes (so as to match abc\'def in 'abc\'def'), you will need a regex like
re.findall(r"'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # in case the text contains only "valid" pairs of quotes
re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text, re.DOTALL) # if your text is too messed up and there can be "wild" single quotes out there
Pattern details
- (?<!\\) - a negative lookbehind that fails the match if there is a backslash immediately to the left of the current position
- (?:\\\\)* - 0 or more consecutive double backslashes (since these are not escaping the neighboring character)
- ' - an opening '
- ([^'\\]*(?:\\.[^'\\]*)*) - Group 1 (what will be returned by re.findall), matching...
  - [^'\\]* - 0 or more chars other than ' and \
  - (?: - start of a non-capturing group that matches
    - \\. - any escaped char (a backslash and any char, including line breaks due to the re.DOTALL modifier)
    - [^'\\]* - 0 or more chars other than ' and \
  - )* - ... zero or more times
- ' - a closing '.
See another Python demo:
import re
text = r"[(u'apple',), (u'banana',)] [(u'apple',), (u'banana',), (u'abc\'def',)] \\'abc''def' \\\'abc 'abc\\\\\'def'"
print(re.findall(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", text))
# => apple, banana, apple, banana, abc\'def, abc, def, abc\\\\\'def
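As a rough illustration of when the choice between the two escaped-quote patterns matters, the sketch below runs both on a well-formed string and on a string with a stray escaped quote. The names VALID_PAIRS, WILD_QUOTES, well_formed and messy are made up for this example, and the inputs are hypothetical.

```python
import re

# The two patterns from the answer, compiled once for reuse.
VALID_PAIRS = re.compile(r"'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)
WILD_QUOTES = re.compile(r"(?<!\\)(?:\\\\)*'([^'\\]*(?:\\.[^'\\]*)*)'", re.DOTALL)

well_formed = r"'abc\'def' 'x'"  # every quote is either paired or escaped inside a pair
messy = r"it\'s 'ok'"            # an escaped quote appears outside any quoted string

# On well-formed input both patterns return the same two items.
print(VALID_PAIRS.findall(well_formed) == WILD_QUOTES.findall(well_formed))  # True

# On messy input the simpler pattern starts matching at the escaped quote.
print(VALID_PAIRS.findall(messy))  # ['s ']
print(WILD_QUOTES.findall(messy))  # ['ok']
```

The lookbehind version skips the quote that follows a single backslash, so it only starts matching at an unescaped opening quote.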
text = "[(u'apple',), (u'banana',)]"
print(re.findall(r"\(u'(.*?)',\)", text))
['apple', 'banana']
text = "[(u'this string contains\' an escaped quote mark and\\ an escaped slash',)]"
print(re.findall(r"\(u'(.*?)',\)", text)[0])
this string contains' an escaped quote mark and\ an escaped slash
Alternatively, you may use ast.literal_eval and then extract the first item of each tuple with a list comprehension:
from ast import literal_eval
text = "[(u'apple',), (u'banana',)]"
literal_eval(text)
Out[3]: [(u'apple',), (u'banana',)]
[t[0] for t in literal_eval(text)]
Out[4]: [u'apple', u'banana']
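For reference, a minimal Python 3 sketch of the same idea, wrapped in a helper that tolerates input that is not a valid literal. The function name first_items is hypothetical, and returning an empty list on failure is just one possible choice. Note that Python 3 prints the strings without the u prefix.

```python
from ast import literal_eval

def first_items(text):
    """Parse a literal list of tuples and return each tuple's first item."""
    try:
        rows = literal_eval(text)
    except (ValueError, SyntaxError):
        # literal_eval raises these when the text is not a valid Python literal.
        return []
    return [row[0] for row in rows]

print(first_items("[(u'apple',), (u'banana',)]"))  # ['apple', 'banana']
print(first_items(r"[(u'abc\'def',)]"))            # ["abc'def"]
print(first_items("not a literal"))                # []
```

Because literal_eval parses the text as Python syntax, escaped quotes inside the strings are handled without any extra regex work.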