Does anyone know how to write a regular expression in Python to get everything in between quotation marks?

For example, text: "some text here".... text: "more text in here!"... text:"and some numbers - 2343- here too"

The texts are of different lengths, and some contain punctuation and numbers as well. How do I write a regular expression to extract all of the quoted text?

What I would like to see in the interpreter:

some text here more text in here and some numbers - 2343 - here too

3 Answers

This should work for you:

"(.*?)"

Placing a ? after the * makes it non-greedy: it matches as little as possible, so each match stops at the first closing quote instead of running on to the last quote mark in the string.

>>> import re
>>> r = r'"(.*?)"'
>>> s = 'text: "some text here".... text: "more text in here!"... text:"and some numbers - 2343- here too"'
>>> re.findall(r, s)
['some text here', 'more text in here!', 'and some numbers - 2343- here too']
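
To see why the ? matters, here is a small side-by-side sketch of the greedy and non-greedy versions on a shortened copy of the sample text. The variable names are only for illustration.

```python
import re

s = 'text: "some text here".... text: "more text in here!"'

# Greedy: .* runs on to the LAST quote in the string, so there is one big match.
greedy = re.findall(r'"(.*)"', s)

# Non-greedy: .*? stops at the first closing quote after each opening quote.
lazy = re.findall(r'"(.*?)"', s)

print(greedy)  # ['some text here".... text: "more text in here!']
print(lazy)    # ['some text here', 'more text in here!']
```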
Karl Barker

Try "[^"]*", that is, " followed by zero or more characters that aren't ", followed by ". So:

pat = re.compile(r'"[^"]*"')
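
One thing to watch out for: without a capturing group, findall returns the whole match, quote marks included. A minimal sketch of the difference, assuming you want the contents joined into a single line as in the question (the variable names are just for illustration):

```python
import re

s = 'text: "some text here".... text: "more text in here!"... text:"and some numbers - 2343- here too"'

# No group: findall returns the full matches, including the quotes.
with_quotes = re.findall(r'"[^"]*"', s)

# One capturing group: findall returns only what the group matched.
contents = re.findall(r'"([^"]*)"', s)

# Join the pieces with spaces to get a single line of text.
joined = " ".join(contents)
print(with_quotes)
print(joined)
```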
Pierce

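If the quoted substrings to be matched do NOT contain escaped characters, then Karl Barker's and Pierce's answers will both match correctly. However, of the two, Pierce's expression is more efficient: the negated character class [^"]* consumes the whole run of non-quote characters in one step, whereas the lazy .*? has to check for the closing quote after every character. Here it is with a capture group and comments added: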

reobj = re.compile(r"""
    # Match double quoted substring (no escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"]*             # Zero or more non-".
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.VERBOSE)

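One behavioural difference between the two simple patterns is worth knowing about: by default . does not match a newline, so the lazy version will not match a quoted substring that spans lines unless re.DOTALL is set, while [^"]* will. A small sketch with a made-up two-line string:

```python
import re

s = 'a "one\ntwo" b'

# . does not match "\n" by default, so the lazy pattern finds nothing here.
lazy_default = re.findall(r'"(.*?)"', s)

# With re.DOTALL, . matches newlines too.
lazy_dotall = re.findall(r'"(.*?)"', s, re.DOTALL)

# A negated character class matches newlines without any flag.
negated = re.findall(r'"([^"]*)"', s)

print(lazy_default)  # []
print(lazy_dotall)   # ['one\ntwo']
print(negated)       # ['one\ntwo']
```
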
But if the quoted substring to be matched DOES contain escaped characters (e.g. "She said: \"Hi\" to me.\n"), then you'll need a different expression:

reobj = re.compile(r"""
    # Match double quoted substring (allow escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"\\]*           # {normal} Zero or more non-", non-\.
      (?:               # Begin {(special normal*)*} construct.
        \\.             # {special} Escaped anything.
        [^"\\]*         # more {normal} Zero or more non-", non-\.
      )*                # End {(special normal*)*} construct.
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.DOTALL | re.VERBOSE)

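As a quick check, here is the same expression written on one line and run against a made-up string containing escaped quotes, next to the simple pattern for comparison. Note that the captured text still contains the backslashes; this sketch does not unescape anything.

```python
import re

# Same pattern as the verbose version above, compacted onto one line.
escaped_re = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)

# Raw string, so each \" below is a literal backslash followed by a quote.
s = r'He wrote: "She said: \"Hi\" to me." and then "bye".'

# The simple pattern treats every quote as a delimiter and splits the text wrongly.
simple = re.findall(r'"([^"]*)"', s)

# The escape-aware pattern skips over backslash-escaped characters.
escaped = escaped_re.findall(s)

print(simple)   # ['She said: \\', ' to me.', 'bye']
print(escaped)  # ['She said: \\"Hi\\" to me.', 'bye']
```
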
There are several expressions I'm aware of that will do the trick, but the one above (taken from MRE3, i.e. Mastering Regular Expressions, 3rd Edition) is the most efficient of the bunch. See my answer to a similar question, where these functionally identical expressions are compared.

ridgerunner