I have a text file that I want to parse strings from. The strings are enclosed in single ('), double (") or triple single (''') quotes, all within the same file. The best result I have been able to get so far uses this:

((?<=["])(.*?)(?=["]))|((?<=['])(.*?)(?=[']))

This matches only single-line strings between single and double quotes. Please note that the strings enclosed in each type of quote can be either single- or multi-line, and that each type of string occurs several times within the file.
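To make the limitation concrete, here is a small sketch in Python (the choice of Python and the helper name `matches` are assumptions for illustration only; the service I am configuring only accepts the regex itself). It suggests two things: the lookarounds never consume the quotes, so the text between two separate strings on one line is also reported, and `.` stops at line breaks, so multi-line strings are missed.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"""((?<=["])(.*?)(?=["]))|((?<=['])(.*?)(?=[']))""")


def matches(text):
    """Return the full text of every match (group 0)."""
    return [m.group(0) for m in PATTERN.finditer(text)]


# A single-line string is found as expected.
print(matches('"abc"'))

# The gap between two strings is also reported, because the quotes
# are only looked at, never consumed.
print(matches('"a" x "b"'))

# "." does not cross the line break, so nothing is found here.
print(matches("'a\nb'"))
```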

Here's a sample string:

<thisisthefirststring
'''- This is the first line of text
- This is the second line of text
- This is the third line of text
'''
>

<thisisanotheroption
"Just a string between quotes"
>

<thisisalsopossible
'Single quotes

Multiple lines.

With blank lines in between
'
>

<lineBreaksDoubleQoutes
"This is the first sentence here

After the first sentence, comes the blank line, and then the second one."
>
  • Can you share an example string? – Paolo Aug 29 '18 at 15:30
  • Sorry, what's your question? Do you basically need to add the `'''` case in your regex? – sp00m Aug 29 '18 at 15:32
  • Also note that using a reluctant quantifier (`.*?`) is not efficient, use a negated character set instead. See [this answer](https://stackoverflow.com/a/52019534/3390419) or [this answer](https://stackoverflow.com/a/41269355/3390419) for an explanation. – Paolo Aug 29 '18 at 15:33
  • @UnbearableLightness: You can't use negated character class for `'''` like in `'''blah'blah'''` – Toto Aug 29 '18 at 17:04
  • That's true, however OP states the strings are contained in *either* single, double etc. Therefore that string would not be expected. – Paolo Aug 29 '18 at 17:07
  • @UnbearableLightness: `'''blah'blah'''` is a valid string `blah'blah` enclosed by triple single quote `'''` – Toto Aug 29 '18 at 17:12
  • Can't we wait for OP's examples? How can you tell what is valid and what is not? – Paolo Aug 29 '18 at 17:14
  • No example string, no language or app/tool tags (since the pattern syntax depends on them), no answers to questions in comments = close the question as *too broad*. Even if you make an effort to build a pattern and you try to explain your problem. – Casimir et Hippolyte Aug 29 '18 at 21:13
  • Good to have added examples. But can single quotes be found inside double quotes (i.e. `"blah ' blah"`)? Or double quotes inside single (i.e. `'blah " blah'`)? Or `'''blah ' blah '''`? Or escaped one `'blah\'blah'`? or any combination of them? – Toto Aug 30 '18 at 10:06
  • I think the possible "odd" options are double quotes within single quotes - `'blah " blah'` and having `'` or `"` between 3x single quotes, so `''' blah " blah ' blah '''` – Afterburner Aug 30 '18 at 13:21

5 Answers

Use this:

((?:'|"){1,3})([^'"]+)\1

The backreference \1 requires the closing quote to be identical to whatever the first group captured at the opening, which simplifies the pattern.

Also, to get only what is inside the quotes, use the 2nd group of the match.
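As a rough illustration of reading group 2, here is a minimal Python sketch (Python and the helper name `extract` are assumptions, not part of the original setup). It also reproduces the two edge cases raised in the comments below: mixed quote runs are accepted, and a triple-quoted string containing a single quote is not matched as a whole.

```python
import re

# Pattern from this answer: group 1 is the quote run, group 2 the content.
QUOTED = re.compile(r"""((?:'|"){1,3})([^'"]+)\1""")


def extract(text):
    """Return only the text between the quotes (group 2 of each match)."""
    return [m.group(2) for m in QUOTED.finditer(text)]


sample = "<a\n'''- one\n- two\n'''\n>\n<b\n\"Just a string between quotes\"\n>\n"
for body in extract(sample):
    print(repr(body))

# Edge cases from the comments:
# a mixed run of quotes is accepted as a delimiter ...
print(QUOTED.fullmatch("\"''blah\"''") is not None)
# ... and a quote inside a triple-quoted string breaks the whole match.
print(QUOTED.fullmatch("'''blah'blah'''") is None)
```

Because `[^'"]+` is a negated character class, it crosses line breaks without any extra flag, which is why the multi-line sample works here.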

Matheus Cuba
  • This is matching `"''blah"''` or `"""bla"""` or `'"blah'"` – Toto Aug 29 '18 at 16:58
  • This doesn't match `'''blah'blah'''` – Toto Aug 29 '18 at 17:05
  • Thank you for pointing it out, @Toto, you are right! But as UnbearableLightness commented, I will wait for OP before making any changes. – Matheus Cuba Aug 29 '18 at 18:29
  • This does work for the most part. However, it matches the quotes as well, while ideally I need just the strings between them. – Afterburner Aug 30 '18 at 09:44
  • Maybe this is where pure regex (without programming) reaches its limits. For more complex substitutions, I like to use: https://github.com/sl5net/SL5_preg_contentFinder/ – SL5net Aug 30 '18 at 10:33
  • I agree, would be much easier if anything but pure regex was an option. However, I am using this to configure a 3rd party service to do what I need and my only option is regex. – Afterburner Aug 30 '18 at 13:23

This regex: ('{3}|["']{1})([^'"][\s\S]+?)\1

does what you want.
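A quick way to try it outside the target tool is a short Python sketch like the one below (Python and the helper name `extract` are assumptions made for illustration). As far as I can tell, the pattern needs at least two characters between the quotes, because `[^'"]` and `[\s\S]+?` each consume at least one, so one-character strings appear to be mishandled.

```python
import re

# Pattern from this answer; [\s\S] matches any character, including newlines.
QUOTED = re.compile(r"""('{3}|["']{1})([^'"][\s\S]+?)\1""")


def extract(text):
    """Return the content between the quotes (group 2 of each match)."""
    return [m.group(2) for m in QUOTED.finditer(text)]


# A quote inside a triple-quoted string is kept as content.
print(extract("'''blah'blah'''"))

# Multi-line content works without extra flags.
print(extract("'Single quotes\n\nMultiple lines.\n'"))

# A one-character string is not matched on its own ...
print(extract('"a"'))
# ... and can cause the match to run on to the next quote.
print(extract("'a' and 'b'"))
```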

SL5net

Using Notepad++, you can use: ('''|'|")((?:(?!\1).)+)\1

Explanation:

('''|'|")           : group 1, all types of quote 
(                   : group 2
    (?:(?!\1).)+    : one or more characters, none of which starts the quote captured in group 1
)                   : end group 2
\1                  : back reference to group 1 (i.e. same quote as the beginning)
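The breakdown above can be checked with a short Python sketch (Python and the helper name `extract` are assumptions; the answer itself targets Notepad++). Here `re.DOTALL` is what lets `.` cross line breaks; in Notepad++ the ". matches newline" option should play the same role.

```python
import re

# Pattern from this answer; re.DOTALL lets "." match line breaks too.
QUOTED = re.compile(r"""('''|'|")((?:(?!\1).)+)\1""", re.DOTALL)


def extract(text):
    """Return the text between each pair of matching quotes (group 2)."""
    return [m.group(2) for m in QUOTED.finditer(text)]


sample = (
    "<first\n'''- line one\n- line two\n'''\n>\n"
    "<second\n\"Just a string between quotes\"\n>\n"
    "<third\n'Single quotes\n\nMultiple lines.\n'\n>\n"
)
for body in extract(sample):
    print(repr(body))

# Other quote characters inside the string are kept as content.
print(extract("''' blah \" blah ' blah '''"))
print(extract("'blah \" blah'"))
```

The negative lookahead is tested before every character, so the content stops only where the exact opening delimiter reappears, not at any quote.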

Here is a screen capture of the result.

Toto

Here's something that may work for you.

^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$

The pattern as written handles triple double quotes; to handle triple single quotes instead, replace each \"\"\" with '''. See it in action at regex101.com.
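For a quick check outside regex101, here is a hedged Python sketch that uses the pattern exactly as posted, so the third alternative is still triple double quotes (Python and the helper name `is_quoted_literal` are assumptions for illustration). Two things seem worth noting: the pattern is anchored with ^ and $, so it validates a whole input rather than searching inside a larger file, and the single- and double-quoted alternatives exclude `\n`, so only the triple-quoted form can span lines.

```python
import re

# The pattern from this answer, unchanged.
PATTERN = re.compile(
    r'''^(\"([^\"\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"|'([^'\n\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*'|\"\"\"((?!\"\"\")[^\\]|\\[abfnrtv?\"'\\0-7]|\\x[0-9a-fA-F])*\"\"\")$'''
)


def is_quoted_literal(text):
    """True if the whole text is one quoted string according to PATTERN."""
    return PATTERN.match(text) is not None


# Backslash-escaped quotes inside the string are accepted.
print(is_quoted_literal('"say \\"hi\\""'))

# A missing closing quote is rejected.
print(is_quoted_literal('"unterminated'))

# A line break is rejected inside "..." but accepted inside """...""".
print(is_quoted_literal('"a\nb"'))
print(is_quoted_literal('"""a\nb"""'))
```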

Alan Cabrera

Named Group Version

Referring to the quote group by name avoids renumbering problems when the pattern is embedded in a larger expression that has capture groups of its own.

Should work in most regex flavors:

(?<Qt>'''|'|")(.*?)\k<Qt>

.NET version (as written inside a C# verbatim string, where " is escaped as ""):

(?<Qt>'''|'|"")(.*?)\k<Qt>

Works as follows:

  1. '''|'|": Check first for ''', then ', and finally ". The order matters: ''' must be tried before ', otherwise ' would always match first.
  2. (?<Qt>'''|'|"): When one of them matches, store it in the group <Qt> for later use.
  3. (.*?): Capture a lazy match of zero or more characters, which means empty strings are returned too. To exclude empty strings, use a lazy match of one or more characters, .+?, instead.
  4. \k<Qt>: Match the value last stored in <Qt>.
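The steps above can be sketched in Python, with one caveat: Python's `re` module spells named groups as `(?P<Qt>...)` and the named backreference as `(?P=Qt)`, so the syntax below is adapted rather than copied. The helper name `extract` is an assumption, and `re.DOTALL` is added so that `.` can span line breaks, which the multi-line strings in the question require.

```python
import re

# Python spelling of the named-group pattern; DOTALL lets "." cross newlines.
QUOTED = re.compile(r"""(?P<Qt>'''|'|")(.*?)(?P=Qt)""", re.DOTALL)

# Variant from step 3 that skips empty strings.
NON_EMPTY = re.compile(r"""(?P<Qt>'''|'|")(.+?)(?P=Qt)""", re.DOTALL)


def extract(pattern, text):
    """Return the captured content (group 2) of every match."""
    return [m.group(2) for m in pattern.finditer(text)]


text = "a = '' b = \"x\""
print(extract(QUOTED, text))     # includes the empty string
print(extract(NON_EMPTY, text))  # empty string is skipped

# Triple quotes take priority, so the inner quote stays in the content.
print(extract(QUOTED, "'''a'b'''"))

# Multi-line content is captured thanks to DOTALL.
print(extract(QUOTED, "\"first\n\nsecond\""))
```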
Darin