3

Here is a Regex I have been trying to make work:

(?:"[^"]*"[^"]*)*?\"{1}([^"]*?([']{1,})[^"]*?)\"

It is probably not the cleanest or the most efficient way to achieve what I want, but I think I'm almost there.

My goal is to match any single quote (') that sits inside a double-quoted (") section, so there needs to be an odd number of double quotes in front of it. I know that for now it only matches the first group of single quotes; that's fine. I will eventually use this regex to replace the first occurrence, then iterate and apply it again to replace the others for as long as any remain.

Here are a few examples:

  • " This is a random sentence ' with a quote, the quote should match"
  • " There is no quote here thats the problem" Anything here should not match but now it does: ' .
  • " Some text " some more text " this is a quote : ' that should match"
  • " When there is a quote (') here, the other one does not work : " ' and that's perfect " even if you remove the first one this : " (') " will make it work because of the greedy ( I think ) but ifyou remove those between parenthesis, the other one is matching as of now, which I do not want to happen.
  • Another example would be this one : The following should not work, but it does "This is being quoted" not this: (') " and this is also being quoted "

Note that I really do not consider myself an expert, a few days ago I knew almost nothing except the classic [a-zA-Z0-9]... Any help is welcome, I may have overlooked something basic.

I have been working on it here: https://regex101.com/r/aE7iB8/1

Raphaël
  • I think you have an incorrect assumption, and that's that there needs to be an odd number of double quotes in front of it (depending on what text is allowed). What about the counter-example `'"""\'"'`? Instead of using regex, you should definitely be using a stack. – mbomb007 Mar 23 '16 at 19:30
  • I am not sure I understand what you mean. Wouldn't the second ' be matched if we check whether there is an odd number of " in front of it? – Raphaël Mar 23 '16 at 19:48
  • It all depends if you allow nested quotes. Either way, regex is the wrong tool for the job. – mbomb007 Mar 23 '16 at 19:48
  • If I add another double quote in front of it, it then wouldn't match, which is what I am aiming for: "this is some quoted text" this is not " this is ". Maybe I don't understand what you mean. – Raphaël Mar 23 '16 at 19:50
  • Ah, no, there won't be nested quotes – Raphaël Mar 23 '16 at 19:54

2 Answers

2

Well, here is a regex that works on all your samples, but it's a bit longer and not exactly readable. I hope I got all the escapes right for the Java pattern.

(?:(?:^|\\G(?<!^)[^'\"]*\")[^\"]*+(?:\"[^\"']*\"[^\"]*)*+\"|\\G(?<!^))[^'\"]*+(')

This makes use of the \G anchor, which matches at the end of the previous match, and of possessive quantifiers to avoid unnecessary backtracking.

Let's start at the end: [^'\"]*+(') matches any run of characters that are neither single nor double quotes, followed by a single quote, which is captured in a group.

\\G(?<!^) matches at the end of the last match (the (?<!^) ensures we are not at the start of the string, since that is where \G matches on the first run, before anything has been matched). So this branch simply checks whether there is another single quote inside the same pair of double quotes as the last match.

(?:^|\\G(?<!^)[^'\"]*\")[^\"]*+(?:\"[^\"']*\"[^\"]*)*+\" is used to jump over all sequences that are either outside double quotes or inside double quotes without a single quote. ^|\\G(?<!^)[^'\"]*\" matches either the start of the string (first match) or everything up to the closing double quote of our last match, if there is no other single quote inside it. [^\"]*+ then matches anything that's not a double quote. (?:\"[^\"']*\"[^\"]*)*+\" then matches any double-quoted sections that don't contain a single quote, together with the text outside double quotes that follows each of them, until we reach the double quote that opens the section in which we look for the single quote.

But I guess a demo shows it way better than I can explain, so here you are: https://regex101.com/r/tW5xH4/1
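For anyone who wants to try this from Java rather than regex101, here is a minimal sketch of how the pattern above might be driven with Matcher.find(). The class and method names are made up for illustration, and the pattern is written with every double quote escaped so that it is a valid Java string literal. It only collects the index of each captured single quote; what to do with those positions is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
final class QuotedApostropheFinder {
    // The pattern from this answer as a Java string literal:
    // backslashes are doubled and every double quote is escaped.
    private static final Pattern QUOTED_APOSTROPHE = Pattern.compile(
        "(?:(?:^|\\G(?<!^)[^'\"]*\")[^\"]*+(?:\"[^\"']*\"[^\"]*)*+\"|\\G(?<!^))[^'\"]*+(')");

    /** Returns the index of every single quote captured by group 1. */
    static List<Integer> findQuotedApostrophes(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = QUOTED_APOSTROPHE.matcher(text);
        // Successive find() calls matter here: \G anchors each attempt
        // to the end of the previous match.
        while (matcher.find()) {
            positions.add(matcher.start(1));
        }
        return positions;
    }

    public static void main(String[] args) {
        // Two single quotes inside the double quotes, one outside.
        System.out.println(findQuotedApostrophes("\"a'b'c\" d'e"));
        // Only single quote is outside the double quotes.
        System.out.println(findQuotedApostrophes("\"no quote\" outside ' here"));
    }
}
```

Because the pattern depends on \G, it should be run over the whole input with repeated find() calls on a single Matcher, not applied to substrings.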

Sebastian Proske
1

If you are planning to iterate anyway, I would consider first iterating to grab everything inside double quotes, using this regular expression:

"(.*?)"

This does a non-greedy match: for each opening quotation mark it captures everything up to the first closing quotation mark that follows.

(see other ways to grab things between quotation marks here: RegEx: Grabbing values between quotation marks)

Once you have all the strings inside pairs of double quotes, it will be trivial to match any single quote inside these strings.
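A rough Java sketch of that two-step idea, assuming the goal is to replace each single quote found inside a double-quoted section. The class name, method name and the replacement parameter are placeholders of mine, not something from the question. Note that `.` does not match line breaks by default, so this version assumes each quoted section sits on one line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class illustrating the two-step approach.
final class QuotedSegmentReplacer {
    // Step 1: find each double-quoted section (non-greedy).
    private static final Pattern DOUBLE_QUOTED = Pattern.compile("\"(.*?)\"");

    /** Replaces every single quote inside double-quoted sections. */
    static String replaceQuotedApostrophes(String text, String replacement) {
        Matcher matcher = DOUBLE_QUOTED.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Step 2: plain string replacement inside the matched section.
            String rewritten = matcher.group().replace("'", replacement);
            // quoteReplacement keeps '$' and '\' in the text literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(rewritten));
        }
        // Text after the last quoted section is copied unchanged.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceQuotedApostrophes("\"it's\" and 'this'", "_"));
    }
}
```

Everything outside the matched sections is copied through untouched, which is why single quotes between two quoted sections are left alone.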

shaneb
  • I am currently working on doing that in Java, but it'd still be nice to do and/or know how to do it using a regex ! – Raphaël Mar 23 '16 at 18:43
  • 1
    Good point to be made here about what a regex is good for, because in a lot of cases it makes more sense to use your program to do the processing. A regex is not (practically) a programming language, it's one of (hopefully) many tools provided by a programming language. – miken32 Mar 23 '16 at 20:17
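To illustrate the point about letting the program do the processing, here is a minimal sketch without any regex. It assumes, as stated in the question's comments, that quotes are never nested, and it applies the "odd number of double quotes in front" rule literally, so a single quote after an unclosed double quote counts as quoted. The class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class: a plain character scan instead of a regex.
final class QuoteScanner {
    /** Returns the index of every single quote preceded by an odd number of double quotes. */
    static List<Integer> quotedApostrophePositions(String text) {
        List<Integer> positions = new ArrayList<>();
        boolean insideDoubleQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // Each double quote flips between "inside" and "outside".
                insideDoubleQuotes = !insideDoubleQuotes;
            } else if (c == '\'' && insideDoubleQuotes) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(quotedApostrophePositions("\"a'b'c\" d'e"));
    }
}
```

A single boolean is enough here because nesting is ruled out; supporting nested or escaped quotes would need more state.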