2

I'm looking for the regex that will match all white space in a string except when it is between quotes.

For example, if I have the following string:

 abc  def  " gh i " jkl  " m n o p " qrst  
-   --   --        -   --           -    --

I want to match the spaces that have a dash under them. The dashes are not part of the string, only for illustration purposes.

Can this be done?

Graham

3 Answers

7

You could try the following regex, which is based on a positive lookahead.

\s(?=(?:"[^"]*"|[^"])*$)

or, to match only literal space characters,

 (?=(?:"[^"]*"|[^"])*$)


Explanation:

  • \s matches a single whitespace character

  • (?=(?:"[^"]*"|[^"])*$) but only if it is followed, up to the end of the string, by zero or more of:

    1. "[^"]*" an opening double quote, then [^"]* (zero or more characters that are not double quotes), then a closing double quote. This matches a complete double-quoted block such as "foo" or "ljilcjljfcl".

    2. | OR: if a quoted block does not match at the current position, the engine tries the alternative to the right of the |, which is [^"].

    3. [^"] matches any single character that is not a double quote.

Take foo "foo bar" buz as an example string.

foo "foo bar" buz             

\s first matches a whitespace character. The lookahead then checks that everything from that point to the end of the string consists of double-quoted blocks or non-quote characters ([^"]), repeated zero or more times. Take the first space: it is followed by the double-quoted string "foo bar", which "[^"]*" matches. The next character is a space, so "[^"]*" fails there and the engine switches to the alternative [^"], which matches that space. Because * applies to the whole group, [^"] goes on to match the remaining characters one at a time until $ matches at the end of the string. The condition is satisfied, so the first space is matched. For the space inside the quotes, the text that follows is bar" buz: [^"] matches bar, but the lone " can neither start a complete quoted block nor match [^"], so $ is never reached and that space is not matched.
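As a quick check, here is a minimal sketch that runs the first regex against the string from the question. It assumes Python's standard re module, which the asker did not mention, and the variable names are only illustrative.

```python
import re

# The lookahead regex from this answer: a whitespace character that is
# followed only by complete "..." blocks and non-quote characters.
pattern = re.compile(r'\s(?=(?:"[^"]*"|[^"])*$)')

# The example string from the question (leading space, two trailing spaces).
text = ' abc  def  " gh i " jkl  " m n o p " qrst  '

# Indexes of the matched whitespace; they line up with the dashes in the question.
positions = [m.start() for m in pattern.finditer(text)]
print(positions)  # [0, 4, 5, 9, 10, 19, 23, 24, 36, 41, 42]

# Removing the matches leaves the quoted blocks untouched.
stripped = pattern.sub('', text)
print(stripped)   # abcdef" gh i "jkl" m n o p "qrst
```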

Avinash Raj
  • Don't understand it, but it works - thank you. The first one does ALL white space which is what I need – Graham Jan 28 '15 at 14:11
4
[ ](?=(?:[^"]*"[^"]*")*[^"]*$)

Try this. See the demo:

https://regex101.com/r/pM9yO9/7

This matches any space that is followed by an even number of double quotes, that is, only by complete pairs of "..." and never by a lone ". The condition is enforced through the lookahead.
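A small sketch of how this pattern could be used to split on spaces outside quotes, assuming Python's re module (the thread does not say which language the asker uses). It also shows what happens with an unbalanced quote, where only the space after the lone " still matches.

```python
import re

# The regex from this answer: a literal space followed by an even number of quotes.
pattern = re.compile(r'[ ](?=(?:[^"]*"[^"]*")*[^"]*$)')

# Splitting on the matches keeps the quoted block in one piece.
parts = pattern.split('foo "foo bar" buz')
print(parts)  # ['foo', '"foo bar"', 'buz']

# With a lone quote, every space before it is followed by an odd number
# of quotes, so only the space after the quote matches.
orphan_positions = [m.start() for m in pattern.finditer('a b "c d')]
print(orphan_positions)  # [6]
```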

vks
  • 1
    @Graham forget it... loads of trespassers who do not understand and post a free downvote :P – vks Jan 28 '15 at 14:29
2

If your regex flavor is PCRE, you could use (*SKIP)(*F) to skip the quoted parts and match (or replace) one or more \s everywhere else:

"[^"]*"(*SKIP)(*F)|\s+

Test at regex101.com

Jonny 5
  • 1
    It is indeed far more efficient, and it deals with a possible orphan quote. – Casimir et Hippolyte Jan 28 '15 at 14:19
  • Don't know what PCRE is, but I clearly don't have it, as it is not recognized by my regex engine. – Graham Jan 28 '15 at 14:20
  • 1
    @Graham: PCRE is the regex engine used by PHP. It is the reason why I asked you what language you used. – Casimir et Hippolyte Jan 28 '15 at 14:21
  • 1
    @Graham https://regex101.com/r/pM9yO9/8 is something similar, but you will have to use replace. Anyway, this is a nice trick. – vks Jan 28 '15 at 14:22
  • @vks: yes you need to use a capture group to differentiate the two branches. In my opinion it is the way to go (when you don't have `(*SKIP)(*F)`) – Casimir et Hippolyte Jan 28 '15 at 14:26
  • @CasimiretHippolyte for python like languages :) – vks Jan 28 '15 at 14:27
  • @vks: or JavaScript. The main advantage is that the number of steps stays constant whatever the length of the string. When you use a lookahead, the number of steps increases quickly, in particular inside quotes. – Casimir et Hippolyte Jan 28 '15 at 14:32
  • @CasimiretHippolyte number of steps... whenever we use any group, be it a lookahead or any other, the number of steps is higher, as the engine takes one step to go in and one to go out. But does that lead to bad performance? I'm not sure about that. – vks Jan 28 '15 at 14:34
  • @vks: time it with your language and a big string, you will know. or use the utility `pcretest`. – Casimir et Hippolyte Jan 28 '15 at 14:53
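The capture-group alternative mentioned in the comments, and the suggestion to time it, can be sketched roughly as follows. This assumes Python's re and timeit modules; the function names, the input size, and the repeat count are made up for illustration, and the timings will vary by machine and engine.

```python
import re
import timeit

# Lookahead approach from the first answer.
LOOKAHEAD = re.compile(r'\s(?=(?:"[^"]*"|[^"])*$)')

# Capture-group approach: a quoted block matches the first branch and leaves
# group 1 unset; whitespace outside quotes is captured in group 1.
ALTERNATION = re.compile(r'"[^"]*"|(\s+)')

def strip_with_lookahead(text):
    return LOOKAHEAD.sub('', text)

def strip_with_capture_group(text):
    # Keep quoted blocks unchanged, drop the whitespace runs held in group 1.
    return ALTERNATION.sub(
        lambda m: '' if m.group(1) is not None else m.group(0), text)

sample = ' abc  def  " gh i " jkl  " m n o p " qrst  '
big = sample * 100  # arbitrary size chosen for the comparison

# Both approaches give the same result on input with balanced quotes.
same_result = strip_with_lookahead(big) == strip_with_capture_group(big)

# Rough timing; the lookahead rescans the rest of the string for every
# candidate whitespace, so it is expected to slow down as the input grows.
t_lookahead = timeit.timeit(lambda: strip_with_lookahead(big), number=3)
t_capture = timeit.timeit(lambda: strip_with_capture_group(big), number=3)
print(same_result, t_lookahead, t_capture)
```

Unlike the lookahead, the capture-group version treats a lone " as an ordinary character, so whitespace before an unbalanced quote is still removed.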