17

I need to match a colon (':') in a string, but not when it's enclosed by quotes - either a " or ' character.

So the following should have 2 matches

something:'firstValue':'secondValue'
something:"firstValue":'secondValue'

but this should only have 1 match

something:'no:match'
TheLizzard
Jaco Pretorius
  • 1
    @Jaco: 1) What language? 2) Isn't it way easier to split the string on ['"] first so you can check all uneven-numbered items in the array. – Huppie Sep 18 '09 at 09:10
  • You would be better off with a parser. – Gumbo Sep 18 '09 at 09:16
  • @Gumbo...I guess that's what he wants to achieve. My advice: Read byte-wise and use a flag if you're in quotes – Scoregraphic Sep 18 '09 at 09:19
  • You need to specify which regex implementation you will be using. – DigitalRoss Sep 18 '09 at 09:20
  • Although I have to agree with the others that it's actually harder to do this with a regex than with a simple scan. – DigitalRoss Sep 18 '09 at 09:22
  • This was in C#, but I thought the language was unimportant. But I'm starting to think I would be better off with a parser like Gumbo said. – Jaco Pretorius Sep 18 '09 at 10:59
  • Possible duplicate question: [A regex to detect string not enclosed in double quotes](https://stackoverflow.com/questions/11324749/a-regex-to-detect-string-not-enclosed-in-double-quotes) – Simon East Aug 23 '22 at 08:50

5 Answers

7

If the regular expression implementation supports look-around assertions, try this:

:(?:(?<=["']:)|(?=["']))

This will match any colon that is either preceded or followed by a double or single quote. It therefore only handles constructs like the ones you mentioned; something:firstValue would not be matched.
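As a rough check of the pattern above, here is a minimal Python sketch. The function name `count_quote_adjacent_colons` is made up for illustration, and Python is an assumption (the asker later mentions C#); the pattern itself is copied unchanged from the answer.

```python
import re

# Pattern from the answer: a colon that is directly preceded
# or directly followed by a single or double quote.
QUOTE_ADJACENT_COLON = re.compile(r""":(?:(?<=["']:)|(?=["']))""")


def count_quote_adjacent_colons(text: str) -> int:
    """Count colons that touch a quote character on either side."""
    return len(QUOTE_ADJACENT_COLON.findall(text))


if __name__ == "__main__":
    samples = [
        "something:'firstValue':'secondValue'",
        "something:\"firstValue\":'secondValue'",
        "something:'no:match'",
        "something:firstValue",
    ]
    for sample in samples:
        print(sample, count_quote_adjacent_colons(sample))
```

The pattern only looks at the characters next to each colon, so it does not track whether a colon is actually inside an open quotation.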

It would be better if you build a little parser that reads the input byte-by-byte and remembers when quotation is open.
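A small scanner of the kind described could look roughly like the following Python sketch. The function name `unquoted_colon_positions` is hypothetical, and the sketch assumes that a quotation is closed only by the same quote character that opened it and that there are no escape sequences.

```python
from typing import List


def unquoted_colon_positions(text: str) -> List[int]:
    """Return the indexes of colons that are not inside quotes."""
    positions = []
    open_quote = None  # the quote character currently open, if any
    for index, char in enumerate(text):
        if open_quote is not None:
            # Inside a quotation: only the matching quote closes it.
            if char == open_quote:
                open_quote = None
        elif char in "'\"":
            open_quote = char
        elif char == ":":
            positions.append(index)
    return positions


if __name__ == "__main__":
    print(unquoted_colon_positions("something:'firstValue':'secondValue'"))
    print(unquoted_colon_positions("something:'no:match'"))
```

Because the scanner remembers which quote character is open, a double quote inside a single-quoted section (or the reverse) is treated as ordinary text.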

Gumbo
3

Regular expressions are stateless. Tracking whether you are inside of quotes or not is state information. It is, therefore, impossible to handle this correctly using only a single regular expression. (Note that some "regular expression" implementations add extensions which may make this possible; I'm talking solely about "true" regular expressions here.)

Doing it with two regular expressions is possible, though, provided that you're willing to modify the original string or to work with a copy of it. In Perl:

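# Remove every quoted section so that colons inside quotes disappear.
$string =~ s/['"][^'"]*['"]//g;
# Match in list context; the "= () =" idiom yields the number of matches.
my $match_count = () = $string =~ /:/g;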

The first will find every sequence consisting of a quote, followed by any number of non-quote characters, and terminated by a second quote, and remove all such sequences from the string. This will eliminate any colons which are within quotes. (something:"firstValue":'secondValue' becomes something:: and something:'no:match' becomes something:)

The second does a simple count of the remaining colons, which will be those that weren't within quotes to start with.
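For readers who do not use Perl, the same two-step idea can be sketched in Python. The name `count_unquoted_colons` is made up for illustration; the first pattern is the one from the answer, so it shares its limitation of not checking that the opening and closing quote are the same character.

```python
import re

# A quote, any number of non-quote characters, and a second quote.
QUOTED_SECTION = re.compile(r"""['"][^'"]*['"]""")


def count_unquoted_colons(text: str) -> int:
    """Strip quoted sections from a copy of the text, then count colons."""
    stripped = QUOTED_SECTION.sub("", text)
    return stripped.count(":")


if __name__ == "__main__":
    print(count_unquoted_colons("something:\"firstValue\":'secondValue'"))
    print(count_unquoted_colons("something:'no:match'"))
```

The original string is left untouched because `sub` returns a new string.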

Just counting the non-quoted colons doesn't seem like a particularly useful thing to do in most cases, though, so I suspect that your real goal is to split the string up into fields with colons as the field delimiter, in which case this regex-based solution is unsuitable, as it will destroy any data in quoted fields. In that case, you need to use a real parser (most CSV parsers allow you to specify the delimiter and would be ideal for this) or, in the worst case, walk through the string character-by-character and split it manually.
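To illustrate the CSV-parser suggestion, here is a minimal sketch using Python's standard `csv` module with a colon as the delimiter. The helper name `split_fields` is hypothetical. One assumption to note: `csv` accepts only a single quote character per reader, so this sketch handles either single or double quotes in one call, not both at once.

```python
import csv
from typing import List


def split_fields(line: str, quote: str = "'") -> List[str]:
    """Split one line on colons, keeping colons inside quoted fields."""
    reader = csv.reader([line], delimiter=":", quotechar=quote)
    return next(reader)


if __name__ == "__main__":
    print(split_fields("something:'no:match'"))
    print(split_fields('key:"a:b"', quote='"'))
```

Unlike the strip-and-count approach, this keeps the contents of the quoted fields, which is what is needed when the real goal is splitting.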

If you tell us the language you're using, I'm sure somebody could suggest a good parser library for that language.

Dave Sherohman
  • I'm using C# but I thought that I could do it with a Regex (which is language independent)... I think it's better to just parse it without Regex tho – Jaco Pretorius Sep 18 '09 at 11:02
  • 1
    That's the trouble; a regex isn't language/library independent; the parts that are can't do this. – reinierpost Sep 18 '09 at 12:37
1

Oops ... I missed the point. Forget the rest. This is quite hard to do because regexes are not good at counting balanced characters (the .NET implementation, for example, has an extension that can do it, but it's a bit complicated).

You can use negated character groups to do this.

[^'"]:[^'"]

You can further wrap the quotes in non-capturing groups.

(?:[^'"]):(?:[^'"])

Or you can use assertions.

(?<!['"]):(?!['"])
Daniel Brückner
0

I've come up with the following slightly worrying construction:

(?<=^('[^']*')*("[^"]*")*[^'"]*):

It uses a lookbehind assertion to make sure you match an even number of quotes from the beginning of the line to the current colon. It allows for embedding a single quote inside double quotes and vice versa. As in:

'a":b':c::"':" (matches at positions 6, 8 and 9)

EDIT

Gumbo is right: using * within a look-behind assertion is not allowed in most regex implementations.
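Where variable-width look-behind is unavailable (Python's built-in `re` module, for instance, rejects it), a different technique can reach the same result: match whole quoted sections and bare colons in one alternation, and keep only the colons. This is a swapped-in approach, not the pattern above, and the name `colons_outside_quotes` is made up for illustration.

```python
import re
from typing import List

# Quoted sections are consumed by the first two alternatives, so a colon
# inside them never reaches the capturing group in the third alternative.
TOKEN = re.compile(r"""'[^']*'|"[^"]*"|(:)""")


def colons_outside_quotes(text: str) -> List[int]:
    """Return the indexes of colons that are not inside quotes."""
    return [
        match.start(1)
        for match in TOKEN.finditer(text)
        if match.group(1) is not None
    ]


if __name__ == "__main__":
    # The example string from the answer.
    print(colons_outside_quotes("'a\":b':c::\"':\""))  # [6, 8, 9]
```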

heijp06
  • This expression will only match if the string starts with a single quote because of the assertion (?<=^('[^... – Daniel Brückner Sep 18 '09 at 09:41
  • @Daniel - ('[^']*')* matches zero or more instances of something between single quotes, so it does not have to start with a quote. Having said that, mine is broken too, see my edit – heijp06 Sep 18 '09 at 09:47
  • 1
    In general, look-behind assertions don’t allow infinite quantifiers such as `*`. – Gumbo Sep 18 '09 at 09:48
0

You can try to catch the strings within the quotes

/(?<q>'|")([\w ]+)(\k<q>)/m

The first group defines the allowed quote types; the second group matches word characters (letters, digits, underscores) and spaces. The advantage of this solution is that it matches ONLY strings whose opening and closing quotes are the same character.
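A hedged Python sketch of the same pattern follows. Python spells the named group as `(?P<q>...)` and the backreference as `(?P=q)` instead of `(?<q>...)` and `\k<q>`; the function name `quoted_values` is hypothetical.

```python
import re
from typing import List

# Opening quote captured as "q", then word characters and spaces,
# then the same quote character again.
QUOTED_WORDS = re.compile(r"""(?P<q>'|")([\w ]+)(?P=q)""")


def quoted_values(text: str) -> List[str]:
    """Return the contents of quoted sections whose quotes match."""
    return [match.group(2) for match in QUOTED_WORDS.finditer(text)]


if __name__ == "__main__":
    print(quoted_values("something:\"firstValue\":'secondValue'"))
    print(quoted_values("something:\"firstValue'"))  # mismatched quotes
```

Since the character class `[\w ]` excludes the colon, a quoted section such as 'no:match' is not matched by this pattern at all.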

Try it at regex101.com

Radon8472
  • I think this regex doesn't do what was asked for: split at colons except when they are within quoted parts. – not2savvy Dec 10 '19 at 10:47
  • It is not fully clear in the question, but my regex catches all colons. maybe it could be modified, if the question author wants to catch pieces without quotes too. – Radon8472 Dec 11 '19 at 14:03