12

EDIT: Can anyone help me out with a regular expression for a string such as this:

[Header 1], [Head,er 2], Header 3

so that I can split this into chunks like:

[Header 1]
[Head,er 2]
Header 3

I have gotten as far as this:

(?<=,|^).*?(?=,|$)

Which will give me:

[Header 1]
[Head
,er 2]
Header 3

Alan Moore
Nate
  • How many CSV implementations does the world need??? – Joachim Sauer Apr 08 '09 at 21:55
  • Is this a homework question? Because I find it simpler to just use plain old manipulation - basically: for each char: if char is comma and not inside a bracket then add current string to list – Lucas Jones Apr 08 '09 at 22:11

6 Answers

22

In this case it's easier to split on the delimiters (commas) than to match the tokens (or chunks). Identifying the commas that are delimiters takes a relatively simple lookahead:

,(?=[^\]]*(?:\[|$))

Each time you find a comma, you do a lookahead for one of three things. If you find a closing square bracket first, the comma is inside a pair of brackets, so it's not a delimiter. If you find an opening bracket or the end of the line/string, it's a delimiter.
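
As a rough illustration of the lookahead in use, here is a minimal Python sketch. The choice of Python's `re` module, the helper name `split_headers`, and the trimming of whitespace around each chunk are assumptions for the example, not part of the original answer.

```python
import re

# A comma is a delimiter when, scanning forward, an opening bracket or the
# end of the string can be reached without crossing a closing bracket.
DELIMITER = re.compile(r',(?=[^\]]*(?:\[|$))')

def split_headers(line):
    """Split on delimiter commas and trim whitespace around each chunk."""
    return [chunk.strip() for chunk in DELIMITER.split(line)]

print(split_headers("[Header 1], [Head,er 2], Header 3"))
# ['[Header 1]', '[Head,er 2]', 'Header 3']
```

The pattern assumes brackets are balanced and never nested; with nested brackets, a comma inside an outer pair may still be treated as a delimiter.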

Alan Moore
  • Ah I see, I can replace the commas with another special char and split accurately using that. That'll work for me! Thanks! – Nate Apr 15 '09 at 19:08
  • This works perfectly as long as there are no nested brackets. For example, it works as expected for `[a],[b],[c[d,e]]` but fails on `[a],[b],[c,[d,e]]`: it matches the comma next to c in the last example. How can this be improved so that it does not match that comma as well? – matte Jul 09 '12 at 14:38
  • Actually, to be more precise for `[a],[b,[]` it matches the comma after b. If there is any opening square bracket in `[]`, this pattern matches the comma in the brackets. – matte Jul 09 '12 at 14:53
  • I think if brackets can be nested, `split` is no longer an option; you would have to match the tokens. And if they can be nested more than one level deep, regex might not be an option at all. (Many flavors can handle nesting to arbitrary depth, but it's ugly as hell.) – Alan Moore Jul 10 '12 at 00:51
  • I have a quick question: why is `"this is me".split(/(\s)/);` different from `"this is me".split(/\s/);`? The difference shows up only with split, not with .match, for example. This is in JS. – Muhammad Umer Aug 31 '13 at 19:58
  • When the split regex contains capturing groups, the captured bits are treated as tokens and included in the results. – Alan Moore Sep 01 '13 at 12:08
  • That was nice. When you have a hammer everything looks like a nail, and I'm starting to see `NotThis|(CaptureThis)` everywhere, but no need here, your selection of the comma here was quick and efficient. :) – zx81 May 21 '14 at 06:38
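
The capturing-group behaviour described in the comments above is not specific to JavaScript. As a small sketch, Python's `re.split` does the same thing; the variable names here are made up for the example.

```python
import re

text = "this is me"

# Without a capturing group, the separators are discarded.
without_group = re.split(r"\s", text)

# With a capturing group, each captured separator is kept in the result.
with_group = re.split(r"(\s)", text)

print(without_group)  # ['this', 'is', 'me']
print(with_group)     # ['this', ' ', 'is', ' ', 'me']
```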
6
\[.*?\]

Forget the commas, you don't care about them. :)

JP Alioto
  • Well, now I'm confused. Does it really say Header or is that some placeholder? Are the brackets really there or optional? It has now become confusing exactly what the valid input strings are. – JP Alioto Apr 08 '09 at 23:56
  • Sorry about changing it. Valid input strings look like [Some Text], Some More Text, [Yet mo,re Text], which should be split into [Some Text] / Some More Text / [Yet mo,re Text] – Nate Apr 09 '09 at 15:33
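
Given the clarified input in the last comment, where some chunks have no brackets at all, a bracket-only pattern would skip the unbracketed chunks. One possible way to match every token instead is sketched below in Python. The pattern and the helper name `tokens` are assumptions for illustration, and it expects each chunk to be either fully bracketed or free of brackets.

```python
import re

# Try a whole bracketed chunk first; otherwise take a run of non-comma
# characters that starts with a non-space character.
TOKEN = re.compile(r'\[[^\]]*\]|[^,\s][^,]*')

def tokens(line):
    """Return bracketed and unbracketed chunks in order of appearance."""
    return [m.group().strip() for m in TOKEN.finditer(line)]

print(tokens("[Some Text], Some More Text, [Yet mo,re Text]"))
# ['[Some Text]', 'Some More Text', '[Yet mo,re Text]']
```

The order of the alternatives matters: if the non-comma run were tried first, it would consume the opening bracket and stop at the comma inside it.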
2

Variations of this question have been discussed before.

For instance:

Short answer: Regular Expressions are probably not the right tool for this. Write a proper parser. A FSM implementation is easy.
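
A minimal sketch of such a hand-written scanner, in Python, might look like the following. The function name `split_top_level` is hypothetical, and the code keeps a depth counter rather than a strict two-state machine, on the assumption that nested brackets should be handled too.

```python
def split_top_level(line):
    """Split on commas that are not inside square brackets."""
    chunks = []
    current = []
    depth = 0  # number of currently open brackets
    for ch in line:
        if ch == '[':
            depth += 1
        elif ch == ']' and depth > 0:
            depth -= 1
        if ch == ',' and depth == 0:
            # Top-level comma: close the current chunk.
            chunks.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    chunks.append(''.join(current).strip())
    return chunks

print(split_top_level("[Header 1], [Head,er 2], Header 3"))
# ['[Header 1]', '[Head,er 2]', 'Header 3']
print(split_top_level("[a],[b],[c,[d,e]]"))
# ['[a]', '[b]', '[c,[d,e]]']
```

Unbalanced input is not rejected here: a stray closing bracket is ignored, and an unclosed opening bracket makes the rest of the line one chunk.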

Community
dmckee --- ex-moderator kitten
2
 (?<=,|^)\s*\[[^]]*\]\s*(?=,|$)

use the [ and ] delimiters to your advantage

rampion
1

Isn't it as simple as this?

(?<=,|^)(?:[^,]|\[[^[]*\])*
jpalecek
  • When I use your regex, I get the following from the dev tools: `regex = /(?<=,|^)(?:[^,]|\[[^[]*\])*/ SyntaxError: Invalid regular expression: /(?<=,|^)(?:[^,]|\[[^[]*\])*/: Invalid group` – starbeamrainbowlabs Jan 04 '13 at 18:47
1

You could either use a regular expression to match the values inside the brackets:

\[[^\]]*\]

Or you could use this regular expression to split the bracket list (using look-around assertions):

(?<=]|^)\s*,\s*(?=\[|$)
Gumbo