This is my current regex (used in parsing an iCal file):

/([^:]+)[:|(;)]([\w\W]*)/

The current output using preg_match() is this:

//Output 1
Array
(
    [0] => DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London":20150601T073000
    [1] => DTEND;TZID="Greenwich Mean Time 
    [2] =>  Dublin, Edinburgh, Lisbon, London":20150601T073000
)
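
For reference, a minimal sketch reproducing Output 1 (the variable names are illustrative, not taken from the original code). It shows that [^:]+ stops at the first colon regardless of quoting, and that [:|(;)] is a character class, so |, ( and ) are matched as literal characters rather than acting as alternation or a capture group.

```php
<?php
// The regex from the question, unchanged.
$re = '/([^:]+)[:|(;)]([\w\W]*)/';
$line = 'DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London":20150601T073000';

preg_match($re, $line, $m);
print_r($m); // same three elements as Output 1

// Inside [...] the characters | ( ) are literals, so a lone "|" also
// satisfies the delimiter part of the pattern.
var_dump(preg_match('/^[:|(;)]$/', '|')); // int(1)
```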

I would like to amend my regex to output the following, i.e. ignore a colon if it is part of a phrase surrounded by double quotes. I think I need a lookbehind. There would only ever be one colon to find outside the quotes, as it's a separator:

//Output 2
Array
(
    [0] => DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London":20150601T073000
    [1] => DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London"
    [2] => 20150601T073000
)

The semicolon in the regex is there because the colon I'm looking for is sometimes on the next line when multiple properties are defined (;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London"), so in that case I break on the semicolon. For info, the iCal file is read one line at a time.

u01jmg3

2 Answers

1

You need a regex based on the SKIP-FAIL trick, which lets you match a pattern only when it occurs outside of another pattern (here, outside double-quoted strings). However, I cannot find a single-regex solution :(. You can use the first regex to split on colons outside the quoted strings and, if that does not give you an array of more than 1 element, fall back to the second one, which splits on semicolons:

"(?:[^"](?:\\.[^"]+)?)+"(*SKIP)(*FAIL)|:

And

 "(?:[^"](?:\\.[^"]+)?)+"(*SKIP)(*FAIL)|;

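The "(?:[^"](?:\\.[^"]+)?)+" part is meant to match a whole double-quoted string, including backslash-escaped characters (if any). (*SKIP)(*FAIL) then discards that match and resumes the search after the closing quote, so only delimiters outside quotes are found.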

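// Split on a colon outside double-quoted strings; quoted runs are skipped.
// In a single-quoted PHP string, \\\\ yields the two backslashes PCRE needs
// to match one literal backslash.
$re = '#"(?:[^"](?:\\\\.[^"]+)?)+"(*SKIP)(*FAIL)|:#';
$str = "DTEND;TZID=\"Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London\":20150601T073000";
//$str = "DTEND;TZID=\"Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London\";20150601T07300001T073000";
$arr = preg_split($re, $str);
if (count($arr) > 1) {
    print_r($arr);
} else {
    // No colon outside quotes: fall back to a semicolon outside quotes.
    $re2 = '#"(?:[^"](?:\\\\.[^"]+)?)+"(*SKIP)(*FAIL)|;#';
    $arr2 = preg_split($re2, $str);
    if (count($arr2) > 1) {
        print_r($arr2);
    } else {
        echo "No matches";
    }
}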

IDEONE Demo
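
The colon-then-semicolon fallback can also be wrapped in a small helper that returns the three-element shape of Output 2. This is only a sketch: splitOutsideQuotes() and splitWithFallback() are hypothetical names, and the quoted-string part is simplified to "[^"]*", which assumes quoted values never contain an embedded double quote or a backslash escape.

```php
<?php
// Hypothetical helper: split $line in two at the first $delimiter that is
// not inside double quotes. Returns null when no such delimiter exists.
function splitOutsideQuotes(string $line, string $delimiter): ?array
{
    // Quoted runs are matched and then discarded via (*SKIP)(*FAIL), so only
    // a delimiter outside quotes can produce a split point.
    // Assumption: quoted values contain no embedded quotes or escapes.
    $re = '#"[^"]*"(*SKIP)(*FAIL)|' . preg_quote($delimiter, '#') . '#';
    $parts = preg_split($re, $line, 2); // limit 2: split once only
    return count($parts) === 2 ? $parts : null;
}

// Hypothetical wrapper: try the colon first, then fall back to the semicolon.
function splitWithFallback(string $line): ?array
{
    foreach ([':', ';'] as $delimiter) {
        $parts = splitOutsideQuotes($line, $delimiter);
        if ($parts !== null) {
            return [$line, $parts[0], $parts[1]]; // same shape as Output 2
        }
    }
    return null; // neither delimiter found outside quotes
}

$line = 'DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London":20150601T073000';
print_r(splitWithFallback($line)); // same three elements as Output 2
```

With a limit of 2, preg_split() stops at the first unquoted delimiter, so any later colons in the value stay in the second part.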

And here is another attempt at a single regex (I am not sure it covers every case):

"(?:[^"](?:\\.[^"]+)?)+"(*SKIP)(*FAIL)|(?!.*:);(?=[^:]*$)|(?!.*;):(?=[^;]*$)

See demo
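
A quick way to check the single-regex attempt is to run it against a few lines. The sketch below assumes the pattern is used with preg_split(), as in the earlier code; the second and third sample lines are made up for illustration. Note the \\\\ in the PHP source, which is how a single-quoted string passes \\ to PCRE.

```php
<?php
// The single-regex attempt, with '#' delimiters (an arbitrary choice).
$re = '#"(?:[^"](?:\\\\.[^"]+)?)+"(*SKIP)(*FAIL)|(?!.*:);(?=[^:]*$)|(?!.*;):(?=[^;]*$)#';

$lines = [
    'DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London":20150601T073000',
    'DTEND;TZID="Greenwich Mean Time"',          // illustrative: no colon at all
    'DTEND;TZID="Greenwich Mean Time : Dublin"', // illustrative: colon only inside quotes
];

foreach ($lines as $line) {
    $parts = preg_split($re, $line);
    echo count($parts), ' part(s): ', implode(' | ', $parts), PHP_EOL;
}
```

On these inputs the first two lines split into two parts, while the third is left whole, because the (?!.*:) lookahead also sees the colon inside the quotes. If that case matters, the two-step approach may be the safer choice.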

Wiktor Stribiżew
  • Thanks but demo gives no matches? – u01jmg3 Jun 01 '15 at 17:26
  • There is a match, the colon is matched. It is the colon that the string should be split against. – Wiktor Stribiżew Jun 01 '15 at 17:28
  • Thanks but the semicolon has gone? I need that as a fallback if the colon can't be found as per my question description. Sorry to make it more complicated. – u01jmg3 Jun 01 '15 at 17:42
  • It is not complicated, I am trying to look after my son, answer my wife's questions, and help you at the same time :) I will correct my answer ASAP. I guess you want to split by a colon outside quoted substrings, but if there is no colon, use a semicolon outside of quoted substrings. Right? A sample string would greatly facilitate things. – Wiktor Stribiżew Jun 01 '15 at 18:10
  • Please check. If it is not what you ask, perhaps, we need more examples, or this is not a task for a regex... Or perhaps, your input is simple enough for vks' workaround. – Wiktor Stribiżew Jun 01 '15 at 18:46
  • Thanks for editing again - I prefer your (cleaner) approach with the SKIP-FAIL trick but for my current setup vks' workaround is enough for the input I am using. It outputs the matches for both the colon and semicolon in a manner my setup is expecting. Appreciate your help nonetheless – u01jmg3 Jun 01 '15 at 21:31
  • :) Great. Do not forget to upvote if an answer proved helpful/useful to you. – Wiktor Stribiżew Jun 01 '15 at 21:32
0
(.*?)(?::(?=(?:[^"]*"[^"]*")*[^"]*$)|;(?=[^:]*$))([\w\W]*)

You can try this. See the demo:

https://regex101.com/r/pG1kU1/9
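
A minimal sketch of using this pattern with preg_match(), assuming the '~' delimiter and a hypothetical helper name matchWithLookahead(). The colon branch only accepts a colon that is followed by an even number of double quotes (i.e. one outside any quoted phrase); the semicolon branch only fires when no colon at all follows it.

```php
<?php
// Hypothetical helper around the pattern above. Returns the match array
// ([0] whole line, [1] left part, [2] right part) or null if nothing matched.
function matchWithLookahead(string $line): ?array
{
    $re = '~(.*?)(?::(?=(?:[^"]*"[^"]*")*[^"]*$)|;(?=[^:]*$))([\w\W]*)~';
    return preg_match($re, $line, $m) === 1 ? $m : null;
}

$line = 'DTEND;TZID="Greenwich Mean Time : Dublin, Edinburgh, Lisbon, London":20150601T073000';
print_r(matchWithLookahead($line)); // same three elements as Output 2
```

The delimiter itself is not captured here. Also, on a line whose only colon sits inside quotes, such as DTEND;TZID="Greenwich Mean Time : Dublin", neither branch matches in this sketch, so the caller would need to handle the null result.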

vks
  • Thanks but looking for output exactly like the 'Output 2' snippet above – u01jmg3 Jun 01 '15 at 17:27
  • Thanks but where has the semicolon gone? I need that as a fallback if the colon can't be found as per my question description. – u01jmg3 Jun 01 '15 at 17:41
  • Thanks but that doesn't fix things - the semicolon is a fallback if a colon cannot be found. It's not correct to define `[:;]` because if the semicolon comes first it will use it rather than searching for a colon which is what I want to find. Semicolon is merely a fallback in the event the colon cannot be found so things don't break and I need to capture the semicolon if that is what is to be used (i.e. `(;)`). Appreciate your help – u01jmg3 Jun 01 '15 at 18:01
  • I have no idea who downvoted - it is the accepted answer – u01jmg3 Jun 02 '15 at 16:35
  • @u01jmg3 I have answered your new question. Do have a look :) – vks Jun 02 '15 at 16:35
  • Thanks @vks for helping me out again – u01jmg3 Jun 02 '15 at 17:10
  • @u01jmg3 Anytime, sir :) – vks Jun 02 '15 at 17:10