1

I'm trying to isolate all lines that contain semicolons but no colons, for further regex work. Right now I'm using a workaround: every line that contains neither semicolons nor colons also contains an opening bracket "(", so I'm just ignoring any line that contains one. The code I have doesn't actually work:

<?php
$filename = "fakexample.txt";
$file = fopen($filename, "rb");
$myFile = fread($file, filesize($filename));

function get_lines($string, $myFile){
  preg_match_all("/$string/m", $myFile, $matches);
  return $matches;
}

$string = "^((?!:|\().)*$";
$list = get_lines($string, $myFile);

foreach($list[1] as $list){
  echo $list."\n";
}
?>

I'm worried that this may not be valid PHP syntax, which might be why it isn't working.

The output I get is: d.

The input:

vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch    
mat (box)
pedistal; blockade; pilar
animal belly (oval)
old style: naval
jackal's belly; jester
slope of hill (arch)
key; visible; enlightened
Wolfpack'08
  • Split the lines, iterate over them and filter the `tabulations:` case (`preg_grep()`) and only then do the splitting. – mario Jan 01 '12 at 10:49
  • @Mario, I have something like that I am working on, here: http://pastebin.com/CM4bPReb. I'm not sure how close I am, though. Still troubleshooting syntax errors. – Wolfpack'08 Jan 01 '12 at 10:53
  • @Mario, Still working on it... http://pastebin.com/0Q3ypavr – Wolfpack'08 Jan 01 '12 at 12:34

2 Answers

2

This might do the trick:

<?php
$filename = "fakexample.txt";
$file = fopen($filename, "rb");
$myFile = fread($file, filesize($filename));

function get_lines($string, $myFile){
  if (preg_match_all("/$string/m", $myFile, $matches))
    return $matches[0];
  else return array();
}

// Match lines with ; but no :
$string = '^[^;:\r\n]*;[^:\r\n]*$';
$lines = get_lines($string, $myFile);

foreach($lines as $line){
  echo $line."\n";
}
?>

Additional:

Here is a breakdown of the above regex, which meets the precise original requirements stated in the question: i.e. "... isolate all lines with semicolons if they do not contain colons ..."

$re = '/ # Match line with ; but no :
    ^           # Anchor to start of line.
    [^;:\r\n]*  # Zero or more non-:, non-;
    ;           # Match one ; (minimum required).
    [^:\r\n]*   # Zero or more non-:.
    $           # Anchor to end of line.
    /xm';
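
As a quick sanity check, here is a minimal sketch that runs the same pattern (without the comments) against the sample input from the question. The input is embedded as a string so the sketch runs without `fakexample.txt`, the trailing spaces on the third line are dropped, the variable names are just for illustration, and LF line endings are assumed:

```php
<?php
// Sample input copied from the question, embedded so no file is needed.
$input = <<<'TXT'
vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch
mat (box)
pedistal; blockade; pilar
animal belly (oval)
old style: naval
jackal's belly; jester
slope of hill (arch)
key; visible; enlightened
TXT;

// Lines with at least one ';' and no ':' anywhere on the line.
$re = '/^[^;:\r\n]*;[^:\r\n]*$/m';

preg_match_all($re, $input, $matches);
$semicolonLines = $matches[0]; // index 0 holds the full-line matches

foreach ($semicolonLines as $line) {
    echo $line, "\n";
}
```

On this input it should print the four semicolon-separated lines and skip the `tabulations:` line, because the first character class stops at the colon before any semicolon has been seen.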

But since you insist on using the expression '^((?!(:|\()).)*$', it appears that what you really want to match is "lines having no colons and no left parentheses" (which is what that expression does). You probably already understand it, but I always like to write expressions fully commented (can't help myself!), so here it is broken down:

$re = '/ # Match line with no colons or left parentheses.
    ^           # Anchor to start of line.
    (           # Step through line one-char at a time.
      (?!       # Assert that this char is NOT...
        (:|\()  # either a colon or a left paren.
      )         # End negative lookahead.
      .         # Safe to match next non-newline char.
    )*          # Repeat for each char on the line.
    $           # Anchor to end of line.
    /xm';

If that is what you really want, fine. But if this is the case then the above expression can be greatly simplified (and sped up) as:

$re = '/ # Match line with no colons or left parentheses.
    ^           # Anchor to start of line.
    [^:(\r\n]*  # Zero or more non-:, non-(, non-EOL.
    $           # Anchor to end of line.
    /xm';
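
If you want to convince yourself that the two forms agree, a rough sketch like the following compares them on a shortened copy of the question's input. The variable names are mine, and the comparison assumes LF line endings (the dot in the lookahead version would accept a stray `\r`, while the character class would not):

```php
<?php
// Shortened sample based on the question's input (LF line endings assumed).
$input = "vulture (wing)\nold style: naval\njackal's belly; jester\nkey; visible; enlightened";

$lookahead = '/^((?!(:|\()).)*$/m'; // expression from the comments
$charClass = '/^[^:(\r\n]*$/m';     // simplified character-class form

preg_match_all($lookahead, $input, $a);
preg_match_all($charClass, $input, $b);

// Compare only the full-line matches (index 0), not the capture groups.
$sameLines = ($a[0] === $b[0]);
var_dump($sameLines);
print_r($b[0]);
```

Both patterns should return only the last two lines here. Note that neither one requires a semicolon, so a line with a single value and no punctuation would match as well.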

And just for the sake of completeness, if what you really, really need to match is lines "having at least one semicolon but no colons or left parentheses", then this one will do that:

$re = '/ # Match line with ; but no : or (
    ^            # Anchor to start of line.
    [^;:(\r\n]*  # Zero or more non-:, non-;, non-(.
    ;            # Match one ; (minimum required).
    [^:(\r\n]*   # Zero or more non-:, non-(.
    $            # Anchor to end of line.
    /xm';
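
The commented form can be passed to `preg_match_all()` as is, since the `x` modifier makes PCRE ignore the whitespace and `#` comments. A small sketch, assuming LF line endings; the `solitary` and `key; visible (seen); enlightened` lines are made up for illustration and are not from the question:

```php
<?php
// Mostly the question's input; the last two lines are hypothetical test cases.
$input = <<<'TXT'
vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch
solitary
key; visible (seen); enlightened
TXT;

$re = '/ # Match line with ; but no : or (
    ^            # Anchor to start of line.
    [^;:(\r\n]*  # Zero or more non-:, non-;, non-(.
    ;            # Match one ; (minimum required).
    [^:(\r\n]*   # Zero or more non-:, non-(.
    $            # Anchor to end of line.
    /xm';

preg_match_all($re, $input, $matches);
$strictLines = $matches[0];
print_r($strictLines);
```

Only `father; master; patriarch` should come back: `solitary` has no semicolon, and the last line is rejected because of the parenthesis after the first semicolon.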

When working with regex, it is extremely important to define the requirements precisely up front in the question. Regular expressions are a very precise language and they will only do what is asked of them.

I hope this helps!

ridgerunner
  • Oh, Ridgerunner, your icon is sick. – Wolfpack'08 Jan 02 '12 at 03:09
  • Could you explain the regex, ridge? You're dead close. The last line of semicolon-separated values is arrayized by this code, but the first three lines of semicolon-separated values are not. So, the missing part: `father, master, patriarch, pedistal, blockade, pilar, jackal's belly, jester`. The part captured: `key; visible; enlightened` – Wolfpack'08 Jan 02 '12 at 03:13
  • Oh hey. We got it. Switch the regex in `$string` to `^((?!(:|\()).)*$`. The issue I was having wasn't with the regex itself. It was with the surrounding PHP used to print everything out. I would like to understand regex better, though. I'm reading the manual over and over, trying to remember it. Thanks, bro. Giving you the `golden green diving combat wing`. – Wolfpack'08 Jan 02 '12 at 03:49
  • So basically, I take the result and do a `preg_split()` on semicolons in order to arrayize? – Wolfpack'08 Jan 02 '12 at 05:27
  • The explanation is awesome. The only thing that puzzles me is this line: `[^;:(\r\n]* # Zero or more non-:, non-;, non-(.`. Why 'non-;'? Because it's supposed to be lines with semicolons but without colons, right? – Wolfpack'08 Jan 02 '12 at 05:30
  • Wait.... I just realized it isn't quite right. If you look at the output `father; master; patriarch pedistal; blockade; pilar jackal's belly; jester key; visible; enlightened`, which can also be gotten without a lookahead, `^[^:(]*$`, you can see that the values aren't separated by semicolons. So, they're not in a format where they can be readily split or arrayized, you know? Hmm. I might have to look at the specs more because basically I'm trying to arrayize what's between semicolons, or a newline and a semicolon, or a semicolon and a line break, but not on a line with colons or braces. – Wolfpack'08 Jan 02 '12 at 05:39
  • Do you think I should make the arrayization part a separate question? Because I think this is a lot to work with, already. I wouldn't want to overwhelm readers. – Wolfpack'08 Jan 02 '12 at 05:47
  • Yes, it's best to think through the question you wish to ask and then carefully word it so there is no ambiguity (and don't go changing it midstream! - people spend time answering what you ask). Make it specific and provide example input and desired output. It sounds as though what you are really after is to parse _comma/semicolon separated values_ (CSV). If so, do a search for: "parse CSV" (this gets asked a lot). I have another answer you might want to look at that shows what you are up against. See: [How can I parse a CSV string with Javascript?](http://stackoverflow.com/a/8497474/433790). – ridgerunner Jan 02 '12 at 16:36
  • I actually have a complete answer, now, but I've found an exception. :( Sometimes, there is only one value. So, there's no semicolon on the line! lol – Wolfpack'08 Jan 03 '12 at 06:36
  • @Wolfpack'08 - You're very welcome. (I _love_ solving regex problems!) – ridgerunner Jan 05 '12 at 03:34
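
For the "arrayize" step discussed in the comments above, one possible approach (a sketch, not necessarily what the asker ended up with) is to skip the line-matching regex entirely: split the text into lines, drop any line that contains `:` or `(`, and `explode()` the rest on semicolons. This also copes with a line that holds a single value and no semicolon. The `solitary` line is a made-up example of that case:

```php
<?php
// Part of the question's input plus a hypothetical single-value line.
$input = <<<'TXT'
vulture (wing)
tabulations: one leg; two legs; flying
father; master; patriarch
mat (box)
solitary
key; visible; enlightened
TXT;

// Split into lines on any newline sequence (\n, \r\n, ...).
$lines = preg_split('/\R/', $input);

$values = [];
foreach ($lines as $line) {
    // Skip blank lines and any line containing ':' or '('.
    if (trim($line) === '' || strpbrk($line, ':(') !== false) {
        continue;
    }
    // Split on ';' and trim the whitespace around each value.
    foreach (explode(';', $line) as $value) {
        $value = trim($value);
        if ($value !== '') {
            $values[] = $value;
        }
    }
}

print_r($values);
```

With this input `$values` should end up as `father, master, patriarch, solitary, key, visible, enlightened`. Quoted fields or escaped semicolons would need a real CSV parser instead.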
0

(?<=;|^)[^;]*(?=;)|(?<=;)[^;]*(?=;|$)

That should work, although it will also match empty strings, such as the one between ;;. If you don't want this behaviour, just change the asterisks to plus signs.
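
A minimal PHP sketch of how the expression behaves on a single line taken from the question's input. The variable names are only for illustration. Because `[^;]` also matches newlines, this assumes the text has already been split into lines and the pattern is applied to one line at a time:

```php
<?php
// One semicolon-separated line from the question's input.
$line = "father; master; patriarch";

// Pieces that sit between semicolons, or between a semicolon and a line edge.
$re = '/(?<=;|^)[^;]*(?=;)|(?<=;)[^;]*(?=;|$)/';

preg_match_all($re, $line, $matches);

// The matches keep the space that follows each semicolon, so trim them.
$fields = array_map('trim', $matches[0]);
print_r($fields);
```

This should give `father`, `master` and `patriarch`. Run against the whole multi-line input at once, the character class would happily run across line breaks, which is worth keeping in mind when testing it in an online tester.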

  • `(?<=;|^)[^;]*(?=;)|(?<=;)[^;]*(?=;|$)` actually returns: `Undefined (?...) sequence.` – Wolfpack'08 Jan 01 '12 at 07:22
  • Did you put slashes around it? – Robert Allan Hennigan Leahy Jan 01 '12 at 07:23
  • Yes. I'm wondering if it's the engine or something. I'm using a Ruby expression editor, here: http://rubular.com/. I don't know if they support 'look behind', or if we're using that, yet. I'm trying to figure out why that error comes up. – Wolfpack'08 Jan 01 '12 at 07:26
  • Ruby does not support lookbehind (i.e. `(?<=)`). Your question is tagged PHP. The regex works in a [PHP regex tester such as this one](http://www.spaweditor.com/scripts/regex/). – Robert Allan Hennigan Leahy Jan 01 '12 at 07:28
  • Robert, thank you. I AM looking for PHP code, and I appreciate your using it. Unfortunately, the result set I am getting from the code at the PHP regex tester isn't the desired or expected. It's the entire input string. :/ I'm going to look at the syntax and see what I can do. All the testers I've tried, including the one you linked to, give me strange and different results, though. :/ I'll have to test locally. – Wolfpack'08 Jan 01 '12 at 07:32
  • If you're looking to run this regex in multi-line mode (i.e. putting an `m` right at the end of your regex, after the last slash) then this regex should work for picking out everything delimited by semicolons but not lines without semicolons: `(?<=;)[^;]*?(?=;|$)|(?<=;|^)[^;]*?(?=;)`. – Robert Allan Hennigan Leahy Jan 01 '12 at 07:54
  • If you want to be able to exclude lines on a certain criterion, use this: `(?:^.*tabulations\:.*$)|((?<=;)[^;]*?(?=;|$)|(?<=;|^)[^;]*?(?=;))`, and you only want patterns that match the first parenthesized subpattern. – Robert Allan Hennigan Leahy Jan 01 '12 at 07:55
  • Yeah, array index 0 indicates everything that was matched. Array index 1 -- if present -- will indicate everything that matched the first parenthesized subpattern. It's also important that you add `m` to the end of your regex after the last slash to enable multi-line mode. [See this screenshot](http://rleahy.ca/images/stack_overflow/phpregex01012012.png). – Robert Allan Hennigan Leahy Jan 01 '12 at 08:43
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/6285/discussion-between-wolfpack08-and-robert-allan-hennigan-leahy) – Wolfpack'08 Jan 01 '12 at 08:46
  • It looks like, in the screenshot, neither the expected nor the desired results are achieved. – Wolfpack'08 Jan 01 '12 at 08:56
  • Do you try to parse csv file? If so you could use `fgetcsv()`. – piotrekkr Jan 01 '12 at 10:35
  • @piotrekkr I'm trying to get all CSV lines that do not contain a colon, only. I want the array to not include any lines that do not contain commas. Actually, the target lines are semi-colon separated list, though.... – Wolfpack'08 Jan 01 '12 at 10:41
  • So what exactly do you want? Commas, colons, semicolons? I'm a little confused here. Do you want to filter those lines so that only lines with a semicolon but without a colon remain? – piotrekkr Jan 01 '12 at 10:57
  • Exclude lines with colons or parens. Arrayize semicolon-separated strings on lines without excluded characters. – Wolfpack'08 Jan 02 '12 at 03:09
  • So, someone told me that you have to use lookahead, not lookbehind. Hope that helps with any projects you work on that are similar in the future. Thanks, bro. It was nice to have someone trying with me. – Wolfpack'08 Jan 02 '12 at 03:52