4

I have the following line:

"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"

I parse this by using a simple regexp:

if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
    my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}

But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?

AndersTornkvist
Lasse A Karlsen

6 Answers

18

The greedy operator tries to grab as much as it can while still letting the whole pattern match. What's happening is that the first group (after "say;") grabs "0ed673079715c343281355c2a1fde843;2", the second takes "laka", the third finds "hello " and the fourth is left with only the closing parenthesis.
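To see this concretely, here is a minimal sketch that runs the original greedy pattern against the sample line and prints each capture. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# The original pattern: every (.*) is greedy.
my @greedy = $line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/;

# Print each capture in brackets so stray spaces are visible.
print join(",", map { "[$_]" } @greedy), "\n";
# [14:48],[0ed673079715c343281355c2a1fde843;2],[laka],[hello ],[)]
```

The match succeeds, which is why the problem is easy to miss: the fields are simply shifted by one semicolon.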

What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:

(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)
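As a quick check, the non-greedy pattern can be run against the same sample line. This is only a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# The first three groups are lazy, so each stops at the nearest semicolon;
# the last group stays greedy and takes the rest of the line.
my @fields = $line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/;

print join(",", map { "[$_]" } @fields), "\n";
# [14:48],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]
```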
Barry Brown
  • That's great! Can you quickly tell me the difference between .*? and .*? Thanks! :) – Lasse A Karlsen Nov 01 '08 at 17:49
  • The difference is that .*? stops at the first instance of whatever follows, whereas .* stops at the last instance of whatever follows. – eyelidlessness Nov 01 '08 at 17:54
  • Ah, great folks! Appreciate it! :-) – Lasse A Karlsen Nov 01 '08 at 17:56
  • The ? modifies the * operator to make it non-greedy. You can also use ? with + to make it non-greedy. – Barry Brown Nov 01 '08 at 18:06
  • Very good general-case answer, but, for this specific question, I would favor [^;]* over .*? because the boundary which terminates the match is a single character. There are cases where .*? is what you need, but I find it best to avoid .* entirely whenever possible. – Dave Sherohman Nov 02 '08 at 16:13
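One way to illustrate the last comment: a lazy dot prefers a short match, but it will still run across a semicolon if that is the only way for the rest of the pattern to succeed, whereas a negated character class never can. The string and patterns below are made up for illustration and are not from the question.

```perl
use strict;
use warnings;

my $text = "a;b;c";

# Lazy dot: prefers a short match, but still expands across ';'
# when that is the only way for the rest of the pattern to succeed.
my ($lazy) = $text =~ /^(.*?);c$/;

# Negated class: can never contain ';', so the overall match fails instead.
my ($strict) = $text =~ /^([^;]*);c$/;

print "lazy:   ", defined $lazy   ? "[$lazy]"   : "no match", "\n";
print "strict: ", defined $strict ? "[$strict]" : "no match", "\n";
# lazy:   [a;b]
# strict: no match
```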
7
(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)

should work better, because [^;]* cannot run past a semicolon.
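A sketch of how this pattern might be wrapped for reuse, written with the /x flag so each field can carry a comment. parse_say_line is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

# parse_say_line is a hypothetical helper name, not something from the question.
# In list context it returns the five captures, or an empty list on failure.
sub parse_say_line {
    my ($line) = @_;
    return $line =~ /
        (\d+:\d+) \s say;   # timestamp, then the literal command
        ([^;]*) ;           # hash
        ([^;]*) ;           # pid
        ([^;]*) ;           # handle
        (.*)                # quote: rest of the line, semicolons allowed
    /x;
}

my @fields = parse_say_line("14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)");
print join(",", map { "[$_]" } @fields), "\n";
# [14:48],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]
```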

VonC
  • I think you have an extra ([^;]*); I think the last part is a comment with a smiley: "hello ;)" – Ady Nov 01 '08 at 17:46
  • Ady: Right: the last part can be as simple as (.*) to get the rest of the line. Fixed – VonC Nov 01 '08 at 18:02
7

Although a regex can easily do this, I'm not sure it's the most straightforward approach. It's probably the shortest, but that doesn't actually make it the most maintainable.

Instead, I'd suggest something like this:

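use strict;
use warnings;

my $x = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Peel off the timestamp first; everything after the whitespace goes into $rest.
if (my ($ts, $rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
    # A limit of 5 keeps any later semicolons inside the final field ($quote).
    my ($command, $hash, $pid, $handle, $quote) = split /;/, $rest, 5;
    print join(",", map { "[$_]" } $ts, $command, $hash, $pid, $handle, $quote), "\n";
}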

This results in:

[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]

I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because this is closer to how you would do it if a human were to attempt the same thing with pen and paper. Break the string down into chunks that you can then parse more easily, and have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.
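One detail worth spelling out in the split call above is the limit of 5. A small sketch of what happens with and without it, using the part of the sample line that follows the timestamp:

```perl
use strict;
use warnings;

my $rest = "say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

my @unlimited = split /;/, $rest;      # splits at every semicolon
my @limited   = split /;/, $rest, 5;   # stops after producing 5 fields

print scalar(@unlimited), " fields, last is [$unlimited[-1]]\n";
print scalar(@limited),   " fields, last is [$limited[-1]]\n";
# 6 fields, last is [)]
# 5 fields, last is [hello ;)]
```

Without the limit, the smiley's semicolon is treated as one more delimiter and the quote is cut in two.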

Tanktalus
3

Try making the first three (.*) groups non-greedy: (.*?)

Greg
3

If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.

(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^\r\n]+)
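A minimal sketch of this stricter style in use. The field formats are assumptions read off the single sample line, the leading ^ anchor is an addition, and the final group here allows semicolons because the sample quote contains one. The second input line is made up to show a rejection.

```perl
use strict;
use warnings;

# Field formats are assumptions read off the single sample line.
my $re = qr/^(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^\r\n]+)/;

my $good = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";
my $bad  = "14:48 say;not-a-hash;2;laka;hello ;)";   # made-up malformed line

for my $line ($good, $bad) {
    if (my @f = $line =~ $re) {
        print "matched: ", join(",", map { "[$_]" } @f), "\n";
    } else {
        print "rejected: $line\n";
    }
}
# matched: [14:48],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]
# rejected: 14:48 say;not-a-hash;2;laka;hello ;)
```

Because each group can only consume the characters its field allows, the malformed line fails right at the hash field instead of being accepted with shifted fields.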
Jan Goyvaerts
  • Jan, if you want something to be marked up as source code, each line has to start with four spaces. And welcome to SO. – Alan Moore Nov 02 '08 at 12:28
2

You could make * non-greedy by appending a question mark:

$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/

or you can match everything except a semicolon in each part except the last:

$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/
Robert Gamble