1

I have a Perl regex that I'm fairly certain should work, but it seems to be too greedy:

regex: (?:.*serial[^\d]+?(\d+).*)

Test string: APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou

Desired group 1 match: 123456

Actual group 1 Match: 12

I've tried every permutation of lookahead and behind and laziness and I can't get the damn thing to work.

WHAT AM I MISSING?

Thanks!

Bob Fishel

2 Answers

4

The Problem is Not Greediness, but Case-Sensitivity

Currently your regex matches the 12 at the end of serialnun12, probably because it is case-sensitive. We have two options: using upper-case, or making the pattern case-insensitive.
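To see this concretely, here is a minimal sketch that runs the pattern from the question against the test string. The variable names (`$string`, `$original`, `$lower`, `$upper`) are only illustrative and are not from the question.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# The pattern exactly as posted, with no /i flag: only the lowercase
# "serial" near the end of the string can satisfy the literal.
my ($original) = $string =~ /(?:.*serial[^\d]+?(\d+).*)/;

# Count how often each spelling occurs in the test string.
my $lower = () = $string =~ /serial/g;
my $upper = () = $string =~ /SERIAL/g;

print "group 1: $original\n";
print "lowercase hits: $lower, uppercase hits: $upper\n";
```

The lowercase spelling occurs once, right before `nun12`, so that is the only place a case-sensitive `serial` can anchor the capture.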

Option 1: Use Upper-Case

If you only want 123456, you can use:

SERIALNO\K\d+

The \K tells the engine to drop what was matched so far from the final match it returns.
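In Perl itself, how you read the result matters when you use `\K`. The sketch below is one way to look at it, assuming Perl 5.10 or later; the variable names are illustrative. A list-context match without capture groups only reports success, so the trimmed match has to be read from the overall-match variable (here `${^MATCH}` with the `/p` flag), or you can fall back on a capture group.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# Without capture groups, a list-context match just returns 1 on success.
my ($flag) = $string =~ /SERIALNO\K\d+/;

# With /p, ${^MATCH} holds the overall match, which \K has trimmed
# down to the digits.
my $kept;
if ($string =~ /SERIALNO\K\d+/p) {
    $kept = ${^MATCH};
}

# A capture group returns the same digits directly.
my ($captured) = $string =~ /SERIALNO(\d+)/;

print "flag=$flag kept=$kept captured=$captured\n";
```

Both approaches end up with the same digits; which one reads better is largely a matter of taste.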

If you want to match the whole string and capture 123456 to Group 1, use:

.*?SERIAL\D+(\d+).*

Option 2: Turning Case-Insensitivity On using (?i) inline or the i flag

To only match 123456, you can use:

(?i)serial\D+\K\d+

Note that if you use the g flag, this would match two numbers: 123456 (after SERIALNO) and 12 (after serialnun).

If you want to match the whole string and capture 123456 to Group 1, use:

(?i).*?serial\D+(\d+).*
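The lazy `.*?` at the front matters once the match is case-insensitive. As a rough illustration (variable names are hypothetical), the sketch below compares a greedy and a lazy prefix under the `i` flag: the greedy `.*` backtracks from the end of the string and therefore settles on the last `serial`, while the lazy `.*?` stops at the first one.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# Greedy prefix: .* grabs as much as possible, so the LAST "serial"
# (in either case) is the one that ends up before the capture.
my ($greedy) = $string =~ /.*serial\D+(\d+).*/i;

# Lazy prefix: .*? grows one character at a time, so the FIRST
# "serial" is used.
my ($lazy) = $string =~ /.*?serial\D+(\d+).*/i;

print "greedy prefix: $greedy\n";
print "lazy prefix:   $lazy\n";
```

So even with case-insensitivity switched on, the pattern from the question would still capture 12 until the leading `.*` is made lazy or dropped.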

A few tips

  • You can turn on case-insensitivity either with the (?i) inline modifier or the i flag at the end of the pattern: /serial\D+\K\d+/i
  • Instead of [^\d], use \D
  • There is no need for a lazy quantifier in something like \D+\d+ because the two tokens are mutually exclusive: there is no danger that the \D will run over the \d
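The tips above can be checked with a short sketch. It assumes nothing beyond core Perl, and the names `@patterns` and `@results` are illustrative. The three patterns differ only in how case-insensitivity is switched on, in `[^\d]` versus `\D`, and in whether the quantifier is lazy.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

my @patterns = (
    qr/(?i)serial[^\d]+?(\d+)/,   # inline modifier, [^\d], lazy quantifier
    qr/(?i)serial\D+(\d+)/,       # inline modifier, \D, greedy quantifier
    qr/serial\D+(\d+)/i,          # trailing i flag, \D, greedy quantifier
);

# Each pattern is unanchored, so the leftmost match (after SERIALNO) wins.
my @results = map { $string =~ $_ ? $1 : undef } @patterns;

print join(', ', @results), "\n";
```

All three should agree, because `\D` cannot consume a digit and so never competes with the `\d+` that follows it.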
zx81
  • And still showing different options and elaborate :] – Jonny 5 Jul 29 '14 at 21:52
  • What's with the three mentions of `\K` outside of substitutions? – ikegami Jul 29 '14 at 22:12
  • @Jonny5 Not understanding your comment, can you please explain? – zx81 Jul 29 '14 at 22:15
  • @ikegami Not understanding your question, can you please explain? (What substitutions?) – zx81 Jul 29 '14 at 22:16
  • @zx81 I always like reading your answers - they're still so detailed, after you wrote so many answers already :) – Jonny 5 Jul 29 '14 at 22:19
  • @zx81, There are no substitutions, so why are you using `\K`? Half of your post doesn't seem to have anything to do with the question. – ikegami Jul 29 '14 at 22:24
  • @Jonny5 Ah... I understand now. Thank you for your kind and supportive comment. There is pleasure in taking the time to write a "proper answer", even if only three people see it. – zx81 Jul 29 '14 at 22:24
  • @ikegami because no capturing group is needed, if resetting at `\K`. OP wants only `123456`. See it as a variable length lookbehind. read [more about \K](http://www.rexegg.com/regex-php.html#k) – Jonny 5 Jul 29 '14 at 22:30
  • @ikegami I have no idea what you are talking about. What does `\K` have to do with substitutions? As the answer explains, the `\K` tells the engine to drop what was matched so far from the final match it returns. That allows us to match `123456` in a very compact way. – zx81 Jul 29 '14 at 22:34
  • `perl -E"say 'abc' =~ /.\K./"` prints `1`. If you want to return `b`, you'd use `perl -E"say 'abc' =~ /.\K(.)/"`, which is the same as `perl -E"say 'abc' =~ /.(.)/"`. There's no reason to use `\K` outside of substitutions. – ikegami Jul 29 '14 at 22:35
  • @ikegami `Half of your post doesn't seem to have anything to do with the question.` Coming from someone of your skill, this surprises me. Can you give me even one line in my answer that does not have anything to do with the question?... Or are we reading a different question? This comment also feels aggressive to me, and that surprises me too, as I have always treated you cordially and respectfully. – zx81 Jul 29 '14 at 22:36
  • Half of it relates to `\K` which has no use in matching. – ikegami Jul 29 '14 at 22:37
  • @ikegami `\K which has no use in matching.` Then I respectfully submit that perhaps you do not understand `\K`? It is worth five million bucks in matching. See [this demo](http://regex101.com/r/mU7lT6/6): good luck matching `123456` in a more compact fashion. – zx81 Jul 29 '14 at 22:40
  • Looked at the demo... That's not Perl. I'm talking about Perl. It'll affect what GNU `grep` outputs, for example, but we're not talking about that. – ikegami Jul 29 '14 at 22:41
  • @ikegami As well you could say, half of it relates to `\d` :p Why not use `\K` if it prevents the use of unneeded capturing-group? – Jonny 5 Jul 29 '14 at 22:41
  • @ikegami Yeah, sure, and `.*` also matches the same input. That's not the point. `SERIALNO\K\d+` matches `123456`, **which is what the OP wants**, `SERIALNO\d+` matches `SERIALNO123456`, which is **NOT** what the OP wants. – zx81 Jul 29 '14 at 22:43
  • @Jonny 5, Second time you mentioned unneeded capture group, but that makes no sense. If you want the value, you need a capture group. If you don't want the value, then `\K` doesn't help. – ikegami Jul 29 '14 at 22:43
  • @zx81, I can repeat myself in bold too. **`perl -E"say '123456' =~ /\K123456/"` doesn't return 123456!** You need `perl -E"say '123456' =~ /\K(123456)/"` for that, which is the same as `perl -E"say '123456' =~ /(123456)/"`. – ikegami Jul 29 '14 at 22:44
  • @ikegami Try `echo "APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou" | perl -ne 'print "$&\n" if m/SERIALNO\K\d+/'` **Result: `123456`** This is the exact string of the OP, and the regex I gave him. Now tell me again what the problem is??? – zx81 Jul 29 '14 at 22:48
  • @zx81, Yeah, it makes a difference if you make the mistake of using `$&`. Using `$&` slows down every single match without captures in your interpreter. Two wrongs don't make a right. If you're going to argue "but it's a one liner", then length matters and you're better off using a capture. – ikegami Jul 29 '14 at 22:49
  • @ikegami `if you're stupid enough` You're way out of line. I auto-generated that code in one of my templates. That's not the point. – zx81 Jul 29 '14 at 22:50
  • You're right. People can do stupid things without being stupid. That's why I rephrased over a minute before your comment. – ikegami Jul 29 '14 at 22:51
  • I didn't say you can't use `\K` outside of substitutions. I said there's no reason to do so. So yeah, that is the point. – ikegami Jul 29 '14 at 22:53
  • @ikegami `there's no reason to do so` Sure, there are different idioms. If you don't want to use `\K`, that's your right. By the same token, you could say that there's no reason to use capture groups, since you have `\K`. It's a matter of perspective. For my esthetics, the `\K` answer looks better—and it is very handy in more complex expressions. A `$1` answer is fine too, and of course that's what the full match parts of my answer use. I still think that every line of my answer has everything to do with the question. We'll have to agree to disagree. – zx81 Jul 29 '14 at 22:58
  • @zx81, Re "A `$1` answer is fine too," You have yet to show that `\K` can replace `$1` without introducing a serious bug. If you think `\K` is useful, show how! That's all you have to do, but you have yet to do so! You keep saying how good it is, so it should be super easy to show an example, so why won't you? – ikegami Jul 30 '14 at 11:31
  • @ikegami `Please give an example` I remember writing some lovely expressions with multiple `\K` in different branches—not branches that would have been candidates for branch reset—perhaps in conjunction with `\G`, a bit hard to invent just like that, but if I find an old answer like that I may shoot it your way. As for simple ones, I use them every day. – zx81 Jul 30 '14 at 11:35
  • Re " As for simple ones, I use them every day." So give one already! – ikegami Jul 30 '14 at 11:42
  • @ikegami Giving you a simple one is easy, give me a minute. Here's one that's more interesting. You're in Notepad++, you want to find one word at the beginning of a line that was already at the beginning of an earlier line. `(?m)^(\S+)\b.*\r?\n(?s).*?\K^\1\b` – zx81 Jul 30 '14 at 11:44
  • @ikegami As for a simple one... Anywhere where you would love to have an infinite lookbehind. `\K` gives you a direct match without capture groups—but we know you don't care about that. – zx81 Jul 30 '14 at 11:47
  • Show an example!!! Re "Anywhere where you would love to have an infinite lookbehind.", In a match, you can just drop the `\K` and get the same result. Re "`\K` gives you a direct match without capture groups", That's just not possible without introducing a serious bug – ikegami Jul 30 '14 at 11:48
  • @ikegami You saw the one from two lines above? Going through some of my answers, there are some cool patterns with `\K` in [this question](http://stackoverflow.com/a/23728051/1078583) (options 1 and 2) – zx81 Jul 30 '14 at 11:51
  • [No, I hadn't noticed you had written two comments.] A Perl example, please. I already acknowledged that `\K` is useful outside of Perl. This is a Perl question being answered. – ikegami Jul 30 '14 at 11:52
  • As for the linked question, it's not runnable code, so as far as I know, you'd get the same result by dropping the `\K` as I said earlier. – ikegami Jul 30 '14 at 11:54
  • @ikegami The simplest one that comes to mind: you want to match `C` if it is preceded by `A` and any number of letters `B`. Unless you're in .NET or Barnett Python, you cannot do `(?<=AB+)C`. Of course you can capture with `AB+(C)`. But I would tend to direct match with `AB+\KC` (remember this is not just Perl but PCRE and Ruby 2+). I do this all the time. For me `\K` is mostly a hack for engines that don't have infinite lookbehind. – zx81 Jul 30 '14 at 11:56
  • @ikegami You must have downvoted my answer, as the downvote happened at the same time as your first in the last round of comments. So you downvote working answers? You know what it says when you hover over the downvote button: "This answer is not useful"? I will make a minor edit to give you a chance to rectify—otherwise I am done talking with you, as for me this is an unacceptable standard of behavior. Btw if you get a downvote do not think I retaliated, I don't do that. My DVs are at 40, they are not moving. – zx81 Jul 30 '14 at 12:07
  • First you call me stupid, then you downvote a detailed, working answer. Shame on you, @ikegami. – zx81 Jul 30 '14 at 12:16
  • You cannot do `(?<=AB+)C`, but you can do `AB+C`. No need for `\K`. Please give an example where `\K` can't simply be removed with no effect! – ikegami Jul 30 '14 at 12:19
  • Re "So you downvote working answers?" Working???? Half the solutions don't produce `123456` as the OP requested. I gave you plenty of chances for you to back up your claims. Shame on *you*. I already explained I didn't mean to call you stupid and had removed the comment within 5s of it being posted. – ikegami Jul 30 '14 at 12:26
  • @ikegami They do produce `123456`. [DEMO](http://regex101.com/r/dU5lQ3/6) They do. They do. They do. Goodbye. – zx81 Jul 30 '14 at 12:29
  • No they don't. If you want to be treated like a baby, I'll do that: **Your answers using `\K` don't produce `123456` in Perl** unless you use it in substitutions (which we've both said are not applicable) or unless you introduce a serious bug. After giving you a colossal number of chances to prove otherwise, I downvoted your answer for including a lot of incorrect information. – ikegami Jul 30 '14 at 12:58
  • @ikegami Why don't you just provide your own "correct" answer? Less effort than comment-spamming on zx81, who provides working answers of high quality and knowledge. – Jonny 5 Jul 30 '14 at 15:17
  • @Jonny 5, It really doesn't matter if he normally provides working answers. // As for why I didn't provide an answer, it's like you said: most of his solution is fine; only part of it doesn't work. I didn't want to just go and copy the working parts of his solution. Is that what you are asking me to do? // As for our "comment spamming", he *seems* to know what he was talking about and he kept claiming he knew what he was talking about, so I thought I might have missed something. But turns out that as well as he knows regex, he doesn't know how they're used in Perl. – ikegami Jul 30 '14 at 15:19
  • Thanks! I actually had it case insensitive but was able to use this answer to get where I needed to go :) Thanks a lot! – Bob Fishel Jul 30 '14 at 18:56
3

The problem is not greediness; it's case-sensitivity.

Currently your regex matches the 12 at the end of serialnun12 because those are the only digits following serial. The ones you want follow SERIAL. S and s are different characters.

There are two solutions.

  1. Use the uppercase characters in the pattern.

    my ($serial) = $string =~ /SERIAL\D*(\d+)/;
    
  2. Use case-insensitive matching.

    my ($serial) = $string =~ /serial\D*(\d+)/i;
    

    There's probably no need for this, but I thought I'd mention it just in case.
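For completeness, here is a small runnable sketch of both solutions against the string from the question. The `$string` and `$missing` variables and the non-matching input are assumptions added for illustration. It also shows what happens when nothing matches: the list assignment leaves the variable undefined, so it is worth checking with `defined`.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# Solution 1: uppercase literal.
my ($serial_uc) = $string =~ /SERIAL\D*(\d+)/;

# Solution 2: case-insensitive matching.
my ($serial_ci) = $string =~ /serial\D*(\d+)/i;

# On a string with no serial number, the capture list is empty and the
# variable stays undef.
my ($missing) = 'no number here' =~ /serial\D*(\d+)/i;

print "uppercase: $serial_uc\n";
print "case-insensitive: $serial_ci\n";
print "missing: ", (defined $missing ? $missing : 'undef'), "\n";
```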

ikegami
  • @Jonny, `\K` in a match is only useful if you also use `$&`, which causes every single match without captures in your entire interpreter to be slower when used. – ikegami Jul 30 '14 at 18:31
  • The [manual says](http://perldoc.perl.org/perlfaq6.html#Why-does-using-%24%26%2C-%24%60%2C-or-%24%27-slow-my-program-down%3F): ``the special variables @- and @+ can functionally replace $`, $& and $'``. How would that look then for an entire `[0]` match, what to replace `$&` with, or is that not possible? – Jonny 5 Jul 30 '14 at 19:30
  • @Jonny 5, There you go, you found a use, although a very very obscure one. // I can actually think of another: `/PAT/g` in list context effectively returns `$&` when the pattern doesn't contain any captures. (e.g. `say for /.\K./g` === `say for /.(.)/g`) I get chastised whenever I use that, though. – ikegami Jul 30 '14 at 19:46