3

I have the following pcre that works just fine:

/[c,f]=("(?:[a-z A-Z 0-9]|-|_|\/)+\.(?:js|html)")/g

It produces the desired output "foo.js" and "bar.html" from the inputs

<script src="foo.js"...
<link rel="import" href="bar.html"...

Problem is, the OS X version of grep doesn't seem to have any option like -o to only print the captured group (according to another SO question, that apparently works on linux). Since this will be part of a makefile, I need a version that I can count on running on any *nix platform.

I tried sed but the following

s/[c,f]=("(?:[[:alphanum:]]|-|_|\/)+\.(?:js|html)")/\1/pg

Throws an error: 'invalid operand for repetition-operator'. I've tried trimming it down and excluding the filepath separator characters, but I just can't seem to crack it. Any help translating my pcre into something that I'm pretty much guaranteed to have on a POSIX-compliant (even unofficially so) platform?

P.S. I'm aware of the potential failure modes inherent in the regex I wrote, it only will be used against very specific files with fairly specific formatting.

Lucas Trzesniewski
Jared Smith
  • 1
    There's no such thing as *"universal"* regex, FTFY ;-) – Lucas Trzesniewski Feb 28 '16 at 21:34
  • 1
    @LucasTrzesniewski no but there are universal POSIX utilities like grep and sed. Problem was that grep flags are apparently platform dependent (along BSD/gnu lines) and I couldn't figure out how to do it in sed. I just can't count on pcregrep or gnu grep. Your edit was certainly cogent though. – Jared Smith Feb 28 '16 at 22:23

3 Answers

3

POSIX defines two flavors of regular expressions:

  • BREs (Basic Regular Expressions) - the older flavor with fewer features and the need to \-escape certain metacharacters, notably \(, \) and \{, \}, and no support for duplication symbols \+ (emulate with \{1,\}) and \? (emulate with \{0,1\}), and no support for \| (alternation; cannot be emulated).

  • EREs (Extended Regular Expressions) - the more modern flavor, which, however, lacks regex-internal back-references (which is not the same as capture groups); there is also no support for word-boundary assertions (e.g., \<) and no support for non-capturing groups.
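
As a small illustration of the quantifier emulation described above, the sketch below matches the same lines once with an ERE and once with the equivalent BRE. The sample strings and variable names are made up for the demonstration.

```shell
# Hypothetical sample input: one string per line.
sample='ac
abc
abbc'

# ERE: + means "one or more" (grep -E is specified by POSIX).
ere_plus=$(printf '%s\n' "$sample" | grep -E '^ab+c$')

# BRE: emulate + with the interval expression \{1,\}.
bre_plus=$(printf '%s\n' "$sample" | grep '^ab\{1,\}c$')

# ERE ? versus BRE \{0,1\} ("zero or one").
ere_opt=$(printf '%s\n' "$sample" | grep -E '^ab?c$')
bre_opt=$(printf '%s\n' "$sample" | grep '^ab\{0,1\}c$')

printf '%s\n' "$bre_plus"   # prints abc and abbc
printf '%s\n' "$bre_opt"    # prints ac and abc
```

Both spellings select the same lines; only the syntax differs.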

POSIX also mandates which flavor each utility supports: some support only BREs, some only EREs, and some let you choose; notably:

  • grep uses BREs by default, but can enable EREs with -E
  • sed, sadly, only supports BREs
    • Both GNU and BSD sed, however, do support EREs as a nonstandard extension via the -E switch (with GNU sed the better-known alias is -r, but -E is supported too).
  • awk only supports EREs
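
To make the list above concrete, here is a minimal sketch that applies one ERE with both grep -E and awk to the two sample lines from the question. The variable names are illustrative only.

```shell
# The two sample lines from the question.
input='<script src="foo.js"...
<link rel="import" href="bar.html"...'

# grep needs -E to read the pattern as an ERE (alternation is ERE-only) ...
grep_count=$(printf '%s\n' "$input" | grep -Ec '\.(js|html)"')

# ... whereas awk treats every regex as an ERE, so no switch is needed.
awk_count=$(printf '%s\n' "$input" |
  awk '/\.(js|html)"/ { n++ } END { print n + 0 }')

echo "grep -E: $grep_count, awk: $awk_count"
```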

Additionally, the regex libraries on both Linux and BSD/OSX implement extensions to the POSIX ERE syntax - sadly, these extensions are in part incompatible (such as the syntax for word-boundary assertions).

As for your specific regex:

It uses the syntax for non-capturing groups, (?:...), which POSIX regular expressions do not support; in any case, the distinction between capturing and non-capturing groups is pointless in the context of grep, because grep offers no replacement feature.

If we remove this aspect, we get:

[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)") 

This is now a valid POSIX ERE (which can be simplified - see Benjamin W's helpful answer).
However, since it is an Extended RE, using sed is not an option, if you want to remain strictly POSIX-compliant.

Because both GNU and BSD/OSX sed happen to implement -E to support EREs, you can get away with sed, if these platforms are the only ones you need to support - see anubhava's answer.

Similarly, both GNU and BSD/OSX grep happen to implement the nonstandard -o option (unlike what you state in your question), so, again, if these platforms are the only ones you need to support, you can use:

$ grep -Eo '[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file | cut -c 3-
"foo.js"
"bar.html"

(Note that only GNU grep supports -P to enable PCREs, which would simplify the solution to the following (note the \K, which drops everything matched so far):

$ grep -Po '[c,f]=\K("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file

)

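If you really wanted a strictly POSIX-compliant solution, you could use awk; its match() function sets RSTART and RLENGTH, so the match can be cut out with substr() no matter how many other quoted attributes precede it on the line:

$ awk 'match($0, /[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/) { print substr($0, RSTART + 2, RLENGTH - 2) }' file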
mklement0
2

On OS X, the following sed command should work with your given input:

sed -E 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~' file

"foo.js"
"bar.html"

anubhava
  • That works as expected. Explanation would be helpful about how this works. It looks like you've used tilde as the separator instead of slash but I'm not sure I understand the differences between this and my original regex. – Jared Smith Feb 28 '16 at 20:37
  • 1
    ++; it works with GNU `sed` too (GNU `sed` accepts `-E` as an alias of `-r`). – mklement0 Feb 29 '16 at 02:53
  • @JaredSmith: Actually it is the same PCRE regex that you had in your question. I just refactored it a bit to remove the redundant comma from the first `[...]`, converted `-|_|\/` to `[_/-]` and removed the non-capturing group, as that is not supported in ERE or BRE. – anubhava Feb 29 '16 at 04:08
  • In other words you could also use: `sed -E 's/.*[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)").*/\1/' file` with just removal of `(?:...)` as non-capturing group as that is not supported in `sed` – anubhava Feb 29 '16 at 04:10
2

The spec for POSIX sed points out that only basic regular expressions (BRE) are supported, so no + or |; non-capturing groups aren't even in the spec for extended regular expressions (ERE).

Thankfully, both GNU sed and BSD sed support ERE, so we can use alternation and the + quantifier.

A few points:

  • Did you really want that comma in the first bracket expression? I suspect it could be just [cf].
  • The expression

    (?:[a-z A-Z 0-9]|-|_|\/)+
    

    can be simplified to a single bracket expression,

    [a-zA-Z0-9_\/ -]+
    

    Only one space is needed. You can also use a POSIX character class: [[:alnum:]_/ -]+. Not sure if your [:alphanum:] tripped sed up.

  • For the whole expression between quotes, I'd just use an expression for "something between quotes, ending in .js or .html, preceded by non-quotes":

    "[^"]+\.(js|html)"
    
  • To emulate grep -o behaviour, you have to also match everything before and after your expression on the line with .* at the start and end of your regex.

All in all, I'd say that for a sed using ERE (-r option for GNU sed, -E option for BSD sed), this should work:

sed -rn 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p' infile

Or, with BRE only (requiring two commands because of the alternation):

sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p' infile

Notice how BRE can emulate the + quantifier with [abc][abc]* instead of [abc]+.
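
A quick way to check that the two variants agree is to run both against the sample lines from the question, as in the sketch below. The extra non-matching line and the variable names are assumptions added for the demonstration, and -E is used instead of -r because both GNU and BSD sed accept it.

```shell
# Sample lines from the question, plus one made-up line that should not match.
input='<script src="foo.js"...
<link rel="import" href="bar.html"...
<p>no match here</p>'

# ERE variant; -E is a nonstandard extension accepted by GNU and BSD sed.
ere_out=$(printf '%s\n' "$input" |
  sed -En 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p')

# BRE variant, strictly POSIX: [^"][^"]* stands in for [^"]+,
# and one s command per extension stands in for (js|html).
bre_out=$(printf '%s\n' "$input" |
  sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p')

printf '%s\n' "$bre_out"
```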

The limitation of this approach is that if there are multiple matches on the same line, only one of them will be printed (the last one, because the leading .* is greedy), since the s/// command removes everything before and after the part we extract.
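
If several matches per line ever did need to be handled, one possible workaround is a POSIX awk loop around match(), sketched below. The sample line and the variable names are made up for illustration; this is an alternative technique, not part of the sed commands above.

```shell
# Hypothetical input with two matches on a single line.
line='<script src="a.js"></script><link rel="import" href="b.html">'

all_matches=$(printf '%s\n' "$line" | awk '
{
  rest = $0
  # match() sets RSTART and RLENGTH for the leftmost match in rest.
  while (match(rest, /[cf]="[^"]+\.(js|html)"/)) {
    # Skip the two-character prefix (c= or f=) and print the quoted name.
    print substr(rest, RSTART + 2, RLENGTH - 2)
    # Continue searching after the current match.
    rest = substr(rest, RSTART + RLENGTH)
  }
}')

printf '%s\n' "$all_matches"
```

Each match is printed on its own line, in the order it appears in the input.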

Benjamin W.
  • Thanks for the breakdown. And no, there won't be multiple matches per line, the problem is there's a conditional in a makefile I'm writing that either cats the file or strips out the regex matches, the file structure is known. As for the alphanum thing, I pulled that from some documentation site that was clearly written in the early-mid nineties. – Jared Smith Feb 29 '16 at 00:35
  • ++; note that GNU `sed` _also_ accepts `-E` (in lieu of `-r`), even though the man page doesn't say so. – mklement0 Feb 29 '16 at 02:47
  • @mklement0 So it does! Interesting. Brings it in line with `grep -E`. – Benjamin W. Feb 29 '16 at 03:38
  • Yes, though, unfortunately, with `sed -E` you really have to limit yourself to the _POSIX_ ERE features (and many other POSIX-mandated `sed` behaviors - see [this answer](http://stackoverflow.com/a/24276470/45375) of mine for the gory details) to make a given command work on both Linux and BSD/OSX; with `grep -E`, there is more overlap in the nonstandard ERE extensions. – mklement0 Feb 29 '16 at 03:45