Translate PCRE pattern to POSIX

Question

I have the following pcre that works just fine:

/[c,f]=("(?:[a-z A-Z 0-9]|-|_|\/)+\.(?:js|html)")/g

It produces the desired output "foo.js" and "bar.html" from the inputs

<script src="foo.js"...
<link rel="import" href="bar.html"...

Problem is, the OS X version of grep doesn't seem to have any option like -o to only print the captured group (according to another SO question, that apparently works on linux). Since this will be part of a makefile, I need a version that I can count on running on any *nix platform.

I tried sed but the following

s/[c,f]=("(?:[[:alphanum:]]|-|_|\/)+\.(?:js|html)")/\1/pg

Throws an error: 'invalid operand for repetition-operator'. I've tried trimming it down and excluding the filepath separator characters, but I just can't seem to crack it. Any help translating my PCRE into something that I'm pretty much guaranteed to have on a POSIX-compliant (even unofficially so) platform?

P.S. I'm aware of the potential failure modes inherent in the regex I wrote, it only will be used against very specific files with fairly specific formatting.

@LucasTrzesniewski no but there are universal POSIX utilities like grep and sed. Problem was that grep flags are apparently platform dependent (along BSD/gnu lines) and I couldn't figure out how to do it in sed. I just can't count on pcregrep or gnu grep. Your edit was certainly cogent though. — Jared Smith, Feb 28 '16 at 22:23

score 3 · Answer 1 · edited May 23 '17 at 12:23

POSIX defines two flavors of regular expressions:

BREs (Basic Regular Expressions) - the older flavor with fewer features and the need to \-escape certain metacharacters, notably \(, \) and \{, \}, and no support for duplication symbols \+ (emulate with \{1,\}) and \? (emulate with \{0,1\}), and no support for \| (alternation; cannot be emulated).
EREs (Extended Regular Expressions) - the more modern flavor, which, however, lacks regex-internal back-references (which is not the same as capture groups); there is also no support for word-boundary assertions (e.g., \<) and no support for non-capturing groups.

POSIX also mandates which flavor each utility supports: BREs only, EREs only, or either one; notably:

grep uses BREs by default, but can enable EREs with -E
sed, sadly, only supports BREs
- Both GNU and BSD sed, however, - as a nonstandard extension - do support EREs with the -E switch (the better known alias with GNU sed is -r, but -E is supported too).
awk only supports EREs

Additionally, the regex libraries on both Linux and BSD/OSX implement extensions to the POSIX ERE syntax - sadly, these extensions are in part incompatible (such as the syntax for word-boundary assertions).
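
To make the BRE/ERE difference concrete, here is a minimal sketch using grep, which POSIX requires to support both flavors. The sample input and the variable names are made up for illustration.

```shell
# Hypothetical sample input: only the last line consists entirely of digits.
input='abc
a1b
123'

# BRE (the default for grep): "one or more" is spelled as the interval \{1,\}.
bre_out=$(printf '%s\n' "$input" | grep '^[0-9]\{1,\}$')

# ERE (grep -E): the + duplication symbol is available directly.
ere_out=$(printf '%s\n' "$input" | grep -E '^[0-9]+$')

# Both commands select the same single line.
printf '%s\n' "$bre_out"
printf '%s\n' "$ere_out"
```

Both invocations stay within what POSIX specifies for grep, so they should behave the same on GNU and BSD systems.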

As for your specific regex:

It uses the syntax for non-capturing groups, (?:...), which POSIX EREs do not support; the distinction between capturing and non-capturing groups is moot in the context of grep anyway, because grep offers no replacement feature.

If we remove this aspect, we get:

[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")

This is now a valid POSIX ERE (which can be simplified - see Benjamin W's helpful answer).
However, since it is an Extended RE, using sed is not an option, if you want to remain strictly POSIX-compliant.

Because both GNU and BSD/OSX sed happen to implement -E to support EREs, you can get away with sed, if these platforms are the only ones you need to support - see anubhava's answer.

Similarly, both GNU and BSD/OSX grep happen to implement the nonstandard -o option (unlike what you state in your question), so, again, if these platforms are the only ones you need to support, you can use:

$ grep -Eo '[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file | cut -c 3-
"foo.js"
"bar.html"

(Note that only GNU grep supports -P to enable PCREs, which would simplify the solution to (note the \K, which drops everything matched so far):

$ grep -Po '[c,f]=\K("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")' file

)

If you really wanted a strictly POSIX-compliant solution, you could use awk:

$ awk 'match($0, /[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/) { print substr($0, RSTART + 2, RLENGTH - 2) }' file
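
As a runnable sketch, the awk approach can be tried against the two sample lines from the question. This version uses match() together with RSTART and RLENGTH (all part of POSIX awk) to cut the quoted name out of the match itself, so the result does not depend on which quote-delimited field the name happens to land in. The input is inlined, and the function name extract is hypothetical.

```shell
# Inline copy of the two sample lines from the question.
sample='<script src="foo.js"...
<link rel="import" href="bar.html"...'

# extract (hypothetical name): print the quoted file name from each matching line.
# match() records where the regex matched in RSTART and RLENGTH; skipping the
# first 2 characters of the match drops the leading c= or f=.
extract() {
  awk 'match($0, /[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)")/) {
    print substr($0, RSTART + 2, RLENGTH - 2)
  }'
}

result=$(printf '%s\n' "$sample" | extract)
printf '%s\n' "$result"
```

Lines without a match produce no output, because the print action only runs when match() returns a nonzero position.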

I only need to worry about BSD, OS X, and linux, so the other answer is fine for my needs, but thank you for the more portable solution anyways. — Jared Smith, Feb 29 '16 at 01:00
@JaredSmith: My pleasure; you should accept the other answer, then. — mklement0, Feb 29 '16 at 02:58

score 2 · Accepted Answer · answered Feb 28 '16 at 19:06

On OSX following sed should work with your given input:

sed -E 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~' file

"foo.js"
"bar.html"

RegEx Demo
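
For readers wondering how the command is put together, here is an annotated sketch run against the sample lines from the question. The only additions are -n and the p flag, so that lines without a match are dropped instead of being echoed unchanged; the extra comment line in the input and the variable names are hypothetical.

```shell
# Sample lines from the question plus one hypothetical line with nothing to extract.
sample='<script src="foo.js"...
<!-- a line with nothing to extract -->
<link rel="import" href="bar.html"...'

# s~regex~replacement~ : the s command accepts other delimiters than /; using ~
#                        means the / inside the bracket expression needs no escaping.
# .* ... .*            : the pattern consumes the whole line, so replacing it
#                        with \1 leaves only the first capture group.
# [ a-zA-Z0-9_/-]+     : one bracket expression replaces the alternation; the -
#                        comes last so that it is taken literally.
result=$(printf '%s\n' "$sample" |
  sed -En 's~.*[cf]=("[ a-zA-Z0-9_/-]+\.(js|html)").*~\1~p')

printf '%s\n' "$result"
```

Without -n and p, the middle line would be printed as-is, which may or may not be what a makefile rule wants.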

anubhava

That works as expected. Explanation would be helpful about how this works. It looks like you've used tilde as the separator instead of slash but I'm not sure I understand the differences between this and my original regex. – Jared Smith Feb 28 '16 at 20:37
++; it works with GNU `sed` too (GNU `sed` accepts `-E` as an alias of `-r`). – mklement0 Feb 29 '16 at 02:53
@JaredSmith: Actually it is the same PCRE regex that you had in the question. I just refactored it a bit to remove the redundant comma from the first `[...]`, converted `-|_|\/` to `[_/-]`, and removed the non-capturing group, as that is not supported in ERE or BRE. – anubhava Feb 29 '16 at 04:08
In other words, you could also use: `sed -E 's/.*[c,f]=("([a-z A-Z 0-9]|-|_|\/)+\.(js|html)").*/\1/' file`, with just the `(?:...)` non-capturing groups removed, as they are not supported in `sed`. – anubhava Feb 29 '16 at 04:10

score 2 · Answer 3 · answered Feb 29 '16 at 00:28

The spec for POSIX sed points out that only basic regular expressions (BRE) are supported, so no + or |; non-capturing groups aren't even in the spec for extended regular expressions (ERE).

Thankfully, both GNU sed and BSD sed support ERE, so we can use alternation and the + quantifier.

A few points:

Did you really want that comma in the first bracket expression? I suspect it could be just [cf].
The expression
```
(?:[a-z A-Z 0-9]|-|_|\/)+
```
can be simplified to a single bracket expression,
```
[a-zA-Z0-9_\/ -]+
```
Only one space is needed. You can also use a POSIX character class: [[:alnum:]_/ -]+. Not sure if your [:alphanum:], which is not a valid POSIX class name, tripped sed up.
For the whole expression between quotes, I'd just use an expression for "something between quotes, ending in .js or .html, preceded by non-quotes":
```
"[^"]+\.(js|html)"
```
To emulate grep -o behaviour, you have to also match everything before and after your expression on the line with .* at the start and end of your regex.
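
The simplification of the alternation into a single bracket expression can be checked with a small sketch. It runs the three spellings through grep -E against a handful of made-up names (the names and variable names are hypothetical) and should select the same lines each time.

```shell
# Hypothetical candidate names, one per line; the last two contain characters
# that none of the three patterns allow.
names='foo
lib/my-file_2
has space
bad.name
semi;colon'

# Original alternation form, with the PCRE-only ?: removed.
alt=$(printf '%s\n' "$names" | grep -E '^([a-z A-Z 0-9]|-|_|/)+$')

# Single bracket expression; the - comes last so that it is literal.
bracket=$(printf '%s\n' "$names" | grep -E '^[a-zA-Z0-9_/ -]+$')

# Same idea, using the POSIX character class [:alnum:] inside the brackets.
class=$(printf '%s\n' "$names" | grep -E '^[[:alnum:]_/ -]+$')

printf '%s\n' "$bracket"
```

For the ASCII input used here, all three variables hold the first three names.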

All in all, I'd say that for a sed using ERE (-r option for GNU sed, -E option for BSD sed; GNU sed accepts -E as well), this should work:

sed -rn 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p' infile

Or, with BRE only (requiring two commands because of the alternation):

sed -n 's/.*[cf]=\("[^"][^"]*\.js"\).*/\1/p;s/.*[cf]=\("[^"][^"]*\.html"\).*/\1/p' infile

Notice how BRE can emulate the + quantifier with [abc][abc]* instead of [abc]+.
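
As a quick check that the two flavors agree, here is a sketch that runs an ERE and a BRE command on the sample lines from the question. The BRE variant below uses the interval \{1,\} rather than the doubled bracket expression; both are standard ways of writing "one or more" in a BRE. The variable names are hypothetical.

```shell
# Inline copy of the two sample lines from the question.
sample='<script src="foo.js"...
<link rel="import" href="bar.html"...'

# ERE: + and | are available (-E is accepted by both GNU and BSD sed).
ere=$(printf '%s\n' "$sample" |
  sed -En 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p')

# BRE: groups are written \( \), "one or more" is the interval \{1,\}, and the
# js/html alternation becomes two separate s commands.
bre=$(printf '%s\n' "$sample" |
  sed -n 's/.*[cf]=\("[^"]\{1,\}\.js"\).*/\1/p
s/.*[cf]=\("[^"]\{1,\}\.html"\).*/\1/p')

printf '%s\n' "$ere"
printf '%s\n' "$bre"
```

Once the first s command has reduced a line to the quoted name, the second one no longer matches it, so each input line is printed at most once.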

The limitation of this approach is that if there are multiple matches on the same line, only one of them will be printed (in practice the last one, because the leading .* is greedy), since the s/// command removes everything before and after the part we extract.
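
A small sketch of this limitation, using a made-up line that carries two script tags (the input and variable names are hypothetical). Only one name comes out per line; since the leading .* is greedy and swallows as much of the line as it can, the name that survives in this run should be the one furthest to the right.

```shell
# Hypothetical input: two matching attributes on a single line.
line='<script src="a.js"></script><script src="b.js"></script>'

# The substitution replaces the entire line with one capture group, so at most
# one file name can be printed for this line.
result=$(printf '%s\n' "$line" |
  sed -En 's/.*[cf]=("[^"]+\.(js|html)").*/\1/p')

printf '%s\n' "$result"
```

If several names per line had to be extracted, the input would first need to be split so that each attribute sits on its own line.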

Thanks for the breakdown. And no, there won't be multiple matches per line, the problem is there's a conditional in a makefile I'm writing that either cats the file or strips out the regex matches, the file structure is known. As for the alphanum thing, I pulled that from some documentation site that was clearly written in the early-mid nineties. — Jared Smith, Feb 29 '16 at 00:35
++; note that GNU `sed` _also_ accepts `-E` (in lieu of `-r`), even though the man page doesn't say so. — mklement0, Feb 29 '16 at 02:47
@mklement0 So it does! Interesting. Brings it in line with `grep -E`. — Benjamin W., Feb 29 '16 at 03:38
Yes, though, unfortunately, with `sed -E` you really have to limit yourself to the _POSIX_ ERE features (and many other POSIX-mandated `sed` behaviors - see [this answer](http://stackoverflow.com/a/24276470/45375) of mine for the gory details) to make a given command work on both Linux and BSD/OSX; with `grep -E`, there is more overlap in the nonstandard ERE extensions. — mklement0, Feb 29 '16 at 03:45
