sed is returning more than I need

Question

Every line of the input file will match one of the patterns:

"SCnnnn"
"SC-nnnn"
"SC_nnnn"

(n = [0-9]; SC is literal but may be upper- or lowercase, and is followed, after an optional - or _, by 1-4 digits that end at a letter, space or other non-digit character.)

Somewhere in the line there will also be a file extension (matching ".abc"), where each of a, b and c is an upper- or lowercase letter or a digit.

I want to extract the first pattern and print this together with the extracted file extension for each line. This is what I have so far:

sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile

Here's a sample input line:

SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz

with required output being:

SC1867.xyz

but what I am getting is:

SCSCSCSCSCSCSCSCSC1867.xyz

Can someone please tell me why this is returning the "SC"s before the part I want? I know it's something to do with greediness, but I can't get my head around it.

(Everything works fine where my "SCnnnn" match is at the beginning of the line.)

I am open to other tools - e.g. awk - if they offer a more straightforward solution.

EDIT: I think I found a solution - at least it appears to work:

sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p'

Bujiraso · Accepted Answer · 2016-02-27T22:38:45.143

It's actually not necessarily the greediness that is at play here. The s command replaces only the part of the line that the regex matches, and the p flag on your s// command then prints the whole line, including any text before the match that was never touched.

To see more clearly what's happening, make infile contain a more obvious string like 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz and run your first command. This is the result:

[user@localhost ~]$ sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
0o0o0o0o0o0o0o0oSC1867.xyz

As a slow-mo: sed finds your [Ss][Cc] characters right after the 0o0o0s and dutifully replaces the matched string with the desired substitution; namely, it keeps the SC part and the digits, then deletes everything after the digits up to the file extension. The text before the match is never part of the substitution, so when the p flag prints the line, all of the unwanted 0oze is still there.
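
To make the difference concrete, here is a small sketch that runs both of your commands on a shortened version of my test string (the variable names partial and whole are just for illustration). It assumes a sed that accepts -E, as yours does:

```shell
# Shortened test line: junk prefix, SC number, filler, extension.
line='0o0o0oSC1867lalala.xyz'

# Unanchored: the match starts at "SC", so the prefix "0o0o0o" is
# outside the match and survives the substitution.
partial=$(printf '%s\n' "$line" |
  sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p')

# Leading .* pulls the prefix into the match, so it is replaced as well.
whole=$(printf '%s\n' "$line" |
  sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p')

printf '%s\n%s\n' "$partial" "$whole"
# prints:
# 0o0o0oSC1867.xyz
# SC1867.xyz
```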

Alternately

An alternative is to match the entire line and rewrite it, rather than printing a partially changed line. The following command printed the correct result to stdout for a file containing your example string:

[user@localhost ~]$ sed -e 's/^.*\([Ss][Cc][-_]\?[0-9]\{4\}\).*\(\.[a-zA-Z0-9]\{3\}\)$/\1\2/' infile
SC1867.xyz

To break that regex down a bit: it anchors at the beginning of the line (^) and consumes characters (.*) up to an SC (upper or lower case, [Ss][Cc]), followed by an optional hyphen or underscore ([-_]\?, where \? is a GNU extension) and exactly four digits ([0-9]\{4\}). Then all characters are consumed up to a dot (\.) followed by exactly three alphanumeric characters ([a-zA-Z0-9]\{3\}) and the end of the line ($). The two parenthesized groups are captured and concatenated in the replacement (\1\2); everything else on the line is discarded. Because the leading .* is greedy, a line with several SCnnnn candidates yields the last one. (I use [a-zA-Z0-9] rather than a range like [a-Z], which is locale-dependent, rejected in some locales, and never matches digits.)

... sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/' infile works too, if you don't enjoy backslashes as much as I do.
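
If the number of digits varies, the same whole-line idea can be adapted. The following is only a sketch under the assumptions stated in the question (1-4 digits, a three-character alphanumeric extension at the end of the line); the function name extract_sc and the sample lines are made up for illustration:

```shell
# Hypothetical helper: reads lines on stdin, prints "SCnnnn.ext" for each
# line that matches and nothing for lines that do not (because of -n ... p).
extract_sc() {
  sed -E -n 's/^.*([Ss][Cc][-_]?[0-9]{1,4}).*(\.[a-zA-Z0-9]{3})$/\1\2/p'
}

out=$(printf '%s\n' \
  'SCSCSCSC1867SCBrSC&SCBl.xyz' \
  'junk sc-42 more.Mp3' \
  'no match here' | extract_sc)

printf '%s\n' "$out"
# prints:
# SC1867.xyz
# sc-42.Mp3
```

Using -n together with the p flag is safe here because the whole line is matched, so a printed line contains only the two captured groups.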

Thank you. That's a very useful and clear explanation. The number of digits is variable (but always at least one), but as it's always terminated with a non-digit I guess I am OK to use the [0-9][0-9]* that I have in my starting regex? Why do you escape the braces, e.g. \{4\}? I don't think I've found a need to do that. — Lorccan, Feb 28 '16 at 00:43
@Lorccan With `-E`, instead of `[0-9][0-9]*` you can use `[0-9]+`. The `+` means "one or more", so that takes care of the forced `[0-9]`, and `[0-9]+` works for your variable amount (one or more). If you want exactly one to four, use `[0-9]{1,4}`. As for escaping the braces: my `-e` command uses sed's default basic regular expressions, which require it, whereas capital `-E` (extended regular expressions) doesn't seem to. Choose whichever you like. Before your post I didn't even know about `-E` because it's not in the man pages I've seen (it might be non-standard, too, if you care about that). — Bujiraso, Feb 28 '16 at 01:17
Thanks again. (The -E is a variant for extended regex on OS X and other FreeBSD based systems, so far as I know.) — Lorccan, Feb 28 '16 at 12:14
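
Since the question is open to awk: if "first" means the first SCnnnn occurrence in the line, awk's match() may be a more direct fit, because it reports the leftmost match, whereas a greedy leading .* in sed ends up at the last one. This is a hedged sketch, not a tested drop-in; the function name first_sc and the sample line are hypothetical, and it assumes the extension sits at the very end of the line:

```shell
# Hypothetical helper: prints the leftmost SC number plus the trailing extension.
first_sc() {
  awk '
    # match() sets RSTART/RLENGTH to the leftmost match in the line.
    match($0, /[Ss][Cc][-_]?[0-9]+/) {
      id = substr($0, RSTART, RLENGTH)
      # Assumption: the extension is the last four characters of the line.
      if (match($0, /\.[A-Za-z0-9][A-Za-z0-9][A-Za-z0-9]$/))
        print id substr($0, RSTART, RLENGTH)
    }
  '
}

sample='xxSC12yySC3456zz.txt'
first=$(printf '%s\n' "$sample" | first_sc)
last=$(printf '%s\n' "$sample" |
  sed -E -n 's/^.*([Ss][Cc][-_]?[0-9]+).*(\.[a-zA-Z0-9]{3})$/\1\2/p')

printf 'awk: %s\nsed: %s\n' "$first" "$last"
# prints:
# awk: SC12.txt
# sed: SC3456.txt
```

The awk pattern spells out the three extension characters instead of using {3}, since some older awk implementations do not support interval expressions.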
