I'm working on a sed replacement pattern that takes a file name with a path and prints both the unaltered input and the modified input (the modification should apply only to the file name, not the path). A path component is always given explicitly, so the input will never be a bare "test.c"; it would be "./test.c" in that instance.

I have some regex that is working, but only when the part to be substituted is at the start of the file name. I'd like it to work with any part of the file name.

To clarify, here's a summary of the current behavior.

INPUT           PATTERN              CURRENT OUTPUT       DESIRED BEHAVIOR?

./test.c        p;s;\(.*/\)t;\1###;  ./test.c             Yes
                                     ./###est.c

testdir/test.c  p;s;\(.*/\)tes;\1#;  testdir/test.c       Yes
                                     testdir/#t.c

./test.c        p;s;\(.*/\)est;\1#;  ./test.c             NO
                                     ./test.c             Should be ./t#.c

As you can see, this is my pattern:

p;s;\(.*/\)SubstringToFind;\1SubstringToInsert;

Here's my logic for this:

p to print the unmodified line

s for substitution (yeah, we all know that)

Capture the path: Any number of characters plus a final "/"

SubstringToFind: Look for a pattern somewhere in the file name.

\1: Recover the unmodified path

SubstringToInsert: Insert the modified file name

I'm sure this just calls for one tiny little addition like one more ".*", which I've tried adding right after the capturing parentheses, but that just replaces the specified pattern along with all file name characters before it. I'm sure the solution is staring me in the face but I'm not seeing it. Any help from someone more seasoned in regex than myself? Thanks!
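For reference, the attempt described above (an extra ".*" placed right after the capturing parentheses) can be reproduced with a short sketch. The function name `attempt` is only an illustrative label, not part of the original pattern.

```shell
#!/bin/sh
# Hypothetical helper: the extra ".*" sits OUTSIDE the capture group, so
# whatever it matches belongs to the matched text but not to \1, and is
# therefore dropped by the replacement.
attempt() { sed -e 'p;s;\(.*/\).*est;\1#;'; }

printf '%s\n' ./test.c | attempt
```

On ./test.c this prints the original line followed by ./#.c: the leading "t" of the file name is consumed by the extra ".*" and is not put back.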

shellter
Gregalor
  • Nice Q!, not sure I can help (short on time). If you don't like the formatting changes I've made, you can undo them by clicking on the `edited X time ago` link above my name and `reverting` to previous. If you like the changes, then know that you can format as code any string by surrounding it in back-quotes. (Hard to display them in comments). You seem to know how to format full lines! ;-) Good luck! – shellter May 30 '19 at 12:31
  • In your final test, you're matching `./` but not `est` because of the intervening `t` at the beginning of `test`. Add a 2nd capture group for a single char? Good luck. – shellter May 30 '19 at 12:56

1 Answer

You were indeed very close. Your pattern for the last case looks for est immediately after the slash specified in the capture, but est isn't immediately after the slash.

You need:

$ echo ./test.c | sed -e 'p;s;\(.*/[^/]*\)est;\1#;'
./test.c
./t#.c
$

I used the negated character class [^/]* so that this part of the match cannot run across a slash the way .* can, but you could also use .* here (and I'm not sure there's actually any significant benefit to the negated character class). A slightly more stringent test:

$ printf '%s\n' ./test.c ./southwest.c ./west-by-southwest.c south/west/southwestern.c |
> sed -e 'p;s;\(.*/[^/]*\)est;\1#;'
./test.c
./t#.c
./southwest.c
./southw#.c
./west-by-southwest.c
./west-by-southw#.c
south/west/southwestern.c
south/west/southw#ern.c
$
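
One case the test above does not cover is a file name that does not contain est while a directory name does. In that situation the leading .*/ can back off to an earlier slash, and the directory gets edited instead. The following is only a sketch of one way to guard against that, assuming the desired behavior is to leave such lines untouched; the function names `unanchored` and `anchored` are illustrative labels.

```shell
#!/bin/sh
# Hypothetical helper names; both read paths on stdin and print each line
# twice (original, then possibly edited), like the commands above.

# The pattern from the answer: .*/ may fall back to an earlier slash.
unanchored() { sed -e 'p;s;\(.*/[^/]*\)est;\1#;'; }

# Anchored sketch: after "est", only non-slash characters may follow up to
# the end of the line, so the match has to lie in the last path component.
anchored() { sed -e 'p;s;\(.*/[^/]*\)est\([^/]*\)$;\1#\2;'; }

printf '%s\n' south/west/foo.c | unanchored
printf '%s\n' south/west/foo.c ./test.c | anchored
```

With the unanchored pattern, south/west/foo.c becomes south/w#/foo.c. With the anchored one it is printed unchanged, while ./test.c still becomes ./t#.c.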
Jonathan Leffler
  • This is progress, but fails in one specific edge case: substituting a single character that is the first character of the file name. `p;s;\(.*/[^/]*\)t;\1#;` on ./test.c actually outputs ./tes#.c, whereas the expected output would be ./#est.c (Sorry for all the edits) – Gregalor May 30 '19 at 17:02
  • Regular expressions are a fine balancing act between precision and simplicity: how simple can you make it without matching the wrong stuff? Depending on what you want, you might use `p;s;\(.*/[^t]*\)t;\1#;` for this current edge case. If you want to match the first occurrence of a single letter after the slash (`t` in this case), that's the pattern to use. – Jonathan Leffler May 30 '19 at 17:39
  • So there might not be a general-purpose regex expression to handle any pattern that could be found at any spot in any given file name? At least not while the path is being captured in front of it, that's what's messing it up, I suppose. I wonder if I could separate the path and file name into separate lines and only have sed operate on the file name, then put them back together. Or just have an if-statement handling the one edge case. The end goal here is a program that uses sed and awk to rename files given a "find" substring and a replacement one. – Gregalor May 30 '19 at 17:48
  • Roughly, yes. You could use two or more captures. For example, in extended-regex syntax (`sed -E`), `s;(([^/]*/)*)([^/t]*)t;\1\3#;` has 3 captures. The innermost (second) capture is `([^/]*/)`, which looks for a path component such as `south/`. This is repeated zero or more times by the first capture, `(([^/]*/)*)`, which is followed by the third capture, `([^/t]*)`, and then the `t` you're looking for; this is replaced by `\1` (the first, outer capture), then `\3` (the prefix not containing `t` or `/`), and then the `#` replacement. And you could use a variable to hold the letter if you're very careful. _[…continued 1…]_ – Jonathan Leffler May 30 '19 at 17:55
  • _[…continuation 1…]_ However, if you want to find and replace a multi-letter string, it is harder; if you want to replace `test` with `prod`, you can't afford to list `tes` in the character class as it would stop you matching `truetest.c`. Then you go back to `([^/]*)test` to match up to the last `test` in the string; matching to the first is hard in `sed`. Again, it comes down to knowing the data. Generic regexes seldom work everywhere — hunt down the 'generic "is the email address valid"' regex question (about the most popular regex question), to see what I mean. _[…continued 2…]_ – Jonathan Leffler May 30 '19 at 17:59
  • _[…continuation 2…]_ The fully general matcher is horrendous, even using more powerful regex mechanisms than `sed` has. So, you have to decide where the compromises are. Often, using Perl will allow you to get the job done; it has more powerful regexes than `sed` (PCRE is close to what Perl provides, of course). It has negative lookbehinds and other esoteric constructs which (probably) allow you to find the first `test` instead of the last one. – Jonathan Leffler May 30 '19 at 18:03
  • The regex for emails question is [How to validate an email address using a regular expression?](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression) — there are three other [tag:regex]-tagged questions with more votes, one asking about validating email addresses in JavaScript. – Jonathan Leffler May 30 '19 at 18:06
  • Yes, I see what you mean. Since this is working with all but one very specific test case, I think I'll try to automatically notice and handle that one case. Thanks! – Gregalor May 30 '19 at 18:08
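
The idea raised in the comments, splitting the path from the file name, editing only the file name, and joining them again, can be sketched in two ways. Neither is from the original thread: `rename_preview` and `rename_preview_sed` are hypothetical names, and both assume the find and replacement strings contain no `;`, `&`, backslashes, or regex metacharacters, since they are pasted into the `sed` script unescaped.

```shell
#!/bin/sh
# Sketch 1: split with shell parameter expansion, run sed on the file name only.
# Usage: rename_preview PATH FIND REPLACEMENT   (hypothetical helper)
rename_preview() {
    path=$1 find=$2 repl=$3
    dir=${path%/*}       # everything before the last slash
    base=${path##*/}     # everything after the last slash
    new=$(printf '%s\n' "$base" | sed "s;$find;$repl;")
    printf '%s\n%s/%s\n' "$path" "$dir" "$new"
}

# Sketch 2: the same idea inside sed, using the hold space.
#   p            print the unmodified line
#   h            save a copy of the line in the hold space
#   s;.*/;;      strip the path, leaving only the file name
#   s;FIND;REPL; edit the file name (first occurrence)
#   x            swap: pattern space = original line, hold = edited name
#   s;[^/]*$;;   strip the file name, leaving only the path
#   G            append a newline plus the edited name
#   s;\n;;       join the two halves
rename_preview_sed() {
    printf '%s\n' "$1" |
        sed -e 'p' -e 'h' -e 's;.*/;;' -e "s;$2;$3;" \
            -e 'x' -e 's;[^/]*$;;' -e 'G' -e 's;\n;;'
}

rename_preview ./test.c t '#'
rename_preview_sed ./west-by-southwest.c est '#'
```

Because `sed` only ever sees the file name here, the single-character edge case works (./test.c becomes ./#est.c), and the first occurrence is replaced rather than the last (./west-by-southwest.c becomes ./w#-by-southwest.c). Directory names are never touched.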