I am trying to match URL paths and capture the portion without the leading and trailing slashes (/). Input that is empty, either before or after trimming, should still match. The desired regex should behave as follows:

input-string        captured-string
-----------------------------------
/a/b/c/             a/b/c               
/a/b/c              a/b/c               
/                   (empty)
(empty)             (empty)

I use `echo /a/b/c/d | sed -nr 's=(/(.+?)/)?=\2=p'` and variations of it as a test tool, as suggested by gurus, and I notice that the following regular expressions fail to do the job:

regex           input-string    wrong capture
---------------------------------------------
(/(.+?)/)?      /a/b/c          a/bc
(/(.+?)/)       /a/b/c          a/bc
(/(.+?)/)       /a              (doesn't match)
(/(.+?)/?)      /a/b/c/         a/b/c/
(/([^/]+)/?)    /a/b/c          ab/c
(/([^/.+])/?)   /a/b/c          ab/c
/*(.*?)/*       /a/b/c/         a/b/c/

The supposedly correct answer does not work either:

echo /a/b/c | sed -nr 's=/*(?<x>.*?)/*=\k<x>=p'

because it gives this error message:

sed: -e expression #1, char 23: Invalid preceding regular expression

Any help will be much appreciated.

Edit: As pointed out by CompuChip, I used the wrong test tool: sed does not appear to support non-greedy modifiers. The regex engine I am actually using is boost::regex_match(), which gives me correct results with a regex such as `/?(.*?)/?`. So I would like to close this question.
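To make the behaviour described in the edit concrete, here is a minimal sketch. It is an assumption-laden stand-in, not the asker's code: Python's `re.fullmatch` is used in place of `boost::regex_match()` because both require the pattern to consume the whole input, and both support the non-greedy `.*?`. The function name `capture_path` is hypothetical.

```python
import re

# Pattern quoted in the edit above. The lazy (.*?) gives up a trailing
# slash to the final /? whenever that still lets the whole input match.
PATTERN = re.compile(r"/?(.*?)/?")


def capture_path(url_path):
    """Return group 1: the path minus one leading and one trailing slash."""
    # fullmatch only succeeds if the pattern covers the entire string,
    # which is the assumption that mirrors boost::regex_match().
    match = PATTERN.fullmatch(url_path)
    if match is None:
        raise ValueError("no match for %r" % url_path)
    return match.group(1)


if __name__ == "__main__":
    # The four inputs from the table in the question.
    for path in ["/a/b/c/", "/a/b/c", "/", ""]:
        print(repr(path), "->", repr(capture_path(path)))
```

Run on the four inputs from the table, this should yield `a/b/c`, `a/b/c`, an empty string, and an empty string respectively.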

Masao Liu
  • How about `^/?(.*)/?$`, i.e. matching the beginning and end of the string explicitly? – CompuChip Nov 12 '13 at 10:34
  • `echo /a/b/c/ | sed -r 's:^/?(.*)/?$:\1:g'` gives `/a/b/c/`. Maybe it is the greedy `.*` that retains the trailing `/`. – Masao Liu Nov 12 '13 at 11:40
  • I think the problem is that the echo includes a space. At least on my Windows system with GNU tools, `echo /a/b/c/ | sed -r s:^/?(.*)/?$:{\1}:g` returns `{a/b/c/ }`. – CompuChip Nov 12 '13 at 12:30
  • Strange! In Debian Wheezy I get `{a/b/c/}` with exactly the same input and regex as yours. If I save the only line `/a/b/c/` in file `input` and then `sed -r 's:^/?(.*)/?$:{\1}:g' input`, I get `{a/b/c/}`, too. – Masao Liu Nov 12 '13 at 13:53
  • So I think the problem is related, as you say, to greedy matching. If the string ends in `/`, apparently `sed` prefers putting that inside the match and not matching on the final `/?$`. Unfortunately, it is not possible to disable greedy matching in `sed` (http://stackoverflow.com/questions/1103149/non-greedy-regex-matching-in-sed) but since I read below that you are actually using boost, perhaps `^/?(.*?)/?$` will work (i.e. my original suggestion with a non-greedy `?` modifier added). – CompuChip Nov 12 '13 at 14:02
  • @CompuChip Thank you for the correct answer! Using the wrong test tool, `sed`, cost me a full day. – Masao Liu Nov 12 '13 at 14:47
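
The greedy versus non-greedy difference discussed in these comments can be sketched as follows. This uses Python's `re` module purely as an example of a backtracking engine that supports `.*?`; it is not a claim about how `sed` or Boost resolve the match internally, and the helper name `first_group` is hypothetical.

```python
import re

# CompuChip's two suggestions: the greedy original and the lazy variant.
GREEDY = re.compile(r"^/?(.*)/?$")
LAZY = re.compile(r"^/?(.*?)/?$")


def first_group(pattern, text):
    """Return capture group 1, or None if the pattern does not match."""
    match = pattern.search(text)
    return None if match is None else match.group(1)


if __name__ == "__main__":
    for text in ["/a/b/c/", "/a/b/c", "/", ""]:
        print(repr(text),
              "greedy:", repr(first_group(GREEDY, text)),
              "lazy:", repr(first_group(LAZY, text)))
```

In this engine the greedy `.*` swallows the trailing slash of `/a/b/c/` and leaves the optional `/?` empty, while the lazy `.*?` stops early enough for `/?` to take it.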

2 Answers

Try the following sed command:

sed -r 's:^/|/$::'

Short Description

Match : ^/|/$ means ^/ or /$, i.e. a leading or a trailing slash

Replace : (empty) i.e. trim the match

Test

$ cat file
/a/b/c/
/a/b/c
/

$ sed -r 's:^/|/$::' file
a/b/c/
a/b/c
jkshah
  • As more than one match is possible you need to add the `g` flag to cater for this condition. – potong Nov 12 '13 at 11:00
  • @potong As per my understanding, `g` flag will do global match in a single line. Here `^/` or `/$` can happen only once in every line. Hence I don't see that requirement although adding `g` won't hurt. Please correct me if I missed something. Counter-example would be helpful – jkshah Nov 12 '13 at 11:36
  • I should have mentioned that I must use group, capture, and back reference. In fact I am using boost `regex_match`. – Masao Liu Nov 12 '13 at 11:44
  • @MasaoLiu Can't you use boost `regex_replace` instead and do the above replacement? Once that is done, every output string equals the required capture. Aren't you solving this the hard way when it is much easier the other way round? Am I missing something? By the way, you have wrongly tagged `sed` and even shown `sed` examples. – jkshah Nov 12 '13 at 12:39
  • @jkshah Oops! I thought that mentioning `boost regex_match` instead of the framework I am actually using, `cppCMS`, would narrow down my issue. `cppcms::application::attach()` expects a regular expression as a parameter. Having received the URL from a client, the framework internally calls the related function according to the characters captured by `boost regex_match()` with the aforementioned regular expression. By the way, I have just removed the `sed` tag. – Masao Liu Nov 12 '13 at 14:23
  • @jkshah the title says `Regex to trim off leading and trailing slashes and catch the rest` and the first example `/a/b/c/`conveys this. – potong Nov 12 '13 at 18:39
  • @potong I agree. My point here was, output of `sed` can be easily captured in array of string. Anyways if `sed` is not the right tool, I might be out of league. – jkshah Nov 12 '13 at 18:42
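
The point about the `g` flag, and the replace-instead-of-capture idea from the comments, can be illustrated with a small sketch. Python's `re.sub` is used here as a stand-in: `count=1` is assumed to play the role of `sed`'s default (first match only), and the default `count` the role of the `g` flag. The function names are hypothetical.

```python
import re

# Same alternation as in the answer: a leading slash or a trailing slash.
EDGE_SLASH = re.compile(r"^/|/$")


def trim_first_only(path):
    """Replace only the first match, like s:^/|/$:: without the g flag."""
    return EDGE_SLASH.sub("", path, count=1)


def trim_all(path):
    """Replace every match, like s:^/|/$::g."""
    return EDGE_SLASH.sub("", path)


if __name__ == "__main__":
    for path in ["/a/b/c/", "/a/b/c", "/", ""]:
        print(repr(path),
              "first only:", repr(trim_first_only(path)),
              "all:", repr(trim_all(path)))
```

For `/a/b/c/` the first-only variant leaves the trailing slash in place, which matches the `a/b/c/` line in the answer's test output; replacing all matches removes both slashes.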

If the input consists only of entries of this kind (i.e. the path is not embedded inside a longer string):

sed 's#$#/#;s#^[^/].*##;s#/*$##;s#^/##'

This does not guard against input such as //bad/path/
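
The four substitutions in that command can be read step by step. The sketch below is a Python transliteration written only to make each step visible; it is an interpretation of the one-liner, not the answerer's code, and the function name `trim_like_sed_chain` is hypothetical.

```python
import re


def trim_like_sed_chain(line):
    """Apply the four sed substitutions in order to a single line."""
    # s#$#/#       : append a slash, so an empty line becomes "/"
    line = line + "/"
    # s#^[^/].*##  : blank out any line that does not start with a slash
    line = re.sub(r"^[^/].*", "", line, count=1)
    # s#/*$##      : strip every trailing slash
    line = re.sub(r"/*$", "", line, count=1)
    # s#^/##       : strip one leading slash
    line = re.sub(r"^/", "", line, count=1)
    return line


if __name__ == "__main__":
    for text in ["/a/b/c/", "/a/b/c", "/", "", "a/b", "//bad/path/"]:
        print(repr(text), "->", repr(trim_like_sed_chain(text)))
```

Under this reading, input that does not begin with a slash is emptied rather than passed through, and `//bad/path/` comes out as `/bad/path` because only one leading slash is removed.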

NeronLeVelu