Extracting text with sed and conditionally present white space

Question

I am trying to dynamically find a directory to be used programmatically later on in a script. The problem I am having is accounting for white space, which may or may not be there.

I am using the following example because it outputs three strings separated by spaces (paths in this case). Let's assume I want to get the directory for the man pages for a particular command (forgetting, for a moment, that there are builtin ways to do this) using whereis:

$ whereis bash
bash: /bin/bash /usr/local/man/man1/bash.1.gz /usr/ports/shells/bash

I would like to extract any one of the directories. Using sed, I came up with the following:

$ whereis bash | sed -En 's:.*[" "](.*man.*)[" "].*:\1:p'
/usr/local/man/man1/bash.1.gz

That works great if the pattern happens to be in the middle, but if it happens to be at the beginning or end of the string, I have to remove the space from the pattern to get it to work (using "port" for the pattern as an example):

$ whereis bash | sed -En 's:.*[" "](.*port.*)[" "].*:\1:p'

$ whereis bash | sed -En 's:.*[" "](.*port.*).*:\1:p'
/usr/ports/shells/bash

The same holds true if I wanted to extract the directory with the pattern "bin" in it.

How do I "tell" sed that the pattern may or may not contain a certain character?

Why am I doing this?

When I try it without the spaces, I get the following:

$ whereis bash | sed -En 's:.*(.*man.*).*:\1:p'
man1/bash.1.gz /usr/ports/shells/bash

I don't get the full path of the text I wanted and it adds on a path that I totally don't want. The space is a delimiter.

I have used this post: How to output only captured groups with sed? and this post: sed - how to do regex groups using sed as reference and a jumping off point.

Also of note, I tried using the regex \s for white space, but it was ignored. I'm also on FreeBSD, so I am using -E for extended regular expressions.

If there's another way to approach this, a point in the right direction would be greatly appreciated; I'm very new to working with sed and awk.

If you are looking for just man pages, it might help to use `-m` as in `whereis -m bash` so that only man pages are returned. (The util-linux `whereis` has a `-m` option. FreeBSD may, of course, vary.) — John1024, Mar 05 '18 at 22:37
@John1024 That's a good tip. FreeBSD has that flag. In this question though, I just happen to be using `whereis`. It could be any command that returns a number of strings separated by spaces. — Allan, Mar 05 '18 at 22:41
So, to summarize, for some unspecified command that returns a space-separated list of items where each item may include spaces, you want to separate the items? — John1024, Mar 05 '18 at 22:49
Since it seems you are using bash, this could save a lot of effort: `read -a arr < <(whereis bash)`. It will create an array named `arr` using the bash IFS as delimiter (space, newline and tab). You can loop over the array as usual with bash: `printf '%s\n' "${arr[@]}"` — George Vasiliou, Mar 05 '18 at 23:49
@GeorgeVasiliou - I'm actually using `sh` and it unfortunately doesn't support arrays. — Allan, Mar 06 '18 at 00:38
@John1024 - correct. It could return directory paths or URLs. In this case, I am looking for directory paths. — Allan, Mar 06 '18 at 00:39
Could any of the values themselves contain spaces (or newlines, or tabs)? You are assuming not; it is likely safe to do so, but it should be a conscious decision, not an accidental one. Does it have to be `sed`? It would be very easy in `awk`: `whereis bash | awk '{for (i = 1; i <= NF; i++) if ($i ~ /man/) print $i }'` would print those (blank/tab separated) fields that match `man`; you could make the match more precise by looking for `/\/man[a-z0-9]*\//` which demands a name such as `/usr/share/man/man1/bash.1.gz` (the pattern matches twice on this name). — Jonathan Leffler, Mar 06 '18 at 01:42
@JonathanLeffler - when I put in a literal space surrounded by the square brackets, it works. See my example. — Allan, Mar 06 '18 at 11:26

Quetza · Answer 1 · 2018-03-07T12:18:39.333

sed might not be the right tool for this task. You could iterate over the output with something like:

# Word-split the output on white space and test each item in turn
for f in $(whereis bash); do
    printf '%s\n' "$f" | grep /man/
done

To solve the specific whereis question, it is better to use the built-in FreeBSD options to return the binary, man page or source with -b, -m and -s. Combine one of them with the -q (quiet) option and you get output designed for use in scripts. So:

 whereis -mq bash

will return /usr/local/man/man1/bash.1.gz

If your use case is something else and you absolutely must use sed, this should give what you're looking for:

whereis bash | sed -E 's|^.*[[:space:]]+([^[:space:]]+man[^[:space:]]+).*$|\1|'

FreeBSD 11 regular expressions are IEEE Std 1003.2 (POSIX.2) compliant, and that standard does not support the \s and \S notation. As such, you need to use the [[:space:]] character class. More information can be found in the re_format(7) man page.
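
One way to make the leading white space optional, sketched under the assumption that the search string is passed as a shell argument and contains no regex metacharacters or colons: wrap "everything up to a space" in an optional group, `(.*[[:space:]])?`, so the match also succeeds when the wanted item is first on the line. The trailing `.*` already matches the empty string, so an item at the end of the line needs nothing extra. `extract_field` is a hypothetical name used only for illustration.

```shell
# extract_field PATTERN  (hypothetical helper, not a standard utility)
# Reads lines on stdin and prints the white-space-delimited item that
# contains PATTERN. Assumes PATTERN has no regex metacharacters or ':'.
extract_field() {
    # Group 1 (optional): anything up to and including a white-space char.
    # Group 2: the run of non-space characters that contains PATTERN.
    sed -En 's:^(.*[[:space:]])?([^[:space:]]*'"$1"'[^[:space:]]*).*$:\2:p'
}

line='bash: /bin/bash /usr/local/man/man1/bash.1.gz /usr/ports/shells/bash'
printf '%s\n' "$line" | extract_field bin    # /bin/bash
printf '%s\n' "$line" | extract_field man    # /usr/local/man/man1/bash.1.gz
printf '%s\n' "$line" | extract_field port   # /usr/ports/shells/bash
```

If several items contain the pattern, the greedy `.*` in the first group means the last one on the line is printed.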

That's not my specific question. I edited it so it's much clearer. — Allan, Mar 06 '18 at 11:30

score 0 · Answer 2 · answered Mar 05 '18 at 22:49

If you want to use regular expressions, you need to consider that they are "greedy" (the * tries to match as far as possible), so you need to limit that by looking for whitespace before the expression (which can be done with a \s) and only continuing on the expression while you see non-whitespace (which can be done with \S).

So this should work:

whereis bash | sed -En 's:.*\s(\S*man\S*).*:\1:p'

Though I find you can more easily handle this in a bash function, in which case you can handle words one at a time and you can do the matching using simpler globs rather than regexes.

For example:

find_manpage() {
    local tool=$1
    local path
    set -- $(whereis "${tool}")
    for path ; do
        if [[ "${path}" == *man* ]] ; then
            echo "${path}"
            return 0
        fi
    done
    return 1
}

And use it like:

find_manpage bash

Or:

manpage_path=$(find_manpage bash)

You could easily extend that function to take a "pattern" as second argument and match on that too, making it more general than just finding the manpage.
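
One possible shape for that extension, written for plain `sh` (no `local`, no `[[ ]]`) since the question mentions `sh`. `find_match` is a hypothetical name; it takes the pattern first and then the command to run, and it assumes the items in the output contain no white space or glob characters.

```shell
# find_match PATTERN COMMAND [ARGS...]  (hypothetical helper)
# Runs COMMAND, splits its output on white space, and prints the first
# item that contains PATTERN. Returns 1 if nothing matches.
# Note: plain sh has no 'local', so 'pattern' and 'word' are global.
find_match() {
    pattern=$1
    shift
    # Unquoted on purpose: the shell splits the output into words on IFS.
    for word in $("$@"); do
        case $word in
            *"$pattern"*)
                printf '%s\n' "$word"
                return 0
                ;;
        esac
    done
    return 1
}

# Example use (output depends on the system):
#   manpage_path=$(find_match man whereis bash)
```

Because the pattern is quoted inside the `case` glob, it is matched as a literal substring rather than as a glob of its own.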

I hope this helps!

I tried the command you specified and it unfortunately didn't work. I get no result. — Allan, Mar 06 '18 at 00:41
@Allan That's odd... I wonder if your `sed -E` does the same as the one I have (I'm on Linux.) In case it doesn't support it, you can use an actual space and a negated group for the non-space part: `whereis bash | sed -En 's:.* ([^ ]*man[^ ]*).*:\1:p'`. I still think the shell function is a superior solution, more elegant... Did you try that one? — filbranden, Mar 06 '18 at 04:56
