I have folders in a directory with names giving specific information. For example:

[allied]_remarkable_points_[treatment]

[nexus]_advisory_plans_[inspection]

....

So the names follow this structure: [company]_title_[topic]. The script has to parse the folder name into variables in order to extract that information:

COMPANY='[allied]';
TITLE='remarkable points'
TOPIC='[treatment]'

The folder names do not have a fixed length, so I can't use indexed matching in the script. I managed to extract $TITLE and $TOPIC, but I can't manage to match the first string: the variable gives me back the complete folder name.

FOLDERNAME=${PWD##*/}

This is the line that is giving me grief:

COMPANY=`expr $FOLDERNAME : '\(\[.*\]\)'`

I tried to avoid the greedy behaviour by placing ? in the regular expression:

COMPANY=`expr $FOLDERNAME : '\(\[.*?\]\)'`

but as soon as I do that, it returns nothing.

Any ideas?

marbu
GLR
  • An example input/output pair is worth a million words (of course more than one is better). – 4ae1e1 Nov 13 '15 at 06:01
  • Quote `$FOLDERNAME` to prevent the brackets from being treated as glob characters after the expansion, but see my answer for how to avoid `expr` altogether in `bash`. – chepner Nov 13 '15 at 13:31

3 Answers

Bash has built-in string manipulation functionality.

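for f in *; do
    company=${f%%\]*}      # drop everything from the first ] onward
    company=${company#\[}  # strip off leading [
    topic=${f##*\[}        # drop everything up to the last [
    topic=${topic%\]}      # strip off trailing ]
    :                      # placeholder: use $company and $topic here
done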

The construct ${variable#pattern} removes the shortest prefix matching the glob pattern from the value of variable and returns the resulting string. Doubling the # removes the longest possible match instead of the shortest. Using % (or %%) removes a suffix instead of a prefix.
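To make the four operators concrete, here is a small sketch applied to one of the folder names from the question. The variable names and the hard-coded value are illustrative assumptions, not part of the original script; the last three lines show one possible way to get the title with spaces instead of underscores.

```shell
#!/usr/bin/env bash
# Illustrative value only; in practice this would come from ${PWD##*/} or a loop.
name='[allied]_remarkable_points_[treatment]'

short_prefix=${name#*_}    # remove shortest prefix matching *_
long_prefix=${name##*_}    # remove longest prefix matching *_
short_suffix=${name%_*}    # remove shortest suffix matching _*
long_suffix=${name%%_*}    # remove longest suffix matching _*

# Assumed approach for the title: peel off "[company]_" and "_[topic]".
title=${name#*\]_}         # drop everything through the first "]_"
title=${title%_\[*}        # drop everything from the last "_[" onward
title=${title//_/ }        # bash-only: replace every underscore with a space

printf '%s\n' "$short_prefix" "$long_prefix" "$short_suffix" "$long_suffix" "$title"
```

Here `long_suffix` ends up as `[allied]` and `long_prefix` as `[treatment]`, which is the company/topic split the question asks for, brackets included.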

If for some reason you do want to use expr, the reason your non-greedy regex attempt doesn't work is that the .*? syntax is a Perl-style extension, significantly newer than anything related to expr. expr uses POSIX basic regular expressions, in which ? is an ordinary character, so your pattern looks for a literal question mark before the closing bracket and matches nothing. In fact, if you are using Bash, you should probably not be using expr at all: Bash provides superior built-in features for every use case where expr once made sense, back in the distant past when the sh shell had no built-in regex matching or arithmetic.

Fortunately, though, it's not hard to get non-greedy matching in this isolated case. Just change the regex to not match on square brackets.

COMPANY=`expr "$FOLDERNAME" : '\(\[[^][]*\]\)'`

(The closing square bracket needs to come first within the negated character class; in any other position, a closing square bracket closes the character class. Many newbies expect to be able to use backslash escapes for this, but that's not how it works. Notice also the addition of double quotes around the variable.)
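A quick way to see the difference is to run the patterns side by side. This is only a sketch: the variable names are made up for illustration, and the sample value is taken from the question.

```shell
#!/usr/bin/env bash
# Sample folder name from the question; hard-coded here for illustration.
name='[allied]_remarkable_points_[treatment]'

# Greedy: .* runs to the LAST closing bracket, so the whole name matches.
greedy=$(expr "$name" : '\(\[.*\]\)')

# Perl-style .*? is not understood; ? is a literal character in a basic
# regular expression, so nothing matches and expr exits non-zero.
literal_qm=$(expr "$name" : '\(\[.*?\]\)') || true

# Negated bracket class: stops at the FIRST closing bracket.
bounded=$(expr "$name" : '\(\[[^][]*\]\)')

printf 'greedy:     %s\n' "$greedy"
printf 'literal ?:  %s\n' "$literal_qm"
printf 'bounded:    %s\n' "$bounded"
```

Only the last form yields `[allied]`; the first returns the complete folder name, which is the behaviour described in the question.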

Community
tripleee

expr isn't needed for regular-expression matching in bash.

[[ $FOLDERNAME =~ (\[[^]]*\]) ]] && COMPANY=${BASH_REMATCH[1]}

Use [^]]* instead of .* to get the effect of a non-greedy match on the bracketed portion. A bigger regular expression can capture all three parts. The middle group uses .* rather than [^_]* because the title itself contains underscores:

[[ $FOLDERNAME =~ (\[[^]]*\])_(.*)_(\[[^]]*\]) ]] && {
    COMPANY=${BASH_REMATCH[1]}
    TITLE=${BASH_REMATCH[2]}
    TOPIC=${BASH_REMATCH[3]}
}
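If the parsing is needed in more than one place, the same idea can be wrapped in a function. The following is a minimal sketch, not part of the original answer: `parse_folder` is a hypothetical helper name, the regex is anchored at both ends as an extra assumption, the middle group is `.*` so that titles containing underscores still match, and the underscores in the title are turned into spaces to match the output shown in the question.

```shell
#!/usr/bin/env bash
# parse_folder is a hypothetical helper, named here for illustration only.
# On success it sets the globals COMPANY, TITLE and TOPIC; on a name that
# does not follow [company]_title_[topic] it returns 1 and sets nothing.
parse_folder() {
    # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
    local re='^(\[[^]]*\])_(.*)_(\[[^]]*\])$'
    [[ $1 =~ $re ]] || return 1
    COMPANY=${BASH_REMATCH[1]}
    TITLE=${BASH_REMATCH[2]//_/ }   # underscores become spaces
    TOPIC=${BASH_REMATCH[3]}
}

if parse_folder '[nexus]_advisory_plans_[inspection]'; then
    printf '%s | %s | %s\n' "$COMPANY" "$TITLE" "$TOPIC"
fi
```

In the real script the argument would be "${PWD##*/}" (quoted) rather than a literal string.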
chepner

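If you're not averse to using grep (GNU grep, since -P selects Perl-compatible regexes), then pass the name on standard input rather than as a file argument:

COMPANY=$(grep -Po '^\[.*?\]' <<< "$FOLDERNAME")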
madsen