I would like to split words that have hyphens in them using sed. Hyphens which are not inside words should stay as-is. For example, for the sentence:

"the multi-modal solution is an award-winning approach in the 21st-century - however"

I would like the output:

"the multi @-@ modal solution is an award @-@ winning approach in the 21st @-@ century - however"

I tried using:

sed 's/\([a-zA-Z0-9]+\)-\([a-zA-Z0-9]+\)/\1 @-@ \2/g' test.txt > test2.txt

Without success. I'm using the OSX version of sed.

Roee Aharoni
  • does your input have `-` other than between words? otherwise for given sample input, a simple `sed 's/-/ @-@ /g'` would work – Sundeep Feb 16 '17 at 10:41
  • the problem with your attempt is that `+` is not a meta character for default BRE... GNU sed would allow to use `\+` – Sundeep Feb 16 '17 at 10:44
  • my input has `-` which are not between words or in the end of words, which i wouldn't like to split – Roee Aharoni Feb 16 '17 at 11:16

4 Answers

You can use this non-regex implementation using awk:

s="the multi-modal solution is an award-winning approach in the 21st-century"
awk -F '-' -v OFS=' @-@ ' '{$1=$1} 1' <<< "$s"

the multi @-@ modal solution is an award @-@ winning approach in the 21st @-@ century

Reference: Effective AWK Programming

Sed solution (works on OSX):

sed -E 's/([^-[:blank:]]+)-([^-[:blank:]]+)/\1 @-@ \2/g' <<< "$s"
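As a quick check, here is a minimal sketch that runs the same sed -E command on the full sentence from the question, including the standalone hyphen. The variable name result is only for illustration, and printf is used instead of a here-string so the snippet does not depend on bash:

```shell
s="the multi-modal solution is an award-winning approach in the 21st-century - however"

# A hyphen is only rewritten when it has a non-blank, non-hyphen run on both sides,
# so the free-standing " - " is left alone.
result=$(printf '%s\n' "$s" | sed -E 's/([^-[:blank:]]+)-([^-[:blank:]]+)/\1 @-@ \2/g')
printf '%s\n' "$result"
```

The standalone hyphen survives because the character before it is a blank, which the bracket expression excludes.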
anubhava

To complement the sed -E solution in anubhava's answer with a fixed version of your own solution attempt:

sed 's/\([a-zA-Z0-9]\{1,\}\)-\([a-zA-Z0-9]\{1,\}\)/\1 @-@ \2/g' test.txt > test2.txt

That is, the ERE (extended regex) quantifier construct + must be emulated with \{1,\} in a BRE (basic regex), which sed uses by default.
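To make the difference visible, here is a small sketch comparing the original attempt with the fixed command on a shortened sample sentence. The variable names broken and fixed are just for illustration:

```shell
s="the multi-modal solution - however"

# Original attempt: in a BRE, an unescaped + is a literal plus sign,
# so the pattern looks for "+" characters and nothing in the input matches.
broken=$(printf '%s\n' "$s" | sed 's/\([a-zA-Z0-9]+\)-\([a-zA-Z0-9]+\)/\1 @-@ \2/g')

# Fixed: \{1,\} is the POSIX BRE way to say "one or more".
fixed=$(printf '%s\n' "$s" | sed 's/\([a-zA-Z0-9]\{1,\}\)-\([a-zA-Z0-9]\{1,\}\)/\1 @-@ \2/g')

printf '%s\n%s\n' "$broken" "$fixed"
```

The first line of output should be the input unchanged, the second the sentence with "multi @-@ modal".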


Optional background information

As Sundeep points out in a comment on the question, GNU sed allows use of \+ (when not using -r / -E, which enables support for EREs), but that is a nonstandard extension not supported by the macOS sed version.

The sed POSIX spec only supports BREs, specifically, POSIX BREs.

Therefore, to write portable sed commands:

  • Use neither -r (GNU sed and more recent versions of BSD sed) nor -E (both GNU and BSD/macOS sed)

  • Use only POSIX BRE features, avoiding implementation-specific extensions, notably:

    • Use \{1,\} instead of \+ (the equivalent of ERE +).
    • Use \{0,1\} instead of \? (the equivalent of ERE ?).
    • Avoid GNU's \| for alternation: unfortunately, POSIX BREs do not support alternation at all.
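The BRE-only rules above can be sketched as follows. The sample strings and variable names are made up for illustration, and running one s command per alternative is only one possible workaround for the missing alternation, not something the POSIX spec prescribes:

```shell
# \{0,1\} stands in for ERE ? (zero or one occurrence of the preceding atom).
optional=$(printf 'color colour\n' | sed 's/colou\{0,1\}r/hue/g')

# POSIX BREs have no alternation; one workaround is a separate s command per alternative.
either=$(printf 'cat dog bird\n' | sed -e 's/cat/pet/g' -e 's/dog/pet/g')

printf '%s\n%s\n' "$optional" "$either"
```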

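To take advantage of the more powerful, modern-syntax EREs while supporting platforms with both GNU and BSD sed (including macOS):

  • Use -E rather than -r, since -E is supported by both GNU and BSD/macOS sed.

  • Use only POSIX ERE features, avoiding implementation-specific extensions.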
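A minimal sketch of the ERE route, assuming a sed that accepts -E (both GNU and BSD/macOS sed, as noted earlier). The sample strings and variable names are illustrative only:

```shell
s="the multi-modal solution - however"

# With -E, + and the grouping parentheses need no backslashes.
plus=$(printf '%s\n' "$s" | sed -E 's/([a-zA-Z0-9]+)-([a-zA-Z0-9]+)/\1 @-@ \2/g')

# Alternation with | is part of POSIX EREs, unlike BREs.
alt=$(printf 'cat dog bird\n' | sed -E 's/cat|dog/pet/g')

printf '%s\n%s\n' "$plus" "$alt"
```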


To learn about a given sed implementation's specific (nonstandard) regex features:

  • GNU Sed (Linux):

    • info sed, as of GNU Sed 4.2.2, explains

      • GNU BRE syntax in chapter "3.3 Overview of Regular Expression Syntax"

        • BRE extensions are \+, \?, and \|; that a** is treated the same as a* (without having to escape the 2nd *) is only true for EREs.
      • GNU ERE syntax in "Appendix A Extended regular expressions".

        • However, only the contrast with BREs is discussed, and the many ERE extensions - among them character-class shortcuts such as \w and \s, word-boundary assertions such as \< / \> and \b, control-character escape sequences in addition to \n, such as \t, and codepoint-based escape sequences such as \x27 - are not mentioned there.
    • (By contrast, man re_format / man 7 regex contain only POSIX info.)

  • BSD / macOS Sed:

    • man re_format does apply (discusses both BREs and EREs), except for the section about enhanced features, which aren't supported.
    • The only extensions mentioned are word-boundary assertions [[:<:]] and [[:>:]]

For a comprehensive overview of all differences between GNU Sed and BSD Sed, see this answer of mine.

mklement0

This might work for you (GNU sed):

sed 's/\>-\</ @-@ /g' file

Replace hyphens surrounded by end/start of word boundaries with the required string.
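For illustration, a sketch of that command applied to the full sentence from the question. It assumes GNU sed, since \> and \< are GNU extensions and are not expected to work with the BSD/macOS sed; the variable name result is made up:

```shell
s="the multi-modal solution is an award-winning approach in the 21st-century - however"

# \> matches at the end of a word and \< at the start of one (GNU sed only),
# so only a hyphen sitting directly between two words is replaced.
result=$(printf '%s\n' "$s" | sed 's/\>-\</ @-@ /g')
printf '%s\n' "$result"
```

The free-standing " - " has blanks on both sides, so neither boundary assertion holds there and it stays untouched.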

potong

s="the multi-modal solution is an award-winning approach in the 21st-century - however"
awk -F century '{gsub(/-/," @&@ ",$1)}1'  <<< "$s" OFS=century

the multi @-@ modal solution is an award @-@ winning approach in the 21st @-@ century - however
Claes Wikner