How to split a string on a multi-character delimiter in Bash?

Question

Why doesn't the following Bash code work?

for i in $( echo "emmbbmmaaddsb" | split -t "mm"  )
do
    echo "$i"
done

Expected output:

e
bb
aaddsb

...huh? That's not what `split` does at all. As in, **completely** unrelated to its actual function. — Charles Duffy, Nov 18 '16 at 22:34
Do you *want* to know how to split an arbitrary string on an arbitrary multi-character separator in bash? Why not edit your question to ask that instead, if it's what you really want to know? — Charles Duffy, Nov 18 '16 at 22:36
`split` splits a file into a bunch of smaller files. Not names written to stdout, like your script expects, but actual files. And `-t` provides a single character it uses to determine where records begin and end, and thus to do those splits on record boundaries. — Charles Duffy, Nov 18 '16 at 22:39
Of course not, BECAUSE YOU'RE EXPECTING NAMES WRITTEN TO STDOUT. I already told you it doesn't write names to stdout. — Charles Duffy, Nov 18 '16 at 22:41
If nothing's written to stdout, nothing gets captured by a command substitution. — Charles Duffy, Nov 18 '16 at 22:41
Yes, it can read from a pipe. It still doesn't write to stdout, and thus still doesn't generate content that command substitution will read. — Charles Duffy, Nov 18 '16 at 22:42
Writing content into separate files no larger than a given maximum size each is the purpose that `split` exists for. Have you considered that maybe what you want might be a tool other than `split`, since that's not what you're trying to do? — Charles Duffy, Nov 18 '16 at 22:45
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128498/discussion-between-v217-and-charles-duffy). — v217, Nov 18 '16 at 22:45
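
To make the comments above concrete, here is a minimal sketch of what `split` does with piped input. It assumes a POSIX-style `split` and a `mktemp` that supports `-d`; the `part.` prefix and the variable names are purely illustrative.

```bash
#!/usr/bin/env bash
# Work in a throwaway directory so the generated files are easy to inspect.
tmp=$(mktemp -d)

# split reads stdin ("-") and writes one file per input line (-l 1),
# naming them with the given prefix: part.aa, part.ab, part.ac
captured=$(printf 'e\nbb\naaddsb\n' | split -l 1 - "$tmp/part.")

# Nothing was written to stdout, so the command substitution is empty.
printf 'captured: [%s]\n' "$captured"

# The pieces are on disk instead.
files=( "$tmp"/part.* )
file_count=${#files[@]}
printf 'files written: %s\n' "$file_count"

rm -rf "$tmp"
```

This is why the `for` loop in the question never runs its body: there is nothing on stdout for `$( ... )` to capture.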

score 12 · Answer 1 · answered May 05 '17 at 16:10

The usual tool for character substitution is sed's s/regexp/replacement/ command, which replaces the first occurrence of regexp on each line; add the g flag (s/regexp/replacement/g) to replace every occurrence. You do not even need a loop or variables.

Pipe your echo output into sed and substitute the characters mm with the newline character \n:

echo "emmbbmmaaddsb" | sed 's/mm/\n/g'

The output is:

e
bb
aaddsb
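
If the pieces are needed inside a loop, as in the question, one possible sketch is to read sed's output line by line. As far as I know, `\n` in the replacement is a GNU sed extension, so this version uses a backslash followed by a literal newline, which should also work with other sed implementations. The `pieces` array is just an illustrative name, and the approach assumes the input itself contains no newlines.

```bash
#!/usr/bin/env bash
pieces=()

# Process substitution keeps the loop in the current shell,
# so the array is still populated after the loop ends.
while IFS= read -r piece; do
  pieces+=( "$piece" )
done < <(printf '%s\n' 'emmbbmmaaddsb' | sed 's/mm/\
/g')

# Print each piece in brackets to make empty pieces visible.
printf '[%s]\n' "${pieces[@]}"
```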

John Goofy

"Recommended"? See [BashFAQ #100](http://mywiki.wooledge.org/BashFAQ/100) for best-practice guidance on doing string manipulation in bash. You'll note that parameter expansion is generally considered the best-practice approach for short inputs (whereas the `echo | sed` approach, while terse, has a great deal of overhead in terms of how it's implemented under-the-hood -- requiring, typically, two forks, a mkfifo, an `execv` of an external tool which needs to be linked-and-loaded, etc). – Charles Duffy Aug 31 '17 at 10:03
...if you were in a tight loop processing input line-by-line, for instance (or iterating over a glob result with hundreds or thousands of filenames), calling `echo | sed` for each line would *absolutely* be an antipattern. (Calling `sed` *just once* to process the entire incoming stream, by contrast, is often appropriate). – Charles Duffy Aug 31 '17 at 10:07

Charles Duffy · Accepted Answer · 2016-11-18T23:02:59.143

Since you're expecting newlines, you can simply replace all instances of mm in your string with a newline. In pure native bash:

in='emmbbmmaaddsb'
sep='mm'
printf '%s\n' "${in//$sep/$'\n'}"
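
If you want the pieces in an array rather than printed, a sketch along the same lines could feed the expansion to `mapfile`. This assumes bash 4 or later and an input that contains no newlines of its own; the array name `parts` and the helper variable `nl` are only illustrative.

```bash
#!/usr/bin/env bash
in='emmbbmmaaddsb'
sep='mm'
nl=$'\n'

# Quoting "$sep" makes the delimiter match literally rather than as a glob.
# mapfile reads one array element per line; -t strips the trailing newlines.
mapfile -t parts <<< "${in//"$sep"/$nl}"

printf 'number of parts: %s\n' "${#parts[@]}"
printf '[%s]\n' "${parts[@]}"
```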

If you wanted to do such a replacement on a longer input stream, you might be better off using awk, as bash's built-in string manipulation doesn't scale well to more than a few kilobytes of content. The gsub_literal shell function (backending into awk) given in BashFAQ #21 is applicable:

# Taken from http://mywiki.wooledge.org/BashFAQ/021

# usage: gsub_literal STR REP
# replaces all instances of STR with REP. reads from stdin and writes to stdout.
gsub_literal() {
  # STR cannot be empty
  [[ $1 ]] || return

  # string manip needed to escape '\'s, so awk doesn't expand '\n' and such
  awk -v str="${1//\\/\\\\}" -v rep="${2//\\/\\\\}" '
    # get the length of the search string
    BEGIN {
      len = length(str);
    }

    {
      # empty the output string
      out = "";

      # continue looping while the search string is in the line
      while (i = index($0, str)) {
        # append everything up to the search string, and the replacement string
        out = out substr($0, 1, i-1) rep;

        # remove everything up to and including the first instance of the
        # search string from the line
        $0 = substr($0, i + len);
      }

      # append whatever is left
      out = out $0;

      print out;
    }
  '
}

...used, in this context, as:

gsub_literal "mm" $'\n' <your-input-file.txt >your-output-file.txt

arjun · Answer 3 · 2017-12-14T11:32:23.813

9

A more general example, which does not first replace the multi-character delimiter with a single-character one, is given below.

Using parameter expansions (from @gniourf_gniourf's comment):

#!/bin/bash

str="LearnABCtoABCSplitABCaABCString"
delimiter=ABC
s=$str$delimiter
array=();
while [[ $s ]]; do
    array+=( "${s%%"$delimiter"*}" );
    s=${s#*"$delimiter"};
done;
declare -p array
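
The loop above stops making progress when the remaining text is non-empty but no longer contains a complete delimiter (for example `str=a---` with `delimiter=--`). A possible variant, offered as a sketch rather than a definitive fix, loops only while the delimiter is still present and then appends whatever is left. The function name `split_on` and the global array `result` are hypothetical names chosen for this example.

```bash
#!/usr/bin/env bash
# usage: split_on STRING DELIMITER
# Fills the global array "result" with the pieces of STRING.
split_on() {
  local s=$1 delim=$2
  result=()

  # An empty delimiter cannot split anything; return the input unchanged.
  if [[ -z $delim ]]; then
    result=( "$s" )
    return
  fi

  # Consume one piece per iteration while a full delimiter remains.
  while [[ $s == *"$delim"* ]]; do
    result+=( "${s%%"$delim"*}" )   # text before the first delimiter
    s=${s#*"$delim"}                # drop that text and the delimiter
  done

  # Whatever is left (possibly empty) is the final piece.
  result+=( "$s" )
}

split_on 'LearnABCtoABCSplitABCaABCString' 'ABC'
printf '[%s]\n' "${result[@]}"
```

Because each iteration removes at least the delimiter itself, the loop always terminates. A trailing delimiter yields an empty final element.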

A cruder, index-based way:

#!/bin/bash

# main string
str="LearnABCtoABCSplitABCaABCString"

# delimiter string
delimiter="ABC"

#length of main string
strLen=${#str}
#length of delimiter string
dLen=${#delimiter}

#iterator for length of string
i=0
#length tracker for ongoing substring
wordLen=0
#starting position for ongoing substring
strP=0

array=()
while [ $i -lt $strLen ]; do
    if [ $delimiter == ${str:$i:$dLen} ]; then
        array+=(${str:strP:$wordLen})
        strP=$(( i + dLen ))
        wordLen=0
        i=$(( i + dLen ))
    fi
    i=$(( i + 1 ))
    wordLen=$(( wordLen + 1 ))
done
array+=(${str:strP:$wordLen})

declare -p array

Reference - Bash Tutorial - Bash Split String

edited Dec 14 '17 at 11:32

answered Dec 04 '17 at 12:50

arjun

This is broken (will fail if string contains glob characters or spaces, etc.). Moreover, you're not using modern Bash idioms, which makes the code look really weird. You only need a simple loop: `str="LearnABCtoABCSplitABCaABCString" delimiter=ABC s=$str$delimiter array=(); while [[ $s ]]; do array+=( "${s%%"$delimiter"*}" ); s=${s#*"$delimiter"}; done; declare -p array`. That's all. – gniourf_gniourf Dec 04 '17 at 13:09
Thank you @gniourf_gniourf for the comment. I have just started with Bash scripting, and your suggestion really helps me think in a more idiomatic way. – arjun Dec 14 '17 at 07:12
Thank you @MallikarjunM for posting your solution (coming from a fellow Bash newbie). It helped me sort out a problem of parsing strings into arrays with a multi-character delimiter, where IFS / read array weren't suitable. – MrPotatoHead Nov 12 '18 at 14:40
@gniourf_gniourf Your "simple loop" fails for `str="Nope:" delimiter="::"` – xebeche Jun 02 '19 at 01:39
@gniourf_gniourf This should work: `s="a::b:" delimiter="::" array=(); while [[ $s ]]; do array+=( "${s%%"$delimiter"*}" ); c="${array[@]: -1}"; s="${s:${#c}}"; [[ $s != "$delimiter" ]] || { array+=(""); break; }; s="${s#"$delimiter"}"; done; declare -p array` – xebeche Jun 02 '19 at 11:10
Maybe run this through http://shellcheck.net/ and fix what it identifies? – Charles Duffy Jul 26 '19 at 03:47
as of today, the first code in the question fails to terminate with `str=a---` and `delimiter=--`; the second produces incorrect output on `str=a-----` and `delimiter=--` – jhnc Aug 02 '22 at 07:24

Noam Manos · Answer 4 · 2018-07-31T12:33:20.470

With awk, you can use gsub() to replace all regex matches.

As in your question, to replace every run of two or more 'm' characters with a newline, run:

echo "emmbbmmaaddsb" | awk '{ gsub(/mm+/, "\n" ); print; }'

e
bb
aaddsb

The ‘g’ in gsub() stands for “global,” which means replace everywhere.

You can also print just the Nth piece by replacing the delimiter with a space and letting awk re-split the line into fields, for example:

echo "emmbbmmaaddsb" | awk '{ gsub(/mm+/, " " ); print $2; }'

bb
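
An alternative sketch, assuming a POSIX awk, is to pass the delimiter as the field separator with `-F`. A separator longer than one character is treated as a regular expression, so delimiters containing regex metacharacters would need escaping. Unlike the space-replacement trick, this keeps pieces that contain spaces intact. The variable name `second` is illustrative.

```bash
#!/usr/bin/env bash
# Split on the literal two-character separator "mm" and take the second field.
second=$(printf '%s\n' 'emmbbmmaaddsb' | awk -F 'mm' '{ print $2 }')
printf '%s\n' "$second"

# Print every field on its own line.
printf '%s\n' 'emmbbmmaaddsb' | awk -F 'mm' '{ for (i = 1; i <= NF; i++) print $i }'
```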
