39

I have two strings. For the sake of the example they are set like this:

string1="test toast"
string2="test test"

What I want is to find the overlap starting at the beginning of the strings. With overlap I mean the string "test t" in my above example.

# So I look for the command 
command "$string1" "$string2"
# that outputs:
"test t"

If the strings were string1="atest toast"; string2="test test" they would have no overlap since the check starts from the beginning and the "a" at the start of string1 .

pynexj
  • 19,215
  • 5
  • 38
  • 56
con-f-use
  • 3,772
  • 5
  • 39
  • 60
  • ohh man, it's good to see that others struggled with this as well :D – Karoly Horvath Aug 07 '11 at 13:40
  • @ajreal: The function provided there is rather lengthy and does not work with spaces in the strings. None the less my question is a duplicate. Sorry for that. Will post a comment there – con-f-use Aug 07 '11 at 13:42
  • 1
    Not a duplicate: the intersection needs are not the same. – jfg956 Aug 07 '11 at 14:10
  • 4
    Please do not cross post between sites! [How do I find the overlap of two strings in bash?](http://unix.stackexchange.com/q/18236) – Caleb Aug 07 '11 at 17:17

14 Answers14

34

In sed, assuming the strings don't contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'
Gilles 'SO- stop being evil'
  • 104,111
  • 38
  • 209
  • 254
jfg956
  • 16,077
  • 4
  • 26
  • 34
  • 6
    Note not all seds support "\n" in the substitute commands ([Apple's doesn't](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/sed.1.html)), but [Gnu's sed](https://www.gnu.org/software/sed/manual/sed.html) does. Readers might need to run `gsed` instead of `sed`. – outis Sep 12 '15 at 04:19
  • 2
    GNU sed also supports `\x0`, `printf '%s\x0%s' "$string1" "$string2" | sed 's/\(.*\).*\x0\1.*/\1/'` is even safer. If you're dealing with pathnames and want a common path prefix, sub in `\(.*/\)` for `\(.*\)` – jthill Jan 21 '17 at 20:02
  • @jthill has a good idea but the sed command also has to be modified to handle the newlines, something like: ``printf '%s\x0%s\n' "$string1" "$string2" | sed 'H;$!d;g;s/\`.\(.*\).*\x0\1.*/\1/'`` – Dan R Jan 15 '18 at 13:59
21

An improved version of the sed example, this finds the common prefix of N strings (N>=0):

string1="test toast"
string2="test test"
string3="teaser"
{ echo "$string1"; echo "$string2"; echo "$string3"; } | sed -e 'N;s/^\(.*\).*\n\1.*$/\1\n\1/;D'

If the strings are stored in an array, they can be piped to sed with printf:

strings=("test toast" "test test" "teaser")
printf "%s\n" "${strings[@]}" | sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'

You can also use a here-string:

strings=("test toast" "test test" "teaser")
oIFS=$IFS
IFS=$'\n'
<<<"${strings[*]}" sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}'
IFS=$oIFS
# for a local IFS:
(IFS=$'\n'; sed -e '$!{N;s/^\(.*\).*\n\1.*$/\1\n\1/;D;}' <<<"${strings[*]}")

The here-string (as with all redirections) can go anywhere within a simple command.

Community
  • 1
  • 1
ack
  • 7,356
  • 2
  • 25
  • 20
  • This is a better real world solution because one doesn't know how many strings there are and a whole array of strings needs to be processed. In my case an array of four strings containing subdirectories 25 levels deep and the first 19 levels are common. [How to quickly find the deepest subdirectory](https://askubuntu.com/a/1187625/307523) – WinEunuuchs2Unix Nov 11 '19 at 17:16
  • I like this a lot but would prefer the clever grep technique. Can this be converted to use grep?? It is too restrictive to only work on the first two lines! – Steven Lu Apr 24 '20 at 08:11
  • Nice. In [my answer](https://stackoverflow.com/questions/6973088/longest-common-prefix-of-two-strings-in-bash/66683484#66683484) to this question, I've generalized it to handle embedded newlines. – Robin A. Meade Mar 18 '21 at 02:24
  • 1
    While neat, it seems that this solution relies on a specific version of `sed`. For example, on macOS the bundled `sed` returns nothing, but the GNU version of `sed` (installed via homebrew/macports) works as expected, despite none of the options being missing on the macOS version. Any ideas what the issue might be? It produces no errors, it just doesn't return anything. – Haravikk Feb 09 '22 at 13:02
  • @Haravikk see https://stackoverflow.com/a/65245765/900078 – pynexj Mar 03 '23 at 15:10
13

Yet another variant, using GNU grep:

$ string1="test toast"
$ string2="test test"
$ grep -zPo '(.*).*\n\K\1' <<< "$string1"$'\n'"$string2"
test t
Eugene Yarmash
  • 142,882
  • 41
  • 325
  • 378
  • 1
    This appears to be more portable than the sed approaches (Linux, Mac) – kermatt Mar 17 '16 at 16:43
  • But why use the z flag? – Steven Lu Apr 24 '20 at 04:47
  • Sorry I was looking at the BSD manpage and the z flag is not relevant here. But for GNU, the z flag looks for null bytes for end of line, which means that the multiline input can be regex matched to yield what the OP wants. Not bad. – Steven Lu Apr 24 '20 at 04:57
  • On macOS, you'll need to `brew install grep` to get GNU grep as `ggrep`. My MO has been to gradually infect my mac's runtime with GNU utils by incrementally "installing them", e.g. `ln -s /usr/local/bin/gdate /usr/local/bin/date`, as my /usr/local/bin is earlier in my $PATH this way I can keep the BSD utilities untouched at e.g. `/usr/bin/date` while reducing my need to branch for Darwin in my shell scripts as they rely more and more on GNU functionalities. – Steven Lu Apr 24 '20 at 04:58
11

This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    if [[ ${1:0:$n} == ${2:0:$n} ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}

The standard toolbox includes cmp to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.

longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.

longest_common_prefix () {
  local offset prefix LC_CTYPE="${LC_ALL:=LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32)
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset/4-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}
Gilles 'SO- stop being evil'
  • 104,111
  • 38
  • 209
  • 254
  • A variant of this solution working on multibytes chars would be to use diff instead of cmp, and using as its input `printf %s "$1" | fold -w 1`. – jfg956 Aug 08 '11 at 09:06
  • @jfgagne Not quite, this would suppress newline characters. By the way, I like your sed solution, but it doesn't always work with multiline strings either. – Gilles 'SO- stop being evil' Aug 08 '11 at 11:27
  • How can this be adapted for `n` strings? – Paolo May 22 '23 at 10:18
  • @Paolo Sort the strings (e.g. using `sort`, but that requires extra work if the strings can contain newlines). Then apply any of the algorithms to the first and last string. Make sure to sort in a locale where distinct strings are not considered equivalent (the easy way is `env LC_ALL=C sort …`). – Gilles 'SO- stop being evil' May 22 '23 at 10:30
7

Grep short variant (idea borrowed from sed one):

$ echo -e "String1\nString2" | grep -zoP '^(.*)(?=.*?\n\1)'
String

Assumes string have no new line character. But easy may be tuned to use any delimiter.

Update at 2016-10-24: On modern versions of grep you may receive complain grep: unescaped ^ or $ not supported with -Pz, just use \A instead of ^:

$ echo -e "String1\nString2" | grep -zoP '\A(.*)(?=.*?\n\1)'
String
Hubbitus
  • 5,161
  • 3
  • 41
  • 47
4

Ok, in bash:

#!/bin/bash

s="$1"
t="$2"
l=1

while [ "${t#${s:0:$l}}" != "$t" ]
do
  (( l = l + 1 ))
done
(( l = l - 1 ))

echo "${s:0:$l}"

It's the same algorithm as in other languages, but pure bash functionality. And, might I say, a bit uglier, too :-)

oHo
  • 51,447
  • 27
  • 165
  • 200
Tanktalus
  • 21,664
  • 5
  • 41
  • 68
  • While this works after a fashion, it will get stuck in a (near) infinite loop if `s` is not longer than `t`, as the output of `${s:0:$l}` doesn't care if `l` is higher than the number of characters in `s` so it'll keep satisfying the loop condition. I'd personally rewrite this to use a loop like `for l in ${1..$((${#s} + 1))}` with a check to break inside. – Haravikk Feb 09 '22 at 13:51
4

Without sed, using the cmp utility to get the index of the 1st different character, and using process substitution to get the 2 strings to cmp:

string1="test toast"
string2="test test"
first_diff_char=$(cmp <( echo "$string1" ) <( echo "$string2" ) | cut -d " " -f 5 | tr -d ",")
echo ${string1:0:$((first_diff_char-1))}
jfg956
  • 16,077
  • 4
  • 26
  • 34
  • Using sed is a better solution though, as only one process need to be launched. – jfg956 Aug 07 '11 at 15:08
  • 2
    Good choice of tool, but wrong pre- and post-processing. `echo "$string1"` mangles a few strings, and you don't handle the case when one of the strings is a prefix of the other. You don't need the call to `cut` as the shell is perfectly capable of extracting the offset from the `cmp` output. A limitation of this approach is that `cmp` operates on bytes, not characters. – Gilles 'SO- stop being evil' Aug 07 '11 at 19:06
  • @Gilles: Could you show me an example where `echo` mangles a string ? In bash's man, I found an example with `echo -e "toto\ntata"`, so would it be safe to use `echo -E` (thanks for the printf example though). About the case where a string is a prefix of the other, I do not have a different output with `cmp (GNU diffutils) 2.8.1`. True about the possibility of avoiding `cut`, and totally true about not working on multibytes char. – jfg956 Aug 08 '11 at 08:10
  • 1
    Under bash, with a single argument, `echo` only mangles `^-[neE]+$`; though if the `xpg_echo` is set then `echo` also mangles backslashes. Also `echo` adds a newline, which explains why you didn't see that `foo` is a prefix of `foobar`: you were passing `foo\n` and `foobar\n` to `cut`. Try with `echo -n` or with `foo\n` and `foo\nbar`. – Gilles 'SO- stop being evil' Aug 08 '11 at 11:23
3

It's probably simpler in another language. Here's my solution:

common_bit=$(perl -le '($s,$t)=@ARGV;for(split//,$s){last unless $t=~/^\Q$z$_/;$z.=$_}print $z' "$string1" "$string2")

If this weren't a one-liner, I'd use longer variable names, more whitespace, more braces, etc. I'm also sure there's a faster way, even in perl, but, again, it's a trade-off between speed and space: this uses less space on what is already a long one-liner.

Tanktalus
  • 21,664
  • 5
  • 41
  • 68
2

Just yet another way using Bash only.

string1="test toast"
string2="test test"
len=${#string1}

for ((i=0; i<len; i++)); do
   if [[ "${string1:i:1}" == "${string2:i:1}" ]]; then
      continue
   else
      echo "${string1:0:i}"                       
      i=len
   fi
done
Community
  • 1
  • 1
chad
  • 21
  • 1
  • Minor enhancement, but you should really test the lengths of both `string1` and `string2` then select the shortest for `len`. For example, if `string1` was 'foobar' and `string2` was 'foo' then there's no need to compare more than 3 characters. You also don't really need the `then` branch of your `if`, just use `!=` and let the loop continue on its own? – Haravikk Feb 09 '22 at 14:46
1

If you have an option to install a python package, you can use this python utility

# install pythonp
pythonp -m pip install pythonp

echo -e "$string1\n$string2" | pythonp 'l1,l2=lines
res=itertools.takewhile(lambda a: a[0]==a[1], zip(l1,l2)); "".join(r[0] for r in res)'
1

Man, this is tough. It's an extremely trivial task, yet I don't know how to do this with the shell :)

here is an ugly solution:

echo "$2" | awk 'BEGIN{FS=""} { n=0; while(n<=NF) {if ($n == substr(test,n,1)) {printf("%c",$n);} n++;} print ""}' test="$1"
Lightness Races in Orbit
  • 378,754
  • 76
  • 643
  • 1,055
Karoly Horvath
  • 94,607
  • 11
  • 117
  • 176
  • This very fast, as-is, but has a couple of issues. (1) It doesn't handle mumti-byte characters. This is easily fixeded.. just change `%c` to `%s`.. (2) It reports incorrectly when two strings are identical other than one has a trailing `\n` and the other does not. In this case, the script reports the longer value... Correcting the trailing newline problem is probably not so easily fixed, as it is the behaviour of `awk` whch will append a trailing newline (which causes the problem). But, as I write this, I recall that there is a way to detect a `last-line` in `awk` (I think!). I'll check now. – Peter.O Aug 31 '12 at 04:20
  • I was thinking of `perl`'s `(eof)`, but you can prevent that final automatic output of `OFS` via [delayed handling of each input line](http://stackoverflow.com/questions/1646633/how-to-detect-eof-in-awk) .. One more point: The `echo "$2"` appends an extraneous `\n` to `$2` – Peter.O Aug 31 '12 at 04:39
  • Hi Karoly. [Again me](http://stackoverflow.com/a/6973184/938111)! Here, your script has also a similar issue: `awk 'BEGIN{FS=""} { n=0; while(n<=NF) {if ($n == substr(test,n,1)) {printf("%c",$n);} n++;} print ""}' test="/aa/bc/" <<< '/aa/bd/'` => it displays `/aa/b/` instead of `/aa/b`. Please try to improve your [tag:awk] script ;-) Cheers – oHo Sep 11 '14 at 08:46
  • This second example does not work too: `awk 'BEGIN{FS=""} { n=0; while(n<=NF) {if ($n == substr(test,n,1)) {printf("%c",$n);} n++;} print ""}' test="f/aa" <<< 'g/aa'` => it displays `/aa` instead of empty (the OP's question is about prefix). Courage and good luck ;-) Cheers – oHo Sep 11 '14 at 08:54
0

If using other languages, how about python:

cmnstr() { python -c "from difflib import SequenceMatcher
s1, s2 = ('''$1''', '''$2''')
m = SequenceMatcher(None,s1,s2).find_longest_match(0,len(s1),0,len(s2))
if m.a == 0: print(s1[m.a: m.a+m.size])"
}
$ cmnstr x y
$ cmnstr asdfas asd
asd

(h/t to @RickardSjogren's answer to stack overflow 18715688)

qneill
  • 1,643
  • 14
  • 18
0

Another python-based answer, this one based on the os.path module's native commonprefix function

#!/bin/bash
cat mystream | python -c $'import sys, os; sys.stdout.write(os.path.commonprefix(sys.stdin.readlines()) + b\'\\n\')'

Longform, that's

import sys
import os
sys.stdout.write(
    os.path.commonprefix(sys.stdin.readlines()) + b'\n'
)

/!\ Note: the entire text of the stream will be loaded into memory as python string objects before being crunched with this method


If not buffering the entire stream in memory is a requirement, we can use the communicative property and to the prefix commonality check between every input pair

$!/bin/bash
cat mystream | python -c $'import sys\nimport os\nfor line in sys.stdin:\n\tif not os.path.isfile(line.strip()):\n\t\tcontinue\n\tsys.stdout.write(line)\n') | pythoin sys.stdin:\n\tprefix=os.path.commonprefix([line] + ([prefix] if prefix else []))\nsys.stdout.write(prefix)''

Long form

import sys
import os
prefix = None
for line in sys.stdin:
    prefix=os.path.commonprefix(
        [line] + ([prefix] if prev else [])
    )
sys.stdout.write(prefix)

Both of these methods should be binary-safe, as in they don't need input/output data to be ascii or utf-8 encoded, if you run into encoding errors, python 3 renamed sys.stdin to sys.stdin.buffer and sys.stdout to sys.stdout.buffer, which will not automatically decode/encode input/output streams on use

ThorSummoner
  • 16,657
  • 15
  • 135
  • 147
0

I've generalized @ack's answer to accommodate embedded newlines.

I'll use the following array of strings as a test case:

a=(
  $'/a\n/b/\nc  d\n/\n\ne/f'
  $'/a\n/b/\nc  d\n/\ne/f'
  $'/a\n/b/\nc  d\n/\ne\n/f'
  $'/a\n/b/\nc  d\n/\nef'
)

By inspection we can see that the longest common prefix is

$'/a\n/b/\nc  d\n/\n'

We can compute this and save the result into a variable with the following:

longest_common_prefix=$(
  printf '%s\0' "${a[@]}" \
  | sed -zE '$!{N;s/^(.*).*\x00\1.*$/\1\x00\1/;D;}' \
  | tr \\0 x # replace trailing NUL with a dummy character ①
)
longest_common_prefix=${longest_common_prefix%x} # Remove the dummy character
echo "${longest_common_prefix@Q}" # ②

Result:

$'/a\n/b/\nc  d\n/\n'

as expected. ✔️

I applied this technique in the context of path specs here: https://unix.stackexchange.com/a/639813


① To preserve any trailing newlines in this command substitution, we used the usual technique of appending a dummy character that is chopped off afterwards. We combined the removal of the trailing NUL with the addition of the dummy character (we chose x) in one step using tr \\0 x.

② The ${parameter@Q} expansion results in "a string that is the value of parameter quoted in a format that can be reused as input." – bash reference manual. Requires bash 4.4+ (discussion). Otherwise, you can inspect the result using one of the following:

Robin A. Meade
  • 1,946
  • 18
  • 17