Let's say that I have the following text in a file:

foo.bar.baz
bar.baz
123.foo.bar.baz
pqr.abc.def
xyz.abc.def
abc.def.ghi.jkl
def.ghi.jkl

How would I remove duplicates from the file, on the basis of postfixes? The expected output without duplicates would be:

bar.baz
pqr.abc.def
xyz.abc.def
def.ghi.jkl

(Consider foo.bar.baz and bar.baz. The latter is a postfix of the former, so only bar.baz remains. However, neither pqr.abc.def nor xyz.abc.def is a postfix of the other, so both remain.)

2 Answers

Try this:

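#!/bin/bash

INPUT_FILE="$1"

in="$(cat "$INPUT_FILE")"
out="$in"

# For every line, drop all lines that end in ".<line>"
while IFS= read -r line; do
  # Escape the dots in the line so grep matches them literally
  pattern="${line//./\\.}"
  out=$(echo "$out" | grep -v "\.$pattern\$")
done <<< "$in"

echo "$out"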

You need to save it to a script (e.g. bashor.sh), make it executable (chmod +x bashor.sh) and call it with your input file as the first argument:

./bashor.sh path/to/input.txt
  • I came up with a similar solution. you have one bug though: grep will interpret `.` as any character, so you have to escape it: both the naked `.` and the dots in `$line` (that is, assuming there are no other special characters, and prefixes are always delimited with dot) – Karoly Horvath Feb 18 '14 at 12:31
  • Actually, my solution will be fine if there are only three-character tokens (I intentionally used the `.` as a wildcard). But of course that was a wild assumption, so I escaped the `.` in question... – Michael Schlottke-Lakemper Feb 18 '14 at 12:36
  • yes, that's a wild assumption. if not, `a.a` will match `aaa`. I'm more worried about that escaping... – Karoly Horvath Feb 18 '14 at 12:41
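
The escaping worry raised in the comments above can be sidestepped by not using regular expressions at all. The following is only a minimal sketch, assuming that postfixes are always delimited by a dot; `dedupe_by_suffix` is a hypothetical helper name, not part of the answer. It relies on the shell's `case` pattern matching, where a quoted string is compared literally:

```shell
# Hypothetical helper: print every line of the file "$1" that does not
# end in ".<some other line>". No regular expressions are involved.
dedupe_by_suffix() {
  while IFS= read -r a; do
    keep=1
    # Compare the current line against every line of the same file.
    while IFS= read -r b; do
      [ -n "$b" ] || continue   # skip empty lines
      case $a in
        # The quoted ".$b" is matched literally, so "." is not a wildcard.
        *".$b") keep=0; break ;;
      esac
    done < "$1"
    if [ "$keep" -eq 1 ]; then
      printf '%s\n' "$a"
    fi
  done < "$1"
}
```

Called as `dedupe_by_suffix path/to/input.txt`, it reads the file once per line, so it is a quadratic sketch for small inputs rather than something tuned for large files.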

Use sed to escape each line for use as a regular expression, prepend `.`, append `$`, and pipe the result into GNU grep as a pattern list (`-f -` doesn't work with BSD grep, e.g. on a Mac).

sed 's/[^-A-Za-z0-9_]/\\&/g; s/^/./; s/$/$/' test.txt | grep -vf - test.txt

I just used the regular-expression escaping from another answer and didn't check whether it is reasonable. At first sight it seems fine; it escapes more than necessary, though that is probably not an issue.
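
To make the two stages of the one-liner easier to inspect, here is a hedged sketch that splits them into two hypothetical helpers (`make_suffix_patterns` and `dedupe_with_pattern_file` are illustrative names, not part of the answer). It writes the patterns to a temporary file rather than reading them from standard input; that this helps with BSD grep is an assumption and has not been verified here. For the line bar.baz, the sed stage produces the pattern `.bar\.baz$`.

```shell
# Hypothetical helper: turn each line of "$1" into a pattern of the form
# ".<escaped line>$", using the same sed expression as the one-liner.
make_suffix_patterns() {
  sed 's/[^-A-Za-z0-9_]/\\&/g; s/^/./; s/$/$/' "$1"
}

# Hypothetical helper: same filter as the one-liner, but the pattern list
# goes through a temporary file instead of "-f -".
dedupe_with_pattern_file() {
  patterns=$(mktemp) || return 1
  make_suffix_patterns "$1" > "$patterns"
  # Print only the lines that match none of the patterns.
  grep -v -f "$patterns" "$1"
  rm -f "$patterns"
}
```

Running `make_suffix_patterns test.txt` on its own shows exactly which patterns grep receives, which is a quick way to check whether the escaping does what is wanted for a given input.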

Jens Erat