Remove redundant lines for "almost similar" strings

Question

I have the below file:

ab=5
ac=6
ad=5
ba=5
bc=7
bd=4
ca=5
cb=7
cd=3
...

"ab" and "ba", "ac" and "ca", "bc" and "cb" are redundant. How do I eliminate these redundant lines in bash?

Expected output:

ab=5
ac=6
ad=5
bc=7
bd=4
cd=3

You are expected to add your own code/research effort when asking, see https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users... Though, since the question is interesting, you've got plenty of answers this time :) – Sundeep, Dec 30 '17 at 14:23

score 2 · Accepted Answer · answered Dec 30 '17 at 13:16

$ awk '{x=substr($0,1,1); y=substr($0,2,1)} !seen[x>y?x y:y x]++' file
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3

Ed Morton


RomanPerekhrest · Answer 2 · 2017-12-30T11:24:06.797

Short awk solution:

awk '{ c1=substr($0,1,1); c2=substr($0,2,1) }!a[c1 c2]++ && !((c2 c1) in a)' file

c1=substr($0,1,1) - assign the extracted 1st character to variable c1
c2=substr($0,2,1) - assign the extracted 2nd character to variable c2
!a[c1 c2]++ && !((c2 c1) in a) - the crucial condition: print the line only if this 2-character key has not been seen before and its reversed form is not already in the array

The output:

ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
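The condition explained above can also be spelled out step by step. The following is only a sketch, not the answerer's code: the function name dedup_pairs and the variables key and rev are hypothetical. It checks the reversed key before recording the current one, which, as a side effect, also keeps a key made of two identical characters (such as aa); the one-liner above would drop such a line, because after the increment the key is its own reverse.

```shell
# dedup_pairs is a hypothetical helper name used only for this sketch.
dedup_pairs() {
  awk '{
    c1 = substr($0, 1, 1)          # 1st character
    c2 = substr($0, 2, 1)          # 2nd character
    key = c1 c2                    # key as written, e.g. "ba"
    rev = c2 c1                    # reversed key, e.g. "ab"
    # print only if neither order has been recorded so far
    if (!(key in a) && !(rev in a)) print
    a[key]                         # a bare reference creates the entry
  }' "$@"
}

printf '%s\n' ab=5 ba=5 aa=1 | dedup_pairs
# prints:
# ab=5
# aa=1
```

Like the original, this looks only at the first two characters of each line.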

edited Dec 30 '17 at 11:24

answered Dec 30 '17 at 11:11

RomanPerekhrest


score 1 · Answer 3 · answered Dec 30 '17 at 14:22

Here's one with perl, a generic solution that works irrespective of the number of characters before =

$ cat ip.txt
ab=5
ac=6
abd=51
ba=5
bad=23
bc=7
bd=4
ca=5
cb=7
cd=3

$ perl -F= -lane 'print if !$seen{join "",sort split//,$F[0]}++' ip.txt
ab=5
ac=6
abd=51
bc=7
bd=4
cd=3

As in awk, uninitialized variables evaluate to false by default, so !$seen{...}++ is true only the first time a key occurs
-F= uses = as the field separator (with -a, the fields are saved in the @F array)
$F[0] gives the first field, i.e. the characters before =
split//,$F[0] gives a list of the individual characters
sort does string sorting by default
join "" then forms a single string from the sorted characters, with the empty string as separator
See https://perldoc.perl.org/perlrun.html#Command-Switches for documentation on the -lane and -F options. Use -i for in-place editing.
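For comparison, the same sorted-characters key can be sketched in awk. This is an illustration under assumptions, not part of the answer: dedup_sorted_key is a hypothetical function name, and since POSIX awk has no built-in sort, the characters are ordered with a small insertion sort.

```shell
# dedup_sorted_key is a hypothetical helper name used only for this sketch.
dedup_sorted_key() {
  awk -F= '{
    n = length($1)
    # copy the characters before "=" into ch[1..n]
    for (i = 1; i <= n; i++) ch[i] = substr($1, i, 1)
    # insertion sort; substr() returns strings, so ">" compares as strings
    for (i = 2; i <= n; i++) {
      v = ch[i]
      for (j = i - 1; j >= 1 && ch[j] > v; j--) ch[j + 1] = ch[j]
      ch[j + 1] = v
    }
    # join the sorted characters into one canonical key
    key = ""
    for (i = 1; i <= n; i++) key = key ch[i]
    # print only the first line seen for each canonical key
    if (!seen[key]++) print
  }' "$@"
}

printf '%s\n' ab=5 ac=6 abd=51 ba=5 bad=23 bc=7 | dedup_sorted_key
# prints:
# ab=5
# ac=6
# abd=51
# bc=7
```

Called as dedup_sorted_key ip.txt, it reads the file instead of standard input.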

score 0 · Answer 4 · answered Dec 30 '17 at 10:05

Could you please try the following? I wrote and tested it with GNU awk.

awk -F'=' '{
  split($1, array, "")     # GNU awk: an empty separator splits $1 into characters
}
!((array[1], array[2]) in a) {
  a[array[1], array[2]]    # record the pair in both orders, so the
  a[array[2], array[1]]    # reversed key is rejected later
  print
}
' Input_file

Output will be as follows.

ab=5
ac=6
ad=5
bc=7
bd=4
cd=3

RavinderSingh13


Using null as a field separator is undefined behavior per POSIX so only some awks (e.g. GNU awk) will split the string into characters, others will do other things. – Ed Morton Dec 30 '17 at 13:19
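One way to sidestep the portability concern in this comment is to extract the two characters with substr() instead of splitting on an empty separator. The sketch below is an assumption about how that could look, not code from the answer; dedup_portable, first and second are hypothetical names.

```shell
# dedup_portable is a hypothetical helper name used only for this sketch.
dedup_portable() {
  awk -F'=' '{
    first  = substr($1, 1, 1)      # substr() is defined by POSIX, unlike
    second = substr($1, 2, 1)      # split() with an empty separator
  }
  !((first, second) in a) {
    a[first, second]               # record the pair in both orders
    a[second, first]
    print
  }' "$@"
}

printf '%s\n' ab=5 ba=5 bc=7 cb=7 | dedup_portable
# prints:
# ab=5
# bc=7
```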
