
Let's say I have a file as below:

number 2 6 7 10 number 6 13  
name1 A B C D name1 B E   
name2 A B C D name2 B E  
name3 B A D A name3 A F  
name4 B A D A name4 A F  

I wish to remove the columns that are entirely duplicated, so that the output file is as below:

number 2 6 7 10 13  
name1 A B C D E   
name2 A B C D E  
name3 B A D A F  
name4 B A D A F  

I use the sort and uniq commands for lines but have never known how to do this for columns. Can anyone suggest a good way?

bison72

5 Answers


Here is a way with awk that preserves the order.

awk 'NR==1{for(i=1;i<=NF;i++)b[$i]++&&a[i]}{for(i in a)$i="";gsub(" +"," ")}1' file

Output

number 2 6 7 10 13  
name1 A B C D E   
name2 A B C D E  
name3 B A D A F  
name4 B A D A F  

How it works

NR==1

If it is the first record

for(i=1;i<=NF;i++)

A loop over the fields, NF is the number of fields

b[$i]++&&a[i]

If there has been more than one occurrence of $i (the data contained in field i), then add an element to array a with the key i. Merely referencing a[i] is enough to create the key in awk.

This next block is executed on all records (including record one).

{for(i in a)$i="";

For every key in a, set the corresponding field to the empty string.

gsub(" +"," ")

Remove the extra spaces left by the emptied fields

1

Always evaluates to true, so all records are printed.
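
For readability, here is the same logic written out as a commented script (an equivalent sketch; the explicit a[i] = 1 replaces the one-liner's trick of creating the key by merely referencing a[i]):

# dedup_cols.awk -- expanded form of the one-liner above
NR == 1 {
  for (i = 1; i <= NF; i++)
    if (b[$i]++)      # this header value already appeared in an earlier column
      a[i] = 1        # mark column i for removal
}
{
  for (i in a)
    $i = ""           # blank every marked column (this rebuilds $0 with OFS)
  gsub(" +", " ")     # squeeze the double spaces the blanking leaves behind
  print               # the trailing 1 in the one-liner does this implicitly
}

Run it with awk -f dedup_cols.awk file.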

Community
  • 1
  • 1
2

This Perl one-liner will do the trick:

perl -an -e '@cols = grep { !$seen{$F[$_]}++ } 0..$#F unless @cols; print join " ", @F[@cols],"\n"' inputfile

-a splits each line of inputfile into @F. The first line of the file is used to construct the list of column indexes from left to right, keeping only those which are unseen. Next it prints the slice of @F containing just those columns for each line.
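
Assuming the sample above is saved as inputfile, running the one-liner reproduces the desired output:

$ perl -an -e '@cols = grep { !$seen{$F[$_]}++ } 0..$#F unless @cols; print join " ", @F[@cols],"\n"' inputfile
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F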

Ben Grimm
  • new to perl. what does `$_` mean? – qqibrow Jan 27 '15 at 05:15
  • `$_` is the default variable. If you read the statement from right to left: unless `@cols` is already populated, then for each value (`$_`) from `0` through the last index of `@F`, test whether the hash `%seen` has not yet been set for the value `$F[$_]` (and increment it so the test fails when that value is seen again); if it has not been seen, `grep` passes the index in `$_` through to `@cols`. – Ben Grimm Jan 27 '15 at 05:21

You can use awk (save the following as a script, e.g. dedup.awk, and run it with awk -f dedup.awk file):

NR == 1 {
  for (ii = 1; ii <= NF; ii++) {
    cols[$ii] = ii            # map each distinct header value to a column index
  }
  for (ii in cols) {
    printf "%s ", ii          # print the unique header values
  }
  print ""
}

NR > 1 {
  for (ii in cols) {
    printf "%s ", $cols[ii]   # print the field at the remembered index
  }
  print ""
}

The above may reorder the columns, but a bit more effort could fix that if necessary.
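
A minimal sketch of one such fix, recording the kept column indexes in first-seen order instead of iterating over the hash (like the script above, it compares header values only):

NR == 1 {
  n = 0
  for (ii = 1; ii <= NF; ii++)
    if (!($ii in seen)) {
      seen[$ii] = 1     # first time this header value appears
      keep[++n] = ii    # remember its column index, in order
    }
}
{
  for (k = 1; k <= n; k++)
    printf "%s%s", $keep[k], (k < n ? OFS : ORS)
}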

John Zwinck

Removing duplicate lines can be done with just one awk command:

awk '!a[$0]++'

This keeps track of the number of times a line has appeared. Once a line has appeared, a[this row] equals 1, so when it comes again a[this row] is already true and the ! negates the condition, so it is not printed.
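
A quick illustration with a throwaway input:

$ printf 'x\ny\nx\n' | awk '!a[$0]++'
x
y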

In your case, you want to remove the duplicate columns. But what about creating a transpose function to convert rows into columns and vice versa?

I already did it in my answer to Using bash to sort data horizontally:

transpose () {
  awk '{for (i=1; i<=NF; i++) a[i,NR]=$i; max=(max<NF?NF:max)}  # store each field; track the widest row
        END {for (i=1; i<=max; i++)                             # each former column becomes a row
              {for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j<NR?OFS:ORS)         # OFS between values, ORS at end of row
              }
        }'
}
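
For instance, transposing a small throwaway matrix:

$ printf '1 2 3\n4 5 6\n' | transpose
1 4
2 5
3 6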

Then, it becomes trivial:

$ cat file | transpose | awk '!a[$0]++' | transpose
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F
fedorqui

Here is a way that fully works with Python 3, preserving the order:

$ cat remove_duplicate_fields.py
import csv
import sys

reader = csv.reader(sys.stdin, delimiter=' ')

# Gather the input column by column: columns[i] holds the values of column i.
columns = []
for row in reader:
    for i, col in enumerate(row):
        if i >= len(columns):
            columns.append([col])
        else:
            columns[i].append(col)

# Transposing twice turns each column into a hashable tuple for the set.
seen_columns = set()
unique_columns = []
matrix_transpose = [list(i) for i in zip(*columns)]   # the rows of the input
for col in zip(*matrix_transpose):                    # the columns, as tuples
    if col not in seen_columns:
        seen_columns.add(col)
        unique_columns.append(col)

# Transpose the kept columns back into rows and print them.
for column in zip(*unique_columns):
    print(' '.join(column))

It gives the requested result (first the input is shown, then it is piped through the script):

$ echo $'number 2 6 7 10 number 6 13\nname1 A B C D name1 B E\nname2 A B C D name2 B E\nname3 B A D A name3 A F\nname4 B A D A name4 A F'
number 2 6 7 10 number 6 13
name1 A B C D name1 B E
name2 A B C D name2 B E
name3 B A D A name3 A F
name4 B A D A name4 A F

$ echo $'number 2 6 7 10 number 6 13\nname1 A B C D name1 B E\nname2 A B C D name2 B E\nname3 B A D A name3 A F\nname4 B A D A name4 A F' | python3 remove_duplicate_fields.py
number 2 6 7 10 13
name1 A B C D E
name2 A B C D E
name3 B A D A F
name4 B A D A F

In addition, it also gives a correct result when the third field in the second line of the input is changed to 1: the script no longer removes column 7, because column 3 is no longer identical to column 7.

$ echo $'number 2 6 7 10 number 6 13\nname1 A 1 C D name1 B E\nname2 A B C D name2 B E\nname3 B A D A name3 A F\nname4 B A D A name4 A F' | python3 remove_duplicate_fields.py
number 2 6 7 10 6 13
name1 A 1 C D B E
name2 A B C D B E
name3 B A D A A F
name4 B A D A A F
Yishayg