0

Related question: https://stackoverflow.com/posts/18164848

The input file input.txt is a tab delimited unicode txt with

a  A   e  f  m
b  B   g  h
c  C   i  j
b  B   k  l

I want to match by the first and second column and merge. So I want to get output.txt with

a  A   e  f  m
b  B   g  h     k  l
c  C   i  j

The code has to detect the maximum number of columns in the input. Since it is 5 in this example, "k l" were put from 6th column.

Actually I almost managed to do this using Matlab when they are all numbers. But oh, when they were letters, Matlab was so bad at handling unicode, although I read stackoverflow about how to deal with unicode in Matlab I gave up. So I now turned to python.

Nirk at https://stackoverflow.com/posts/18164848 responded that the following line will do.

awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}'

However this code doesn't seem to specify input and output file.

Community
  • 1
  • 1
user1849133
  • 527
  • 1
  • 7
  • 18

3 Answers3

3

awk is pipe-based linux command. To feed input file and get output, you can do like this: awk -F\t '{a=$1 "\t" $2; $1=$2=""; x[a] = x[a] $0} END {for(y in x) print y,x[y]}' < INPUT.TXT > OUTPUT.TXT

However, the awk program above can hardly match what you need "The code has to detect the maximum number of columns in the input. Since it is 5 in this example, "k l" were put from 6th column.".

You can try this python program:

max_value_fields = 0
values = dict()

with file("input.txt") as f:
    keys = []
    for line in f:
        line    = line.strip()
        fs      = line.split('\t')

        key = '%s\t%s' % (fs[0], fs[1])
        if key not in values:
            values[key] = list()
            keys.append(key)
        values[key].append(fs[2:])

        value_fields = len(fs) - 2
        if value_fields > max_value_fields:
            max_value_fields = value_fields

with file("output.txt", 'w+') as f:
    for key in keys:
        fields = [key]
        for value_list in values[key]:
            fields.extend([value for value in value_list])
            fields.extend(['' for i in xrange(max_value_fields - len(value_list))])
        print >> f, '\t'.join(fields)
Paul Chan
  • 41
  • 2
  • Thank you! This worked but it seems it can't read unicode text? I was failing on unicode text input and when I changed this to UTF-8 it worked. Can it work in all sorts of txt input? – user1849133 Aug 11 '13 at 09:16
  • There are two ways: first, you can use a linux command iconv to change text encoding to UTF-8 before processing; second, you can use mudule codecs to open file instead. codecs.open has an dedicated argument to specify the text encoding. – Paul Chan Aug 11 '13 at 10:31
2

I would read the data twice: a 1st pass for identifying the maximum number of columns, then merge rows in the 2nd pass. Results are printed (basically in random order) in the END.

awk -f script.awk infile infile > outfile

$ cat script.awk 
# BEGIN block: separators
BEGIN { FS = OFS = "\t" }

# Loop #1: detect maximum number of columns
NR == FNR { max = NF > max ? NF : max ; next }

# Loop #2: merge rows
{
    k = $1 FS $2

    if (k in a) {
        for (i = 3; i <= NF; i++) {
            a[k] = a[k] OFS $i
        }
    } else {
        NF = max
        a[k] = $0
    }
}

# END block: Print results
END { for (i in a) print a[i] }
Hermann
  • 166
  • 2
1

Try this:

awk '{x=$1FS$2;$1=$2="";a[x]=a[x]?a[x]FS$0:$0}END{for(x in a) print x,a[x]}' input.txt
jaypal singh
  • 74,723
  • 23
  • 102
  • 147
  • @jaypal_singh Thanks but this didn't work for my real example. What I posted was merely an example. For my real example which is too big to post here, I just added ">output.txt" to your code. And the output seemed not close to the desired output. Also I think this is failing at making tab delimited unicode txt. There was a question about tab delimited output at http://stackoverflow.com/questions/5374239/tab-separated-values-in-awk But I couldn't apply this to my situation. – user1849133 Aug 11 '13 at 06:18