4

Let us say I have two relatively large tab-delimited files file1.txt, file2.txt.

file1.txt
id\tcity\tcar\ttype\tmodel

file2.txt 
id\tname\trating

Let us suppose that file1.txt has 2000 unique ids, and therefore 2000 unique rows, and file2.txt has only 1000 unique rows, and therefore 1000 unique ids. Is there a way to merge the two tables?

Case 1. merge them by id in file1.txt, where when there is no id in file2.txt NAs would be filled in.

Case2. merge them by id in file2.txt, where when only the ids in file2.txt will be printed out with the fields in file1.txt and file2.txt.

Note: the merged new files should also be tab-delimited file, with a header file as well. Note2. I'd also appreciate suggestions on how to do it when there is no header as well.

Thanks!

Dnaiel
  • 7,622
  • 23
  • 67
  • 126
  • Define what you mean by "merge" – Bohemian Nov 09 '12 at 00:25
  • http://stackoverflow.com/questions/5467690/how-to-merge-two-files-using-awk?rq=1 or http://stackoverflow.com/questions/6393333/use-unix-join-command-to-merge-two-files?rq=1 or others from the "Related" sidebar may be relevant – David Z Nov 09 '12 at 00:25
  • @Bohemian by merge i mean join – Dnaiel Nov 09 '12 at 00:27
  • @DavidZaslavsky thanks a lot, it looks like a great solution but join requires the files to be sorted, and the awk would not add NAs when there is no data. Do you know of some solution that would add NAs and keep the tabs? – Dnaiel Nov 09 '12 at 00:57

3 Answers3

10
join -j 1 <(sort file1.txt) <(sort file2.txt)

Does your 'case 2' approach with only standard unix tools. Of course, if the files are sorted, you can drop the sort.

If you included the headers, you might rely on the ids being numerical for sorting the joined header to the top:

join -j 1 <(sort file1.txt) <(sort file2.txt) | sort -n

With

  • file1.txt

    id  city    car type    model
    1   york    subaru  impreza king
    2   kampala toyota  corolla sissy
    3   luzern  chrysler    gravity falcon
    
  • file2.txt

    id  name    rating
    3   zanzini PG
    2   tara    X
    
  • output:

    id  city    car type    model   name    rating
    2   kampala toyota  corolla sissy   tara    X
    3   luzern  chrysler    gravity falcon  zanzini PG
    

PS To preserve the TAB separator character, pass the -t option:

 join -t'    ' ...

It's kind of hard to show on SO that ' ' contained a TAB character. Type it with ^VTAB (e.g. in bash)

sehe
  • 374,641
  • 47
  • 450
  • 633
1

This worked for me in case 1:

join -t $'\t' -1 1 -2 1 -a 1 -a 2 <(sort fileone.txt) <(sort filetwo.txt) | sort -n -t $'\t' > filethree.txt

then:

awk '{if(NF+0<7) printf "%s\tNA\tNA\n", $0; else print $0}' filethree.txt

AWE
  • 4,045
  • 9
  • 33
  • 42
0

Try doing this :

perl -lane '
    END{print "$_$h{$_}" for sort keys %h}
    $h{$F[0]} .= "\t" .  join "\t", @F[1..$#F];
' file1.txt file2.txt

This script join on ids (first col).

Gilles Quénot
  • 173,512
  • 41
  • 224
  • 223
  • thanks! it works but with some caveats: (1) what if the id is in column 3 in file1.txt, and in column 10 in file2.txt? How would I change the code accordingly? (2) It outputs the header at the end instead of keeping the header on the first line. How could I keep the header on top? Thanks again! – Dnaiel Nov 09 '12 at 00:46
  • oh one more caveat (3) for some odd reason the output was not tab delimited between column 1 and 2. – Dnaiel Nov 09 '12 at 00:48
  • perl -v This is perl, v5.8.8 built for x86_64-linux-thread-multi – Dnaiel Nov 09 '12 at 00:52
  • hahahaha just tried the last version but I still have the same issues (1)-(3) not sure why, quite odd. – Dnaiel Nov 09 '12 at 00:59
  • also, how could I add tab delimited NAs when there is no data? is it possible? I still run into the caveat of header at bottom, and somehow there is no tab in between the last col of file 1 and file 2, and as i said no NAs... thanks, sorry to keep bothering you. – Dnaiel Nov 09 '12 at 01:03
  • You can filter the input files before processing with the perl script for your extra requirements. – Gilles Quénot Nov 09 '12 at 01:22