3

I'd like to use a script to do the following. I have two files. I'd like to replace the contents of column 2 ("A1" in the example below) and 3 ("A2" in the example below) in file2 with the contents of column 2 ("A1_new" in the example below) and 3 ("A2_new" in the example below) in file1. There are two conditions:

  • This has to be done based on a key-variable ("SNP" in the examples below) which is present in both files in column 1. Meaning: the script should check the presence of key-variable X in file1 and file2 and then replace the contents of file2 with file1 as described above. If there is no match, nothing should be done.
  • Both files are very large: file2 is about 38Gb in size (1,529 columns and ~8.3 million rows), file1 is smaller as it only has 6 columns instead of 1,529.
  • I can make file1 such that it will only contain the changes to be implemented. The number of rows would then be around 8,000 I guesstimate.
  • I should also note that the number of keys isn't necessarily equal nor in sorted in the same order!

Below I have put the heads of the two files. So the expected result would look like this for the_new_file2:

SNP A1 A2 UPID00126-UPID00126-A318-0034 UPID01799-UPID01799-A318-0492 UPID01890-UPID01890-A318-0151 UPID00306-UPID00306-A318-0245 UPID00629-UPID00629-A318-0033 UPID01934-UPID01934-A318-0215 UPID00526-UPID00526-A318-0526
rs12565286 C G 0.0190125 0.0339966 0.0130005 0.00500488 0.0109863 0.0180054 0.00500488
rs11804171 A T 0.0249939 0.154999 0.0310059 0.0169983 0.0159912 0.0230103 0.0159912
rs2977670 C G 1.95801 1.745 1.31201 1.97299 1.97198 1.96301 1.97299
rs143225517 C T 0 0.93399 0.989014 0 0 0 0
rs3094315 A G 2 1 1 2 2 2 2
rs3131972 G A 1.99701 0.997009 0.997986 1.99701 1.99499 1.987 1.99701
rs3131971 C T 1.89301 1.07901 1.08902 1.98901 1.86801 1.92798 1.982
rs2073814 G C 1.98001 0.983002 0.993988 1.97601 1.98401 1.979 1.97601
rs2073813 A G 0.00299072 0.915039 0.899017 0.00299072 0.00299072 0.00201416 0.00299072
rs3131969 G A 1.996 1.03097 1.09598 1.99701 1.99701 1.99701 1.996
rs3131968 G A 1.996 1.02801 1.09396 1.996 1.99701 1.99701 1.99499
rs3131967 C T 1.931 1.01202 1.08701 1.98999 1.93298 1.97101 1.987
rs3115859 A G 1.996 1.02502 1.09497 1.99701 1.99399 1.98599 1.991
rs34927559 G A 1.91898 1.61298 1.845 1.91901 1.94299 1.77402 1.952
rs115173026 T G 0.423004 0.420013 0.11499 1.09097 0.51001 0.618988 0.114013
chr1:956061:I I D 0.356018 0.395996 0.10202 1.09903 0.530975 0.505005 0.0679932
rs116130865 T C 0.420014 0.420013 0.113983 1.09003 0.503968 0.535004 0.11203
rs111893801 T C 0.0239868 0.0180054 0.00601196 0.0209961 0.019989 0.0610046 0.00299072

You can see in bold in the above "expected result" and below in the head of file1 and file2 that columns 2 and 2 for key "chr1:956061:I" changed. From "AG G" to "I D".

I do have some experience in bash. But I'm thinking awk will be quite slow, and probably perl or python might be much faster. I've seen something here: How to search and replace text in a file using Python?. But that is just replacing, and not based on a key-variable.

It would be awesome if someone can give me some pointers on how to go about this.

Thanks!


Upon popular request below the head of file 1 and file 2.

Head file1

SNP A1_new A2_new
chr1:956061:I I D
chr1:3716325:I I D
chr1:4095845:I I D
chr1:4097079:I I D

Head file2

SNP A1 A2 UPID00126-UPID00126-A318-0034 UPID01799-UPID01799-A318-0492 UPID01890-UPID01890-A318-0151 UPID00306-UPID00306-A318-0245 UPID00629-UPID00629-A318-0033 UPID01934-UPID01934-A318-0215 UPID00526-UPID00526-A318-0526
rs12565286 C G 0.0190125 0.0339966 0.0130005 0.00500488 0.0109863 0.0180054 0.00500488
rs11804171 A T 0.0249939 0.154999 0.0310059 0.0169983 0.0159912 0.0230103 0.0159912
rs2977670 C G 1.95801 1.745 1.31201 1.97299 1.97198 1.96301 1.97299
rs143225517 C T 0 0.93399 0.989014 0 0 0 0
rs3094315 A G 2 1 1 2 2 2 2
rs3131972 G A 1.99701 0.997009 0.997986 1.99701 1.99499 1.987 1.99701
rs3131971 C T 1.89301 1.07901 1.08902 1.98901 1.86801 1.92798 1.982
rs2073814 G C 1.98001 0.983002 0.993988 1.97601 1.98401 1.979 1.97601
rs2073813 A G 0.00299072 0.915039 0.899017 0.00299072 0.00299072 0.00201416 0.00299072
rs3131969 G A 1.996 1.03097 1.09598 1.99701 1.99701 1.99701 1.996
rs3131968 G A 1.996 1.02801 1.09396 1.996 1.99701 1.99701 1.99499
rs3131967 C T 1.931 1.01202 1.08701 1.98999 1.93298 1.97101 1.987
rs3115859 A G 1.996 1.02502 1.09497 1.99701 1.99399 1.98599 1.991
rs34927559 G A 1.91898 1.61298 1.845 1.91901 1.94299 1.77402 1.952
rs115173026 T G 0.423004 0.420013 0.11499 1.09097 0.51001 0.618988 0.114013
chr1:956061:I AG A 0.356018 0.395996 0.10202 1.09903 0.530975 0.505005 0.0679932
rs116130865 T C 0.420014 0.420013 0.113983 1.09003 0.503968 0.535004 0.11203
rs111893801 T C 0.0239868 0.0180054 0.00601196 0.0209961 0.019989 0.0610046 0.00299072
rs9777931 T C 1.54102 1.27399 1.75003 0.863983 1.46597 1.33801 1.07199
rs3121553 C G 0.419007 0.364014 0.112976 1.08801 0.562989 0.546997 0.110016
rs809912 T C 1 1 0 1 1 0 1
rs196128 G T 1 1 0 1 1 0 1
rs262676 A T 1 1 0 1 1 0 1
rs196127 A T 1 1 0 1 1 0 1
rs149435595 C A 0.019989 0.0509949 0.478027 0.0580139 0.337006 0.0169983 0.705994
rs12082939 T C 1.021 1.17899 0.442963 1.22 1.008 0.00900269 1.056
rs262675 G A 0.998993 0.822998 0 1 0.997986 0 0.971985
rs262672 T G 1 1 0 1 1 0 1
rs262671 A C 1 1 0 1 0.997986 0 1
rs3107126 A C 0 0 0.0039978 0.0119934 0 0.00100708 0.0150146
rs11583093 A G 0.00799561 0.184998 0.0130005 0.00698853 0 0.0159912 0.00900269
rs262670 T C 1 1 0 1 1 0 1
rs262669 A G 1 1 0.0039978 1.01199 1 0 1.01501
rs9442619 T C 0 0 0.0039978 0.0109863 0 0 0.0150146
rs2257182 C T 1.021 1.17899 0.442963 1.22 1.008 0.00900269 1.056
rs262668 G A 1 1 0.0039978 1.01199 1 0 1.01501
rs262667 C T 1 1 0 1 1 0 1
rs34944978 A G 0 0 0 0 0 0 0
rs262665 G A 1 1 0 1 1 0 1
rs262664 C G 1 1 0.0039978 1.01199 1 0 1.01501
rs2503704 C T 1.43301 1.43503 1.28302 1.5 1.86999 1.444 1.85998
rs262663 C T 1 1 0.0039978 1.01199 1 0.00100708 1.01501
rs262662 G A 1 1 0.0039978 1.01199 1 0 1.01501
rs3107125 A G 0 0 0 0 0 0 0
rs2459984 G A 1.021 1.17899 0.442963 1.22 1.008 0.00799561 1.056
rs61776606 A G 0.0169983 0 0.00299072 0.0289917 0.00601196 0.676025 0
rs2459983 G A 1.39798 1.858 1.871 1.798 1.91199 1.54697 1.92499
rs262661 C T 1 1 0 1 1 0 1
rs262660 G A 1 1 0 1 1 0 1
chr1:2087094:I AATG A 0.997986 1.00101 0.00900269 1.04199 1 0.00201416 1.01599
rs3123592 C T 0 0 0.0039978 0.0109863 0 0 0.0159912
rs381664 G A 1 1 0 1 1 0 1
rs72629470 A G 1 1 0 1.00101 0 0 0
rs151048557 T C 0.0039978 0 0 0.00201416 0.0180054 0.00100708 0
rs72629471 G C 1 1 0 1 0 0 0
rs72629472 C G 1 0 0 1 0 0 0
rs116329948 T C 0 0.266998 0 0 0 0.0039978 0
rs16823335 A G 1 1 0 1 0 0 0
rs12028094 T C 1 1 0 1 0 0 0
rs141183701 T C 0 0 0.00201416 0 0.00100708 0.00299072 0.0150146
rs10797377 A G 1 1.004 0 1 0 0 0
rs72629473 A G 0.00201416 0 0 0.985992 0 0 0
rs12063869 A T 1 1.004 0 1 0 0 0
rs2889476 C T 1.08701 1.30301 0.019989 1.004 0.052002 0.00299072 0.0100098
rs946175 G A 1.08801 1.30301 0.019989 1.004 0.052002 0.00299072 0.00900269
rs946176 T C 1 1 0 1 0 0 0
rs946177 G A 1.08701 1.302 0.019989 1.004 0.052002 0.00299072 0.00900269
rs61763644 T C 0.0710144 0.0039978 0.0690002 0.0209961 0.00100708 0 0
rs2027262 C T 1 1 0 1 0 0 0
rs2027263 G A 1 1.004 0 1 0.00100708 0 0
rs2027264 A G 1 1 0 1 0 0 0
rs111412502 T C 1 1.00201 0 1 0 0 0
rs11802804 G T 1 1.00699 0 1 0.00100708 0 0
rs111668563 G T 1 0.944 0 1 0 0 0
rs112539394 A G 1 0.941986 0 1 0 0 0
rs12028439 G A 1.00601 1.01001 0 1.00201 0.00100708 0 0
rs12024879 A G 1 1.00903 0 1 0.00100708 0 0
rs12027818 T C 1.02399 1.013 0.00100708 1.03799 0.131989 0 0.00100708
rs6685625 C T 0.109985 0.0169983 0.00799561 0.131012 0.0390015 0.00299072 0.14801
rs12027883 T C 1 1.00998 0 1 0 0 0
rs7534897 T C 0.0130005 0.596985 0.562012 0.00500488 0.0159912 1.11197 0.317993
rs6681651 T C 1.09698 1.591 0.0390015 1.01999 0.0539856 0.00201416 0.00900269
rs6691208 A C 1.091 1.65201 0.019989 1.005 0.0539856 0.00299072 0.00799561
rs4357486 C G 1 1.04901 0.0700073 1 0 0 0.00299072
rs4486390 T C 1 1.04901 0.0700073 1 0 0 0.00299072
rs7536203 A G 0.437012 0 0.00100708 0.0119934 0.148987 0.00201416 0
rs59869348 C T 1.091 1.629 0.019989 1.004 0.0549927 0.00201416 0.00799561
rs59041920 T C 1.091 1.629 0.0190125 1.004 0.0549927 0.00201416 0.00799561
rs60366070 T C 0.0910034 0.604034 0.019989 0.00500488 0.0549927 0.00201416 0.00799561
rs56401121 A G 0.0910034 0.604034 0.019989 0.00500488 0.0549927 0.00201416 0.00799561
rs72629480 T C 1 1.061 0 1 0 0 0
rs58317195 G A 0.0910034 0.60498 0.019989 0.00500488 0.0549927 0.00201416 0.00799561
rs114933164 C A 0.645996 0.54602 0.0190125 0.0039978 0.0549927 0.00201416 0.00698853
rs6680039 G C 0.0920105 0.603027 0.0190125 0.00500488 0.0549927 0.00201416 0.00799561
rs114412925 A G 0.0270081 0 0 0.0450134 0.131012 0 0
rs6680573 G C 1.09201 1.67703 0.0190125 1.005 0.0559998 0.00201416 0.00799561
rs56200661 T C 1 0.0039978 0 0.919006 0 0 0
rs72629481 A G 1 0.0039978 0 0.919006 0 0 0
rs17389878 A G 0 0 0 0 0 0 0
rs116142632 T G 0.00500488 0.00201416 0.274994 0 0.00799561 0.567017 0.127991
rs16823357 C T 0 0.609985 0 0 0.00100708 0 0.00100708
rs12564456 G C 1 1.974 0 1 0.00100708 0.00100708 0.00201416
rs55949537 T A 1 1.978 0 1.00201 0.00100708 0 0.00201416
rs55784203 T G 1 1.98001 0 1.00201 0.00100708 0 0.00201416
rs56091202 G A 1 1.98001 0 1.00201 0.00100708 0 0.00201416
rs10909862 A G 1 0.0190125 2 0.989014 1.99799 1.99799 1.99701
rs10492936 A G 0 0.615021 0 0 0 0 0.00201416
rs67758087 G A 1 1.98401 0 1.00201 0.00100708 0 0.00201416
rs12033927 G A 1 1.37503 0 1.00101 0 0 0
rs12034794 G T 1 1.98999 0 1.00201 0.00100708 0 0.00201416
rs4576609 T C 1 1.99301 0 1 0 0 0
rs3795262 G T 1 1.98901 0 1 0 0 0
rs3795263 A G 0.0950012 0 0.453033 0.627014 1.10703 0.880981 0.627991
rs55873718 G A 1 1.996 0 1 0 0 0
rs61763646 A G 0.052002 0 0.0910034 0.0209961 0.0039978 0 0
rs56176404 G A 1 1.38098 0 1 0 0 0
rs7412983 C A 1 1.99701 0 1 0 0 0
rs58265521 A C 1 1.99701 0 1 0 0 0
rs117759489 A C 0.00201416 0 0.0039978 0.0119934 0.00201416 0 0
rs60287244 G A 1 1.99701 0 1 0 0 0
rs12034766 G A 1 1.99701 0 1 0 0 0
rs12033401 T C 1 1.35101 0 1 0 0 0
rs2142569 A G 0.105988 0 1.00403 0.628998 1.11801 1.992 0.930023
rs143965216 T C 0 0 0.00100708 0 0 0 0
rs57893585 A T 1 1.99799 0 1 0 0 0
rs189380075 G C 0 0.00299072 0.026001 0.00100708 0 0 0
rs6693975 A G 1 1.362 0 1 0 0 0
rs6693982 A G 0.878998 1.991 0 1 0 0 0
rs141204614 T G 0.00500488 0.00100708 0.276001 0 0.00799561 1.05502 0.130005
rs12038530 A T 1 1.99799 0 1 0 0 0
rs12037307 A C 1 1.35101 0 1 0 0 0
rs12039548 C T 1 1.99899 0 1 0 0 0
rs12034573 A G 1 1.99899 0 1 0 0 0
rs10909864 C T 1 2 0 1 0 0 0
rs1890338 T C 1 2 0 1 0 0 0
rs58311021 A G 0 0.621979 0 0 0 0 0
rs12039132 T C 1 1.97199 0 1 0 0 0
rs7514166 C G 1 2 0 1 0 0 0
rs34408665 C T 1 2 0 1 0 0 0
rs58052947 C T 1 2 0 1 0 0 0
rs56398543 A G 1 2 0 1 0 0 0
rs6661908 A G 1 2 0 1 0 0 0
rs41334845 T G 1 0.269012 0 1 0 0 0
rs6679678 C T 1 2 0 1 0 0 0
rs12042403 C T 0.45401 1.99799 0 1 0 0 0
rs12041583 G A 0.45401 1.99799 0 1 0 0 0
rs12041634 T A 1 2 0 1 0 0 0
rs12040299 T C 1 2 0 1 0 0 0
rs10489589 C T 1 2 0 1 0 0 0
rs72629493 A G 1 2 0 1 0 0 0
rs72629494 C T 1 2 0 1 0 0 0
rs12408197 T C 1 2 0 1 0 0 0
rs67025412 C T 1 2 0 1 0 0 0
rs10489588 A G 1 2 0 1 0 0 0
rs67630839 T C 1 2 0 1 0 0 0
rs78841195 G A 0 0 0 0 0 0 0
rs1005471 C T 1 2 0 1 0 0 0
rs9660819 T C 1 2 0 1 0 0 0
rs9661525 G T 1 2 0 1 0 0 0
rs72629496 A G 1 0.269012 0 1 0 0 0
rs16823392 G C 1 2 0 1 0 0 0
rs116556794 T C 0 0 0 0 0 0 0
rs74048148 G A 1 1.99701 0 1 0 0 0
rs77415878 G C 1 1.99701 0 1 0 0 0
rs113272167 T C 1 1.99701 0 1 0 0 0
rs112690788 G A 1 1.99701 0 1 0 0 0
rs72856433 A G 1 1.99701 0 1 0 0 0
rs72629497 C T 1 2 0 1 0 0 0
rs67872056 T C 1 2 0 1 0 0 0
rs72629498 T C 1 1.35199 0 1 0 0 0
rs67289919 A G 1 2 0 1 0 0 0
rs67919793 C T 1 2 0 1 0 0 0
rs67908307 C T 1 2 0 1 0 0 0
rs75400926 C T 0.877991 1.82599 0 0.928009 0 0 0
rs12036378 A G 1 2 0.00100708 1 0 0 0
rs55863899 G C 1 0.268005 0 1 0 0 0
rs74048152 T G 0 0.64801 0 0 0 0 0
rs66693767 G A 1 2 0 1 0 0 0
rs10909865 G A 1 2 0.00299072 1 0 0 0
rs72856440 G A 1 2 0 1 0 0 0
rs2993510 C T 1 2 0 1 0 0 0
rs113884577 A G 0.0180054 0 0.0039978 0.0239868 0.0039978 0 0
rs12567620 G T 1 2 0 1 0 0 0
rs12566487 A G 0 0.64801 0 0 0 0 0
rs12409277 C T 1 2 0 1 0 0 0
rs12044162 G A 1 1.992 0 1 0 0 0
rs12724379 A C 0.00100708 0 0.312988 0.0039978 0.00100708 0 0
rs6684233 A C 0 0.650024 0 0 0 0 0
rs6692292 C T 1 1.93701 0 1 0.00201416 0 0
rs56213325 A G 0.0039978 0 0 0.0190125 0.00100708 0.00100708 0.00201416
rs12045368 G A 1 1.28 0 1 0.00201416 0 0
chr1:2960196:I TCAAACAAA T 1 1.931 0 0.98999 0.00201416 0 0
rs10909866 T C 1 1.92801 0 1 0.00201416 0 0
rs12046233 G A 1 1.28 0 1 0.00201416 0 0
rs12044848 T C 1 1.92499 0 1 0.00201416 0 0
rs12120940 A G 0.0150146 0.0180054 0.0710144 0.0299988 0.0100098 0.0820007 0.00900269
rs2651900 T G 0.388 0.0039978 0.665985 0.230011 0.600006 0 0.21698
rs2075968 T C 0.442993 0 0.312988 0.756989 0.320007 0 0.888001
rs1627080 C T 1.00201 0.0969849 1.009 1.008 1 0.00799561 2
rs116609258 A C 0.0820007 0 0.556 0.00900269 0.0830078 0 0.019989
rs117719590 C T 0.0830078 0 0.0230103 0.0100098 0.0710144 0 0.871978
rs2817128 C T 2 2 2 2 1.99799 2 1.91199
rs2817129 C G 1.68399 1.923 1.327 1.867 1.48898 1.99399 1.84402
rs79551700 T C 0.07901 0.00100708 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
rs35019643 G C 0.07901 0.00100708 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
chr1:3089167:I CG C 0.07901 0.00100708 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
rs34503920 T C 0.07901 0.00201416 0.0039978 0.0429993 0.0570068 0.00201416 0.0320129
rs2455110 A G 0.132995 0.0329895 0.0870361 0.0930176 0.510986 0.137024 0.052002
rs56788035 A G 0.0700074 0.00799561 0.0629883 0.0209961 0.427032 0.0799866 0.0409851
rs2455109 T A 0.15799 0.0549927 0.0950317 0.557984 0.529023 0.150024 0.10199
rs11807062 C A 0.15799 0.332031 0.0940246 0.543 0.526001 0.150024 0.0969849
rs60951434 A G 0.0700074 0.00799561 0.0629883 0.0220032 0.485992 0.0799866 0.0409851
rs113002511 G A 0.00100708 0.00799561 0.00601196 0.0100098 0.0100098 0.0100098 0.00201416
rs4648372 G A 0.156983 0.32901 0.0950317 0.55603 0.473999 0.0809937 0.0960083
rs4648373 T C 0.156983 0.32901 0.0940246 0.548981 0.466003 0.0809937 0.0549927
rs10909907 G A 0.165009 0.32901 0.0950317 0.55603 0.843994 0.0849915 0.0950012
rs2249336 C T 0.132995 0.307984 0.0910339 0.107971 0.46698 0.0700073 0.0539856
rs2249319 T C 0.0710144 0.00799561 0.0650024 0.0239868 0.437988 0.0150146 0.0400085
rs2249318 T C 0.0710144 0.00799561 0.0650024 0.0239868 0.437988 0.0150146 0.0400085
rs2455112 G A 0.413971 0.323975 0.36502 0.163025 0.5 0.369018 1.13498
rs12755504 G A 0.0419922 0.00799561 0.0599976 0.026001 0.378021 0.0130005 0.0390015
rs12071366 G C 1 0 0.00100708 0 0 0 0
rs6424052 G A 0 0.998993 0.0509949 1.98901 0 0 0.993988
rs7543006 G C 1 0 0 0 0 0 0
rs56710549 G C 0 0 0.234985 0 0 0 0
rs140255410 A G 0 0.00601196 0.0130005 0 0.0190125 0.00201416 0.00299072
rs111484302 G A 0 0 0.467987 0 0 0 0
rs9424286 T G 0.0100098 0 0.0379944 0 0 0 0.00299072
rs112460330 C T 0 0 0 0 0 0 0
chr1:3716325:I GCCCTGCACCC G 0.00100708 0 0 0.0039978 0 0 0
rs9424299 G A 1.96399 1.72199 1.73801 2 0.850006 0.98703 1.85998
rs60615207 A G 0.0509949 0.00698853 0.237 0.0180054 0.0249939 0 0.0339966
rs77743432 A C 0.473999 0 0.00100708 0 0 0 0

Community
  • 1
  • 1
  • You should expand with more detail of how that 'key-variable` behavior is supposed to work.. Tentatively, I'd probably use perl with `split` to break the columns into arrays, and then compare. But details? Fuzzy questions get fuzzy answers – infixed Mar 22 '16 at 17:10
  • I've changed it. Thanks for the feedback. – Sander W. van der Laan Mar 22 '16 at 17:40
  • 1
    Edit your question to include some concise, testable sample input and expected output to remove some of the guesswork. sed and awk are 2 completely different tools, there is no `awk/sed` just as there is no `perl/grep`. This is not a job for sed at all and there's no reason to assume an awk solution would be slower than a perl or python one. – Ed Morton Mar 22 '16 at 18:05
  • I've edited the post/question. I've included `head` of file1 and file2, as well as an expected result. I hope this is informative and one can help me instead of putting this question on hold. I understand the question might be to broad, however try to use the principle of charity: imagine you're a biologist with zero education in scripting trying to find a solution for this problem by asking people that are (more) educated in scripting for help. I would think that would go a long way. @thissuitisblacknot @andras-deak @mseifert @prasun @mpromonet . – Sander W. van der Laan Mar 24 '16 at 09:15

4 Answers4

2

You could use the command-line tool csvfix.

First you join the two tables together using the first column, then you only choose the rows that you want from the joined rows.

For example, if file1 has 6 columns and file2 has 1,529 columns, the joined rows will have 1,534 fields. (The key-variable isn't repeated.) So if you need the key-variable, columns 2 and 3 from file1, and the other columns from file2, you can specify them like so:

csvfix join -f 1:1 file2.csv file1.csv | csvfix order -f 1,1530,1531,4:1529

Note that file2.csv is listed first to preserve the order of that file's rows in the output.

If your fields are not delimited by commas, you can pick another separator using the -sep option.

nofinator
  • 2,906
  • 21
  • 25
  • Would this be fast? And do I/we assume that the files are in exactly the same order? – Sander W. van der Laan Mar 22 '16 at 17:41
  • @SanderW.vanderLaan It won't be terribly fast because the join step needs to pull both the enormous input files into memory. The order step should be more efficient at streaming the output to a file. Also, it looks like the ordering of the join output is based on the first file, not the second. I edited the answer so the row ordering in the output matches that of file2. – nofinator Mar 22 '16 at 18:34
1

Without sample input and expected output it's just a guess but this may be what you want:

$ cat file1
a b c
d e f
g h i

$ cat file2
a 2 3
e 4 5
h 6 7

$ join -o '1.1 2.2 2.3' file2 file1
a b c

or maybe:

$ join -a 2 file2 file1 | cut -d' ' -f1,2,3
a 2 3
d e f
g h i

or:

$ join -a 1 file2 file1 | awk '{print $1, $(NF-1), $NF}'
a b c
e 4 5
h 6 7

or....

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
1

Updated   for clarifications: first columns need match each other; and, the other file is much smaller.


Processes line at a time of file1 and once it runs out copy the rest of file2.

use warnings;
use strict;

my $file1 = 'file1.txt';
my $file2 = 'file2.txt';
my $outfile = 'new_file2.txt';

open my $fh1, '<', $file1 or die "Can't open $file1: $!";
open my $fh2, '<', $file2 or die "Can't open $file2: $!";
open my $outfh, '>', $outfile;

my ($line1, @ln1, @ln2);
# Need be certain that file2 >= file1 (by number of lines)
while($line1 = <$fh1>) {
    @ln2 = split '\s+', <$fh2>, 4;
    @ln1 = split '\s+', $line1, 4;
    if ($ln1[0] eq $ln2[0]) 
    {
        @ln2[1,2] = @ln1[1,2];
    }
    print $outfh "@ln2";
}

# Copy the rest of file2
while (<$fh2>) { print $outfh }

# Close filehandles if not the end of script

Limit split to 4 -- the needed fields and the rest of the line. This can be sped up further, but the file1 is specified to be small, and by a factor of thousand in comparison with file2.

zdim
  • 64,580
  • 5
  • 52
  • 81
  • 1
    if his file2 was small enough, I considered making a hash array for file2 with the key being the column 1, the contents being columns 2+3, Then for each line in file 1, uses the `exists` function to see if the key exists in the hash. If so, replace the column 2-3 members with what is in the hash. The join aand print the result. But this approach depends on wheter you can get file2's hash table to fit in memory – infixed Mar 22 '16 at 19:43
  • @infixed It's not small -- OP say 8.3 million rows, and long ones at that. – zdim Mar 22 '16 at 19:45
  • 1
    Oh, I was thinking of them backwards. Even if I refactor my thinking, if file1 had 8 million rows, it's probably a problem even with only 6 items per row. Perhaps one of the files can be split to manageable sizes, processed, and then rejoined later. Flat file fun. – infixed Mar 22 '16 at 19:50
  • Thanks @zdim. This was very useful. Almost there. If this questions is not on hold anymore I'll post the answer... – Sander W. van der Laan Mar 24 '16 at 09:44
  • Thanks @zdim. This was very useful. Almost there. It throws an error at the line where it states `while ($cnt++ < $size2)`. The error is: `Argument "" isn't numeric in numeric lt (<) `. How do I interpret this? If this questions is not on hold anymore I'll post the answer... – Sander W. van der Laan Mar 24 '16 at 09:50
  • Also when I write this: `my $size1 = (split '\s+', `wc -l $file1`)[0]; print STDERR "* The size of file1 ($file1) is: $size1.\n"; print STDERR "\n"; my $size2 = (split '\s+', `wc -l $file2`)[0]; print STDERR "* The size of file2 ($file2) is: $file2.\n";`, the output is: `* The size of file1 (file1.txt) is: . * The size of file2 (file2.txt) is: file2.txt.`. I interpreted $size1 as a counting of lines. Is that not the case? – Sander W. van der Laan Mar 24 '16 at 10:02
  • 1
    @SanderW.vanderLaan Updated: column 1 is compared directly and not to a fixed key (right?), and files are processed differently, `file1` being so much smaller. Would've helped having this -- post carefully please. Also, there is no need to first get sizes, so external commands are gone. – zdim Mar 24 '16 at 20:33
1

When run, the following script produces the output shown under the Output heading. So long as file1 does not have too many lines, I think you'll find it quite satisfactory, in that memory requirements basically only depend on the space required to hold the first three columns in file1, and there is no dependence on any sorting:

#!/bin/bash

function file1 {
cat<<EOF
a b c
d e f
e x y
g h i
EOF
}

function file2 {
cat<<EOF
a 2 3 many columns one
e 4 5 many columns two
h 6 7 many columns three
EOF
}

awk '
  NR==FNR { 
    if (NF==0) { next }
    col2[$1]=$2; col3[$1]=$3; 
    next
  }
  { if ($1 in col2) { $2=col2[$1]; $3=col3[$1] } }
  { print }      
' <(file1) <(file2)

Output:

a b c many columns one
e x y many columns two
h 6 7 many columns three
peak
  • 105,803
  • 17
  • 152
  • 177