I am trying to use awk array keys for a comparison.
I used the following line in a different context.
The idea is to reuse the same method, this time for something like a positive match (keeping the lines that match instead of excluding them), and to write the result to a new file:
awk 'NR==FNR{a[$1,$2,$3]; next} !(($1,$2,$3) in a)' file2 file1
This time, the columns are not directly comparable:
file_A.txt
chr1 1 10000
chr2 2 500
chr3 1 20000
chr1 10 15
file_B.narrowPeak
abs42322 chr1 25 15000
rvy42134 chr2 1 400
ttx24124 chr3 1 20000
sadas664 chr1 3 14
Columns 1, 3 and 4 of file B have to be ignored for this first comparison. I want to keep all lines of file B whose column 2 matches a string in column 1 of file A.
Columns 2 and 3 of file A and columns 3 and 4 of file B are ranges. In this example, the start of the first range is 1 and the end is 10000, the second range is 1 to 400, and so on... The next step should filter out lines of file B whose range is not included in one of the ranges of file A, comparing only against the file A lines that matched in the first step.
Example:
Line 1 of file B is compared with lines 1 and 4 of file A, because of chr1. The range 25-15000 extends beyond both 1-10000 and 10-15, so this line is filtered out.
Line 3 of file B is compared with line 3 of file A, because of chr3. The range 1-20000 is included in (here equal to) 1-20000, so this line is stored in the output file.
Output file
ttx24124 chr3 1 20000
sadas664 chr1 3 14
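For reference, the two steps described above (match on the name, then check that the range is contained) could be sketched in awk roughly as follows. This is only a sketch under some assumptions: fields are whitespace-separated, "included" means start >= start and end <= end (so equal ranges count), and the function name filter_contained and the array names n, lo and hi are made up for illustration.

```shell
# Hypothetical helper: $1 = file A (name start end), $2 = file B (id name start end ...)
filter_contained() {
    awk '
        # First file (file A): remember every range, grouped by the name in column 1.
        NR == FNR {
            n[$1]++
            lo[$1, n[$1]] = $2
            hi[$1, n[$1]] = $3
            next
        }
        # Second file (file B): column 2 is the name, columns 3 and 4 are the range.
        # Print the line if the range lies inside at least one file A range of that name.
        {
            for (i = 1; i <= n[$2]; i++)
                if ($3 + 0 >= lo[$2, i] + 0 && $4 + 0 <= hi[$2, i] + 0) {
                    print
                    break
                }
        }
    ' "$1" "$2"
}
```

It would be called as filter_contained file_A.txt file_B.narrowPeak > output.txt. Names in file B that never occur in file A have no stored ranges, so the loop does not run and those lines are dropped. Every file B line is checked against all file A ranges with the same name, which may get slow if one name has a very large number of ranges.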
Edit: The real data look like this. In reality, the files are much longer, so column 2 of file B is much more diverse than shown below.
File A
chr2 16148738 89330679
chr2 10845 16143362
chr2 94570062 106475164
chr2 99510860 113404812
chr2 86925269 87697988
chr2 91415844 91839817
chr9 64343270 64801485
chr9 65740027 66179306
chr1 144610018 144888777
chr2 95802871 108756829
chr16_KI270728v1_random 173055 1246276
chr9 63252862 63477334
chr2_KI270774v1_alt 0 188910
chr1_KI270712v1_random 7198 176043
chr9 63008373 63202857
chr2_GL383521v1_alt 0 143390
chr2 89530679 89663939
chr2 90236570 90402011
chr2_KI270894v1_alt 42931 213658
chr1 143320490 144356500
chr2 108732003 109758895
chr2_KI270770v1_alt 8875 136240
chr9 65130082 65281495
chr2 89767603 89960747
chr2_KI270769v1_alt 0 116362
chr2 94187600 94293015
chr9 40238354 40677933
chr2_KI270772v1_alt 1330 133041
chr16 33932082 34096118
chr13 18259709 18357163
chr14_KI270725v1_random 22583 138472
chr16 34779380 34943880
chr7 60892044 60992155
chr2_KI270773v1_alt 0 70886
chr2 110445435 110530068
chr9 43236167 43304276
chr22 10628203 10690626
chr2 87340235 87402756
chr21 8651170 8706715
chrUn_KI270744v1 38861 105138
chr2 110395939 110441265
chr2 109930242 109975557
chr1 143315267 144153087
chr17 26716619 26775606
File B
HumanGM18558_peak_1 chr1 9997 10330 150 . 10.78887 18.86368 15.08777 100
HumanGM18558_peak_2 chr1 628885 635117 2509 . 83.77238 255.95094 250.99944 5270
HumanGM18558_peak_3 chr1 1250086 1250413 94 . 8.25031 13.14358 9.49110 143
HumanGM18558_peak_4 chr1 1724342 1724642 56 . 6.34639 9.18460 5.65124 88
HumanGM18558_peak_5 chr1 8629404 8629679 56 . 6.34639 9.18460 5.65124 180
HumanGM18558_peak_6 chr1 9181157 9181438 56 . 6.34639 9.18460 5.65124 65
HumanGM18558_peak_7 chr1 9626296 9626600 56 . 6.34639 9.18460 5.65124 247
HumanGM18558_peak_8 chr1 11908028 11908531 341 . 18.40454 38.14250 34.12190 246
HumanGM18558_peak_9 chr1 11909636 11910042 81 . 7.61567 11.78841 8.17169 150
HumanGM18558_peak_10 chr1 15966215 15966638 81 . 7.61567 11.78841 8.17169 200
HumanGM18558_peak_11 chr1 16513837 16514451 591 . 27.28949 63.33707 59.13566 271
HumanGM18558_peak_12 chr1 16613629 16613934 81 . 7.61567 11.78841 8.17169 103
HumanGM18558_peak_13 chr1 16644496 16644800 68 . 6.98103 10.46777 6.88890 191
HumanGM18558_peak_14 chr1 16666545 16667135 291 . 16.50062 33.08122 29.10692 306
HumanGM18558_peak_15 chr1 16740126 16740977 307 . 17.13526 34.75273 30.76194 453
HumanGM18558_peak_16 chr1 16895871 16896489 517 . 24.75093 55.90571 51.76084 254
HumanGM18558_peak_17 chr1 16905126 16905616 242 . 14.59670 28.16750 24.24907 224
HumanGM18558_peak_18 chr1 21294320 21294624 81 . 7.61567 11.78841 8.17169 161
HumanGM18558_peak_19 chr1 24744867 24745154 68 . 6.98103 10.46777 6.88890 136
HumanGM18558_peak_20 chr1 24900187 24900971 94 . 8.25031 13.14358 9.49110 526
HumanGM18558_peak_21 chr1 24930434 24930704 56 . 6.34639 9.18460 5.65124 209
HumanGM18558_peak_22 chr1 25022463 25022733 81 . 7.61567 11.78841 8.17169 177
HumanGM18558_peak_23 chr1 25998134 25998419 68 . 6.98103 10.46777 6.88890 96
HumanGM18558_peak_24 chr1 26541891 26542188 68 . 6.98103 10.46777 6.88890 86
HumanGM18558_peak_25 chr1 26744090 26744360 81 . 7.61567 11.78841 8.17169 163
HumanGM18558_peak_26 chr1 26890007 26890277 44 . 5.71175 7.94242 4.46638 52
HumanGM18558_peak_27 chr1 27322070 27322340 56 . 6.34639 9.18460 5.65124 136
HumanGM18558_peak_28 chr1 27631584 27631967 108 . 8.88495 14.53075 10.84614 241
HumanGM18558_peak_29 chr1 27884095 27884365 56 . 6.34639 9.18460 5.65124 170
HumanGM18558_peak_30 chr1 28510350 28510620 68 . 6.98103 10.46777 6.88890 238
HumanGM18558_peak_31 chr1 28510787 28511122 56 . 6.34639 9.18460 5.65124 109
HumanGM18558_peak_32 chr1 28648490 28649063 307 . 17.13526 34.75273 30.76194 238
HumanGM18558_peak_33 chr1 28736505 28736783 68 . 6.98103 10.46777 6.88890 135
HumanGM18558_peak_34 chr1 31431897 31432219 56 . 6.34639 9.18460 5.65124 84
HumanGM18558_peak_35 chr1 31944389 31944659 56 . 6.34639 9.18460 5.65124 42
HumanGM18558_peak_36 chr1 32250032 32250320 56 . 6.34639 9.18460 5.65124 42
HumanGM18558_peak_37 chr1 37477246 37477607 94 . 8.25031 13.14358 9.49110 211
HumanGM18558_peak_38 chr1 37989885 37990303 122 . 9.51959 15.94772 12.23132 244
HumanGM18558_peak_39 chr1 39026095 39026365 68 . 6.98103 10.46777 6.88890 108
HumanGM18558_peak_40 chr1 40668966 40669236 56 . 6.34639 9.18460 5.65124 77
HumanGM18558_peak_41 chr1 44721466 44721913 258 . 15.23134 29.78794 25.84961 210
HumanGM18558_peak_42 chr1 44730832 44731120 94 . 8.25031 13.14358 9.49110 172
HumanGM18558_peak_43 chr1 44819632 44819969 122 . 9.51959 15.94772 12.23132 169
HumanGM18558_peak_44 chr1 46132753 46133023 56 . 6.34639 9.18460 5.65124 233
HumanGM18558_peak_45 chr1 46331051 46331321 68 . 6.98103 10.46777 6.88890 141
HumanGM18558_peak_46 chr1 66282467 66282777 108 . 8.88495 14.53075 10.84614 140
HumanGM18558_peak_47 chr1 78004335 78004605 81 . 7.61567 11.78841 8.17169 128
HumanGM18558_peak_48 chr1 88684186 88684456 56 . 6.34639 9.18460 5.65124 62
HumanGM18558_peak_49 chr1 91387139 91387504 94 . 8.25031 13.14358 9.49110 129
HumanGM18558_peak_50 chr1 93079024 93079327 94 . 8.25031 13.14358 9.49110 182
HumanGM18558_peak_51 chr1 101235617 101235902 68 . 6.98103 10.46777 6.88890 121
HumanGM18558_peak_52 chr1 101407748 101408136 81 . 7.61567 11.78841 8.17169 246
HumanGM18558_peak_53 chr1 109099999 109100368 122 . 9.51959 15.94772 12.23132 222
HumanGM18558_peak_54 chr1 109984498 109984792 81 . 7.61567 11.78841 8.17169 107
HumanGM18558_peak_55 chr1 110902916 110903186 56 . 6.34639 9.18460 5.65124 92
HumanGM18558_peak_56 chr1 111215999 111216474 108 . 8.88495 14.53075 10.84614 257
HumanGM18558_peak_57 chr1 111221711 111222087 68 . 6.98103 10.46777 6.88890 152
HumanGM18558_peak_58 chr1 113904864 113905420 81 . 7.61567 11.78841 8.17169 258
HumanGM18558_peak_59 chr1 116504467 116504737 68 . 6.98103 10.46777 6.88890 165
HumanGM18558_peak_60 chr1 116558228 116558508 94 . 8.25031 13.14358 9.49110 175
HumanGM18558_peak_61 chr1 120850520 120851089 481 . 23.48165 52.25492 48.13765 265
HumanGM18558_peak_62 chr1 125069249 125069729 122 . 9.51959 15.94772 12.23132 240
HumanGM18558_peak_63 chr1 125080252 125080535 44 . 5.71175 7.94242 4.46638 150
HumanGM18558_peak_64 chr1 125080944 125081214 44 . 5.71175 7.94242 4.46638 181
HumanGM18558_peak_65 chr1 125166080 125168950 762 . 33.00124 80.62179 76.28172 1813
HumanGM18558_peak_66 chr1 125168955 125169667 68 . 6.98103 10.46777 6.88890 462
HumanGM18558_peak_67 chr1 125169674 125170842 392 . 20.30845 43.33632 39.27747 271
HumanGM18558_peak_68 chr1 125170903 125171408 195 . 12.69278 23.42019 19.56689 240
HumanGM18558_peak_69 chr1 125173576 125174604 195 . 12.69278 23.42019 19.56689 561
HumanGM18558_peak_70 chr1 125175148 125176443 427 . 21.57773 46.86636 42.78468 916
HumanGM18558_peak_71 chr1 125176541 125184739 4637 . 138.35135 469.20218 463.71423 3666
HumanGM18558_peak_72 chr1 143184419 143188606 1999 . 69.81032 204.82724 199.97639 690
HumanGM18558_peak_73 chr1 143188729 143198082 3304 . 104.71547 335.55066 330.42758 4947
HumanGM18558_peak_74 chr1 143198227 143204460 2867 . 93.29197 291.73703 286.70563 4484
HumanGM18558_peak_75 chr1 143204483 143204990 150 . 10.78887 18.86368 15.08777 256
HumanGM18558_peak_76 chr1 143205353 143208069 2675 . 88.21485 272.56412 267.57269 950
HumanGM18558_peak_77 chr1 143208226 143210053 358 . 19.03918 39.85970 35.82584 1250
HumanGM18558_peak_78 chr1 143210072 143225450 4051 . 123.75465 410.42435 405.11169 4606
HumanGM18558_peak_79 chr1 143225537 143226480 226 . 13.96206 26.56550 22.66770 496
HumanGM18558_peak_80 chr1 143226822 143242516 2771 . 90.75341 282.12637 277.11282 6269