
I am trying to use awk array keys for a comparison.
I used the following line in a different context. The idea is to apply the same method again, this time for a positive, regular-expression-like match, and to write the result to a new file:

awk 'NR==FNR{a[$1,$2,$3]; next} !(($1,$2,$3) in a)' file2 file1

This time, the columns are not directly comparable:

file_A.txt 

chr1    1        10000
chr2    2        500
chr3    1        20000
chr1    10       15

file_B.narrowPeak

abs42322   chr1    25       15000
rvy42134   chr2    1        400
ttx24124   chr3    1        20000
sadas664   chr1    3        14
  1. Columns 1, 3 and 4 of file B have to be ignored for the comparison. I want to keep all lines of file B whose column 2 matches a string in column 1 of file A.

  2. Columns 2 and 3 of file A and columns 3 and 4 of file B are ranges. In this example, the first range in file A starts at 1 and ends at 10000, the second range in file B is 1 to 400, and so on. The next step should filter out the lines of file B whose range is not contained in one of the ranges of file A, comparing only the lines that matched in the first step.

Example: Line 1 of file B is compared with lines 1 and 4 of file A, because of chr1. The range 25-15000 is not contained in either 1-10000 or 10-15, so this line is filtered out. Line 3 of file B is compared with line 3 of file A, because of chr3. The range 1-20000 is contained in (here equal to) 1-20000, so this line is stored in the output file.

Output file

ttx24124   chr3    1        20000
sadas664   chr1    3        14

Edit: The real data look like this. In reality, the files are much longer, so column 2 is much more diverse than shown below.

File A

chr2 16148738 89330679
chr2 10845 16143362
chr2 94570062 106475164
chr2 99510860 113404812
chr2 86925269 87697988
chr2 91415844 91839817
chr9 64343270 64801485
chr9 65740027 66179306
chr1 144610018 144888777
chr2 95802871 108756829
chr16_KI270728v1_random 173055 1246276
chr9 63252862 63477334
chr2_KI270774v1_alt 0 188910
chr1_KI270712v1_random 7198 176043
chr9 63008373 63202857
chr2_GL383521v1_alt 0 143390
chr2 89530679 89663939
chr2 90236570 90402011
chr2_KI270894v1_alt 42931 213658
chr1 143320490 144356500
chr2 108732003 109758895
chr2_KI270770v1_alt 8875 136240
chr9 65130082 65281495
chr2 89767603 89960747
chr2_KI270769v1_alt 0 116362
chr2 94187600 94293015
chr9 40238354 40677933
chr2_KI270772v1_alt 1330 133041
chr16 33932082 34096118
chr13 18259709 18357163
chr14_KI270725v1_random 22583 138472
chr16 34779380 34943880
chr7 60892044 60992155
chr2_KI270773v1_alt 0 70886
chr2 110445435 110530068
chr9 43236167 43304276
chr22 10628203 10690626
chr2 87340235 87402756
chr21 8651170 8706715
chrUn_KI270744v1 38861 105138
chr2 110395939 110441265
chr2 109930242 109975557
chr1 143315267 144153087
chr17 26716619 26775606

File B

HumanGM18558_peak_1 chr1    9997    10330   150 .   10.78887    18.86368    15.08777    100
HumanGM18558_peak_2 chr1    628885  635117  2509    .   83.77238    255.95094   250.99944   5270
HumanGM18558_peak_3 chr1    1250086 1250413 94  .   8.25031 13.14358    9.49110 143
HumanGM18558_peak_4 chr1    1724342 1724642 56  .   6.34639 9.18460 5.65124 88
HumanGM18558_peak_5 chr1    8629404 8629679 56  .   6.34639 9.18460 5.65124 180
HumanGM18558_peak_6 chr1    9181157 9181438 56  .   6.34639 9.18460 5.65124 65
HumanGM18558_peak_7 chr1    9626296 9626600 56  .   6.34639 9.18460 5.65124 247
HumanGM18558_peak_8 chr1    11908028    11908531    341 .   18.40454    38.14250    34.12190    246
HumanGM18558_peak_9 chr1    11909636    11910042    81  .   7.61567 11.78841    8.17169 150
HumanGM18558_peak_10    chr1    15966215    15966638    81  .   7.61567 11.78841    8.17169 200
HumanGM18558_peak_11    chr1    16513837    16514451    591 .   27.28949    63.33707    59.13566    271
HumanGM18558_peak_12    chr1    16613629    16613934    81  .   7.61567 11.78841    8.17169 103
HumanGM18558_peak_13    chr1    16644496    16644800    68  .   6.98103 10.46777    6.88890 191
HumanGM18558_peak_14    chr1    16666545    16667135    291 .   16.50062    33.08122    29.10692    306
HumanGM18558_peak_15    chr1    16740126    16740977    307 .   17.13526    34.75273    30.76194    453
HumanGM18558_peak_16    chr1    16895871    16896489    517 .   24.75093    55.90571    51.76084    254
HumanGM18558_peak_17    chr1    16905126    16905616    242 .   14.59670    28.16750    24.24907    224
HumanGM18558_peak_18    chr1    21294320    21294624    81  .   7.61567 11.78841    8.17169 161
HumanGM18558_peak_19    chr1    24744867    24745154    68  .   6.98103 10.46777    6.88890 136
HumanGM18558_peak_20    chr1    24900187    24900971    94  .   8.25031 13.14358    9.49110 526
HumanGM18558_peak_21    chr1    24930434    24930704    56  .   6.34639 9.18460 5.65124 209
HumanGM18558_peak_22    chr1    25022463    25022733    81  .   7.61567 11.78841    8.17169 177
HumanGM18558_peak_23    chr1    25998134    25998419    68  .   6.98103 10.46777    6.88890 96
HumanGM18558_peak_24    chr1    26541891    26542188    68  .   6.98103 10.46777    6.88890 86
HumanGM18558_peak_25    chr1    26744090    26744360    81  .   7.61567 11.78841    8.17169 163
HumanGM18558_peak_26    chr1    26890007    26890277    44  .   5.71175 7.94242 4.46638 52
HumanGM18558_peak_27    chr1    27322070    27322340    56  .   6.34639 9.18460 5.65124 136
HumanGM18558_peak_28    chr1    27631584    27631967    108 .   8.88495 14.53075    10.84614    241
HumanGM18558_peak_29    chr1    27884095    27884365    56  .   6.34639 9.18460 5.65124 170
HumanGM18558_peak_30    chr1    28510350    28510620    68  .   6.98103 10.46777    6.88890 238
HumanGM18558_peak_31    chr1    28510787    28511122    56  .   6.34639 9.18460 5.65124 109
HumanGM18558_peak_32    chr1    28648490    28649063    307 .   17.13526    34.75273    30.76194    238
HumanGM18558_peak_33    chr1    28736505    28736783    68  .   6.98103 10.46777    6.88890 135
HumanGM18558_peak_34    chr1    31431897    31432219    56  .   6.34639 9.18460 5.65124 84
HumanGM18558_peak_35    chr1    31944389    31944659    56  .   6.34639 9.18460 5.65124 42
HumanGM18558_peak_36    chr1    32250032    32250320    56  .   6.34639 9.18460 5.65124 42
HumanGM18558_peak_37    chr1    37477246    37477607    94  .   8.25031 13.14358    9.49110 211
HumanGM18558_peak_38    chr1    37989885    37990303    122 .   9.51959 15.94772    12.23132    244
HumanGM18558_peak_39    chr1    39026095    39026365    68  .   6.98103 10.46777    6.88890 108
HumanGM18558_peak_40    chr1    40668966    40669236    56  .   6.34639 9.18460 5.65124 77
HumanGM18558_peak_41    chr1    44721466    44721913    258 .   15.23134    29.78794    25.84961    210
HumanGM18558_peak_42    chr1    44730832    44731120    94  .   8.25031 13.14358    9.49110 172
HumanGM18558_peak_43    chr1    44819632    44819969    122 .   9.51959 15.94772    12.23132    169
HumanGM18558_peak_44    chr1    46132753    46133023    56  .   6.34639 9.18460 5.65124 233
HumanGM18558_peak_45    chr1    46331051    46331321    68  .   6.98103 10.46777    6.88890 141
HumanGM18558_peak_46    chr1    66282467    66282777    108 .   8.88495 14.53075    10.84614    140
HumanGM18558_peak_47    chr1    78004335    78004605    81  .   7.61567 11.78841    8.17169 128
HumanGM18558_peak_48    chr1    88684186    88684456    56  .   6.34639 9.18460 5.65124 62
HumanGM18558_peak_49    chr1    91387139    91387504    94  .   8.25031 13.14358    9.49110 129
HumanGM18558_peak_50    chr1    93079024    93079327    94  .   8.25031 13.14358    9.49110 182
HumanGM18558_peak_51    chr1    101235617   101235902   68  .   6.98103 10.46777    6.88890 121
HumanGM18558_peak_52    chr1    101407748   101408136   81  .   7.61567 11.78841    8.17169 246
HumanGM18558_peak_53    chr1    109099999   109100368   122 .   9.51959 15.94772    12.23132    222
HumanGM18558_peak_54    chr1    109984498   109984792   81  .   7.61567 11.78841    8.17169 107
HumanGM18558_peak_55    chr1    110902916   110903186   56  .   6.34639 9.18460 5.65124 92
HumanGM18558_peak_56    chr1    111215999   111216474   108 .   8.88495 14.53075    10.84614    257
HumanGM18558_peak_57    chr1    111221711   111222087   68  .   6.98103 10.46777    6.88890 152
HumanGM18558_peak_58    chr1    113904864   113905420   81  .   7.61567 11.78841    8.17169 258
HumanGM18558_peak_59    chr1    116504467   116504737   68  .   6.98103 10.46777    6.88890 165
HumanGM18558_peak_60    chr1    116558228   116558508   94  .   8.25031 13.14358    9.49110 175
HumanGM18558_peak_61    chr1    120850520   120851089   481 .   23.48165    52.25492    48.13765    265
HumanGM18558_peak_62    chr1    125069249   125069729   122 .   9.51959 15.94772    12.23132    240
HumanGM18558_peak_63    chr1    125080252   125080535   44  .   5.71175 7.94242 4.46638 150
HumanGM18558_peak_64    chr1    125080944   125081214   44  .   5.71175 7.94242 4.46638 181
HumanGM18558_peak_65    chr1    125166080   125168950   762 .   33.00124    80.62179    76.28172    1813
HumanGM18558_peak_66    chr1    125168955   125169667   68  .   6.98103 10.46777    6.88890 462
HumanGM18558_peak_67    chr1    125169674   125170842   392 .   20.30845    43.33632    39.27747    271
HumanGM18558_peak_68    chr1    125170903   125171408   195 .   12.69278    23.42019    19.56689    240
HumanGM18558_peak_69    chr1    125173576   125174604   195 .   12.69278    23.42019    19.56689    561
HumanGM18558_peak_70    chr1    125175148   125176443   427 .   21.57773    46.86636    42.78468    916
HumanGM18558_peak_71    chr1    125176541   125184739   4637    .   138.35135   469.20218   463.71423   3666
HumanGM18558_peak_72    chr1    143184419   143188606   1999    .   69.81032    204.82724   199.97639   690
HumanGM18558_peak_73    chr1    143188729   143198082   3304    .   104.71547   335.55066   330.42758   4947
HumanGM18558_peak_74    chr1    143198227   143204460   2867    .   93.29197    291.73703   286.70563   4484
HumanGM18558_peak_75    chr1    143204483   143204990   150 .   10.78887    18.86368    15.08777    256
HumanGM18558_peak_76    chr1    143205353   143208069   2675    .   88.21485    272.56412   267.57269   950
HumanGM18558_peak_77    chr1    143208226   143210053   358 .   19.03918    39.85970    35.82584    1250
HumanGM18558_peak_78    chr1    143210072   143225450   4051    .   123.75465   410.42435   405.11169   4606
HumanGM18558_peak_79    chr1    143225537   143226480   226 .   13.96206    26.56550    22.66770    496
HumanGM18558_peak_80    chr1    143226822   143242516   2771    .   90.75341    282.12637   277.11282   6269    
anubhava
  • 4
    What do you mean by `when their range is not included in one of the ranges of file A`? What are the ranges: `25-15000`, `1-400`, etc.? And you test them against ranges of the same id, e.g. `1-400` of `chr2` is tested against `2-500` of `chr2` and is not included, yes, but `25-150000` was also not included for `chr1`, yet you don't want it. – thanasisp Aug 31 '20 at 00:01
  • 3
    Your example data is adding confusion (I think) to explaining your problem. If you really want the values indented as shown, please add a note in the body of your Q that the data is really indented, else please update your data samples to show that they start at the left margin. Good luck! – shellter Aug 31 '20 at 00:40
  • You are right, I made a few mistakes in the examples. Is it clearer now? I am currently trying to upload two files, with the real input data. – Sebastian171296 Aug 31 '20 at 06:14
  • 1
    You try to explain something about `chr1` with "Line 1 of file B is compared with line 1 and 4 of file A, because of chr1. The range 25-15000 is bigger than 1-10000 and 10-15, so this line is filtered out.", but you are not explaining why the result has "sadas664 chr1 3 14". – Luuk Aug 31 '20 at 09:45

1 Answer


Convert your input to bed format. The 3 required fields are chromosome, start position and end position. The rest of the fields are optional. Then use bedtools intersect from the bedtools package. For example:

# Create input files:

cat > file_A.txt <<EOF
chr1    1        10000
chr2    2        500
chr3    1        20000
chr1    10       15
EOF

cat > file_B.narrowPeak <<EOF
abs42322   chr1    25       15000
rvy42134   chr2    1        400
ttx24124   chr3    1        20000
sadas664   chr1    3        14
EOF

# Convert to bed format:
perl -lane 'print join "\t", @F;' file_A.txt > file_A.bed

perl -lane 'print join "\t", @F[1, 2, 3];' file_B.narrowPeak > file_B.bed

# Find features in file_B.bed that are contained entirely in a feature of file_A.bed:
bedtools intersect -a file_B.bed -b file_A.bed -wa -f 1.0 > file_A_in_B.bed

Output:

chr3    1       20000
chr1    3       14

The bedtools intersect command is used with these options:

-wa : Write the original entry from the file given with -a (file_B.bed) for each overlap.
-f : Minimum overlap required, as a fraction of each feature in the file given with -a. A fraction of 1.0 ensures that 100% of each file_B.bed feature is included in a file_A.bed feature.
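
If bedtools is not available, the same containment test can also be sketched in plain awk, which is what the question originally asked for. This is only a minimal sketch under a few assumptions: both files are whitespace-separated, file A fits in memory, and the array names `n`, `lo` and `hi` as well as the output file name `contained.txt` are made up for illustration. Unlike the bed conversion above, it prints the original file B line, including the name in column 1.

```shell
# Recreate the small example files from the question.
printf '%s\n' 'chr1 1 10000' 'chr2 2 500' 'chr3 1 20000' 'chr1 10 15' > file_A.txt
printf '%s\n' 'abs42322 chr1 25 15000' 'rvy42134 chr2 1 400' \
              'ttx24124 chr3 1 20000' 'sadas664 chr1 3 14' > file_B.narrowPeak

awk '
    # First file (file A): remember every range, grouped by chromosome.
    NR == FNR {
        n[$1]++                   # number of ranges seen for this chromosome
        lo[$1, n[$1]] = $2 + 0    # range start, forced to a number
        hi[$1, n[$1]] = $3 + 0    # range end, forced to a number
        next
    }
    # Second file (file B): print the line once if its range (columns 3-4)
    # lies inside at least one file A range on the same chromosome (column 2).
    {
        for (i = 1; i <= n[$2]; i++) {
            if ($3 + 0 >= lo[$2, i] && $4 + 0 <= hi[$2, i]) {
                print
                break
            }
        }
    }
' file_A.txt file_B.narrowPeak > contained.txt
```

For the example files, `contained.txt` holds the `ttx24124` and `sadas664` lines. The loop scans every file A range of the matching chromosome for each file B line, so for very large inputs bedtools is likely the better choice.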

Timur Shtatland