Find only those IDs from a list that are present in both of 2 data files

Question

I have a text file IDs.txt containing one unique ID string per line e.g.:

foo
bar
someOtherID

I know that some of these IDs are found in one or both of 2 other files, 1.txt and 2.txt, whose data lines are formatted differently:

1.txt
id=foo
name=example
age=81
end
id=notTheIDYouAreLookingFor
name=other
age=null

2.txt
<Data>
<ID>foo</ID>
<Stuff>Some things</Stuff>
</Data>
<Data>
<ID>bar</ID>
<Stuff>Other things</Stuff>
</Data>

The specific data formats are not important, since all I need to answer is "which IDs are in both?", and ideally I need a format-independent solution.

In the example I want to find the lines with foo:

<ID>foo</ID> id=foo

Effectively: this question but grepping the large list of IDs against 2 files instead of 1 and finding the common hits.

`s/The specific data formats are not important/The specific data formats are all-important/`. You can't write a tool that magically knows that within some file some string `foo` is an ID as opposed to a name or Stuff or a tag instead of a value or anything else without knowing/parsing the format of that file. — Ed Morton, Feb 25 '19 at 23:11
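A small sketch of the point this comment makes, using a hypothetical 1.txt in which bar occurs only as a name. A format-blind substring search reports bar as a hit, while extracting the id= values first does not. The file contents and the variable names naive and aware are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical demo data in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' foo bar > IDs.txt
# "bar" appears only as a name here, never as an id.
printf '%s\n' id=foo name=bar age=81 end > 1.txt

# Format-blind: any listed string found anywhere on a line counts as a hit.
naive=$(grep -oFf IDs.txt 1.txt | sort -u | paste -s -d ' ' -)

# Format-aware: only the value of an id= line counts.
aware=$(sed -n 's/^id=//p' 1.txt | grep -Fxf IDs.txt | sort -u | paste -s -d ' ' -)

echo "format-blind hits: $naive"
echo "format-aware hits: $aware"
```

The only difference between the two pipelines is the sed step that encodes knowledge of where an ID lives in this particular format.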

score 1 · Answer 1 · answered Feb 25 '19 at 15:53


Since you just want to find the IDs present in both files (f1 and f2), you don't have to parse IDs.txt:

awk 'NR==FNR{a["<ID>"$1"</ID>"]="id="$1;next}
    a[$0]{print $0,a[$0]}' <(grep -oP 'id=\K.*' f1) f2

The one-liner above outputs:

<ID>foo</ID> id=foo
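The one-liner reports every ID shared by f1 and f2, whether or not it is listed in IDs.txt. If the list should act as a filter as well, one possible variation reads IDs.txt first. This is a sketch rather than the original answer: the pass counter and the array names want and seen are assumptions, and the sample files are recreated only so the snippet runs on its own. It sticks to plain awk, so GNU grep's -P option is not needed.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' foo bar someOtherID > IDs.txt
printf '%s\n' id=foo name=example age=81 end \
              id=notTheIDYouAreLookingFor name=other age=null > f1
printf '%s\n' '<Data>' '<ID>foo</ID>' '<Stuff>Some things</Stuff>' '</Data>' \
              '<Data>' '<ID>bar</ID>' '<Stuff>Other things</Stuff>' '</Data>' > f2

result=$(awk '
    FNR == 1   { pass++ }                  # a new input file starts
    pass == 1  { want[$0]; next }          # IDs.txt: remember the wanted IDs
    pass == 2  {                           # f1: keep wanted IDs found on id= lines
        if (sub(/^id=/, "") && ($0 in want))
            seen["<ID>" $0 "</ID>"] = "id=" $0
        next
    }
    $0 in seen { print $0, seen[$0] }      # f2: print the pair for each shared ID
' IDs.txt f1 f2)

printf '%s\n' "$result"
```

Note that the FNR == 1 counter assumes none of the input files is empty.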


Kent


But that is not data format independent, is it? I need to configure it to whatever markup pattern is in my data files, correct? – Michael Feb 25 '19 at 15:57

@Michael in your question you said "data formats are not important". That cannot be true. Otherwise, how could you tell where the ID value is located in a file? We parse JSON, XML, text, CSV... formats with different logic. You cannot say, "here is my file, whatever its format, give me the ID value." – Kent Feb 25 '19 at 16:03

James Brown · Answer 2 · 2019-02-25T18:47:33.263

Here is one for GNU awk, far from perfect:

$ awk '
NR==FNR {                                      # store file1 entries to a[1]
    a[ARGIND][$0]
    next
}
match($0,/([iI][dD][>=])([^<]+)/,arr) {        # capture the value after "id=" or between "<ID>" and "<"
    a[ARGIND][arr[2]]=$0                       # store the whole record, keyed on that value
}
END {
    for(i in a[1])                             # get keywords from first file
        if((i in a[2]) && (i in a[3]))         # if found in files 2 and 3
            print a[2][i],a[3][i]              # output
}' file1 file2 file3

Output:

id=foo <ID>foo</ID>
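ARGIND, arrays of arrays and the three-argument match() are all gawk extensions. For systems without GNU awk, a rough portable equivalent might look like the following. It is a sketch under assumptions, not a drop-in replacement: the counter fileno and the arrays want and line are invented names, and the sample files are recreated so the snippet runs on its own. It inherits the same "far from perfect" pattern matching.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' foo bar someOtherID > IDs.txt
printf '%s\n' id=foo name=example age=81 end \
              id=notTheIDYouAreLookingFor name=other age=null > 1.txt
printf '%s\n' '<Data>' '<ID>foo</ID>' '<Stuff>Some things</Stuff>' '</Data>' \
              '<Data>' '<ID>bar</ID>' '<Stuff>Other things</Stuff>' '</Data>' > 2.txt

result=$(awk '
    FNR == 1    { fileno++ }                       # stands in for ARGIND
    fileno == 1 { want[$0]; next }                 # keywords from the first file
    match($0, /[iI][dD][>=][^<]+/) {
        key = substr($0, RSTART + 3, RLENGTH - 3)  # drop the 3-character "id=" or "ID>" prefix
        if (key in want)
            line[fileno, key] = $0                 # store the whole record per file and key
    }
    END {
        for (k in want)
            if (((2, k) in line) && ((3, k) in line))
                print line[2, k], line[3, k]
    }
' IDs.txt 1.txt 2.txt)

printf '%s\n' "$result"
```

The (file, key) subscripts emulate gawk's a[ARGIND][key] with the classic comma-joined multidimensional keys.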

score 0 · Answer 3 · answered Feb 25 '19 at 23:25

I'm not an awk expert so I tend to break things into chunks when a one-liner might do.

I'm going to assume that you've taken to heart the earlier comment that a simple format-independent solution is unlikely. Instead, I've taken the approach of documenting the format inside the script, and normalizing the two input formats. If a third format appears, then just modify the script to document and normalize that new format.

$ cat << 'EOF' > work.sh
#!/usr/bin/env bash

# 1.txt has IDs in the form id=....

grep -x 'id=.*' 1.txt | sed -e 's/^id=//' | sort > 1.txt.ids

# 2.txt has IDs in the form <ID>...</ID>

grep -x '^<ID>.*</ID>' 2.txt | sed -Ee 's-^<ID>(.*)</ID>-\1-' | sort > 2.txt.ids

comm -12 1.txt.ids 2.txt.ids | grep -Fxf IDs.txt
EOF

The first grep command extracts lines from 1.txt that are entirely composed of 'id=something', then strips off the 'id=' and sorts them into file 1.txt.ids.

The second grep does a similar thing for lines from 2.txt that are entirely composed of '<ID>something</ID>', then strips off the open and closing ID tags, and sorts the ids into 2.txt.ids.

comm is then used to show only the lines that appear in both files, and the output of comm is further filtered by IDs.txt, which is the list of specific IDs you're interested in.

$ cat 1.txt  
id=foo
name=example
age=81
end
id=notTheIDYouAreLookingFor
name=other
age=null
$ cat 2.txt
<Data>
<ID>foo</ID>
<Stuff>Some things</Stuff>
</Data>
<Data>
<ID>bar</ID>
<Stuff>Other things</Stuff>
</Data>
$ cat IDs.txt
foo
bar
someOtherID
$ bash work.sh
foo
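A variation on the same normalize-then-intersect idea that skips the intermediate .ids files. This is only a sketch: the function names ids_from_1 and ids_from_2 are hypothetical, and it swaps comm for sort | uniq -d. That works because each extracted list is de-duplicated first, so any ID appearing twice in the combined stream must come from both files.

```shell
#!/usr/bin/env bash
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' foo bar someOtherID > IDs.txt
printf '%s\n' id=foo name=example age=81 end \
              id=notTheIDYouAreLookingFor name=other age=null > 1.txt
printf '%s\n' '<Data>' '<ID>foo</ID>' '<Stuff>Some things</Stuff>' '</Data>' \
              '<Data>' '<ID>bar</ID>' '<Stuff>Other things</Stuff>' '</Data>' > 2.txt

# One extractor per documented format; add another function for a third format.
ids_from_1() { sed -n 's/^id=//p' 1.txt; }                    # id=...
ids_from_2() { sed -n 's-^<ID>\(.*\)</ID>$-\1-p' 2.txt; }     # <ID>...</ID>

# De-duplicate each list, merge them, keep IDs seen twice, then filter by IDs.txt.
common=$({ ids_from_1 | sort -u; ids_from_2 | sort -u; } | sort | uniq -d | grep -Fxf IDs.txt)

printf '%s\n' "$common"
```

The -F in the final grep treats each line of IDs.txt as a literal string, which matters if an ID contains regex metacharacters such as a dot.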
