Using "comm" to find matches between two arrays

Question

I have two arrays and am trying to find matching values using comm. Each element of Array1 contains some additional information that I strip out for the comparison. However, I would like to keep that information after the comparison is complete.

For example:

Array1=("abc",123,"hello" "def",456,"world")
Array2=("abc")
declare -a Array1
declare -a Array2

I then compare the two arrays:

oldIFS=$IFS IFS=$'\n\t'
array3=($(comm -12 <(echo "${Array1[*]}" | awk -F "," {'print $1'} | sort) <(echo "${Array2[*]}" | sort)))
IFS=$oldIFS

This finds the match, abc:

echo ${array3[0]}
abc

However, what I want is the whole matching element from Array1, including the values that were not part of my comm statement:

abc,123,hello

EDIT: For more clarification

The arrays in this example are populated with dummy data.

My real example pulls information from server logs, which I save into array1. array1 contains (userIDs,hostIPs,count), which I want to cross-reference against a list of user IDs (array2). My goal is to find out which user IDs exist in both array1 and array2, and to save those IDs with the additional information from array1 (hostIPs,count) into array3.

array1 is populated from a variable that holds the results of a curl command that generates a Splunk search. The data returned looks like this:

"uniqueID=<ID>","<IP>","<hostname>",1

I save the results of the Splunk report as $splunk, and then declare array1 with the contents of $splunk minus the header line, since the results come back in CSV format:

array1=( $(echo $splunk | sed 's/ /\n/g' | sed 1d) )

array2 is generated from a master file that I have stored locally, which contains all the application IDs in our ecosystem. For example:

uid=<ID>

I cat the contents of the master file into array2:

array2=( $(cat master.txt) )

I then want to find which IDs from array1 exist in array2 and save them as array3. This requires some massaging of the data in array1 to make it match the format of array2.

oldIFS=$IFS IFS=$'\n\t'
array3=($(comm -12 <(echo "${array1[*]}" | sed 's/ /\n/g' | awk -F "\"," {'print $1'} | sed 's/\"//g' | sed 's/|/ /g' | awk -F$'=' -v OFS=$'=' '{ $1 = "uid" }1' | grep -i "OU=People" | sed 's/OU/ou/g' | sort) <(echo "${array2[*]}" | sort)))
IFS=$oldIFS

array3 will then contain the lines that match in both arrays:

uid=<ID>
uid=<ID>

However, I am looking for something more along the lines of:

"uid=<ID>","<IP>","<hostname>",1
"uid=<ID>","<IP>","<hostname>",1

why not `"def",456,"world"` as well? Also what prevents you using `Array1`? — karakfa, Sep 25 '19 at 14:08
I mean, how exactly do you assign to the dummy arrays? It's not clear if you assume elements are comma or space separated. — Benjamin W., Sep 25 '19 at 14:21
@BenjaminW. I save the arrays from the values of variables that are declared earlier in my script array1=( $(echo $logdata ) ) ; array2=( $(echo $userIDs) ) — Sudosu0, Sep 25 '19 at 14:23
Can you edit the question with commands such that the arrays are created with the exact dummy data your question refers to? — Benjamin W., Sep 25 '19 at 14:27
Using `comm` for this seems weird. A common technique is to use Awk to join information from two files which contain different data but the same keys. A starting point is https://stackoverflow.com/questions/13272717/inner-join-on-two-text-files — tripleee, Sep 25 '19 at 14:47
Sorry if I wasn't clear. I meant you should edit the question to make it such that I can test here. Instead of `Array1{ "abc",123,"hello" "def",456,"world" }` something like `Array1=(abc,123,hello def,456,world)` (or should it be `(abc 123 hello def 456 world)`?). With your notation, I don't know what *really* is in the arrays. — Benjamin W., Sep 25 '19 at 14:51

Benjamin W. · Accepted Answer · 2019-09-25T15:19:19.673

I would do it like this:

join -t, \
    <(printf '%s\n' "${Array1[@]}" | sort -t, -k1,1) \
    <(printf '%s\n' "${Array2[@]}" | sort)

Use the join command with , as the field delimiter. The first "file" is the first array, one element per line, sorted on the first field (comma delimited); the second "file" is the second array, one element per line, sorted.

The output will be every line where the first element of the first file matches the element from the second file; for the example input it's

abc,123,hello
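
To try this end to end, here is a minimal sketch that assumes the dummy arrays hold one comma-separated record per element (the variable name result is only for illustration):

```shell
#!/usr/bin/env bash
# Dummy data as in the question: one comma-separated record per element.
Array1=("abc,123,hello" "def,456,world")
Array2=("abc")

# join expects both inputs to be sorted on the join field (field 1 here).
result=$(
    join -t, \
        <(printf '%s\n' "${Array1[@]}" | sort -t, -k1,1) \
        <(printf '%s\n' "${Array2[@]}" | sort)
)

printf '%s\n' "$result"
```

With this data the script prints abc,123,hello, because join outputs the join field followed by the remaining fields of the first input.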

This makes only one assumption, namely that no array element contains a newline. To make it more robust (assuming GNU Coreutils), we can use NUL as the delimiter:

join -z -t, \
    <(printf '%s\0' "${Array1[@]}" | sort -z -t, -k1,1) \
    <(printf '%s\0' "${Array2[@]}" | sort -z)

This prints the output separated by NUL as well; to read the result into an array, we can use readarray:

readarray -d '' -t Array3 < <(
    join -z -t, \
        <(printf '%s\0' "${Array1[@]}" | sort -z -t, -k1,1) \
        <(printf '%s\0' "${Array2[@]}" | sort -z)
)

readarray -d requires Bash 4.4 or newer. For older Bash, you can use a loop:

while IFS= read -r -d '' element; do
    Array3+=("$element")
done < <(
    join -z -t, \
        <(printf '%s\0' "${Array1[@]}" | sort -z -t, -k1,1) \
        <(printf '%s\0' "${Array2[@]}" | sort -z)
)
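
If sorting is undesirable, for example because the original order of Array1 should be kept, an Awk lookup along the lines suggested in the comments is an alternative to join. This is a different technique from the one above, shown only as a sketch; the sample data and the awk array name wanted are made up for illustration, and it assumes no element contains a newline:

```shell
#!/usr/bin/env bash
# Hypothetical sample data in the same shape as the question's dummy arrays.
Array1=("abc,123,hello" "def,456,world" "ghi,789,again")
Array2=("ghi" "abc")

Array3=()
while IFS= read -r line; do
    Array3+=("$line")
done < <(
    # First input (Array2): remember each ID as a key of "wanted".
    # Second input (Array1): print records whose first field is a known ID.
    awk -F, 'NR == FNR { wanted[$0]; next } $1 in wanted' \
        <(printf '%s\n' "${Array2[@]}") \
        <(printf '%s\n' "${Array1[@]}")
)

printf '%s\n' "${Array3[@]}"
```

Here Array3 ends up with abc,123,hello and ghi,789,again, in the order they appear in Array1, and neither input has to be sorted.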

Bayou · Answer 2 · 2019-09-25T14:57:18.703

I don't know how to do this with comm, but I do have a solution for you with sed and grep. The following commands match on the regex `uid=X,`, where the second input is either a string of the form uid=x or an array of the form (uid=x uid=y).

# Array 2 (B) is a string
$ A=("uid=1,10.10.10.1,server1,1" "uid=2,10.10.10.2,server2,1")
$ B="uid=1"
$ echo ${A[@]} | grep -oE "([^ ]*${B},[^ ]*)"
uid=1,10.10.10.1,server1,1

# Array 2 (D) is an array
$ C=(${A[@]} "uid=3,10.10.10.3,server3,1" "uid=4,10.10.10.4,server4,1")
$ D=(${B} "uid=3")
$ echo ${C[*]} | grep -oE "([^ ]*($(echo ${D[@]} | sed 's/ /,|/g'))[^ ]*)"
uid=1,10.10.10.1,server1,1
uid=3,10.10.10.3,server3,1

# Content of arrays
$ echo ${A[@]}
uid=1,10.10.10.1,server1,1 uid=2,10.10.10.2,server2,1
$ echo ${B}
uid=1
$ echo ${C[@]}
uid=1,10.10.10.1,server1,1 uid=2,10.10.10.2,server2,1 uid=3,10.10.10.3,server3,1 uid=4,10.10.10.4,server4,1
$ echo ${D[@]}
uid=1 uid=3
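
A possible variation on the commands above, as a sketch only: printing one element per line and anchoring the pattern at the start of the line avoids relying on spaces as separators and keeps an element such as guid=1,... from matching uid=1. The sample data and the names pattern and matches are hypothetical, and the sketch assumes the IDs contain no regex metacharacters or newlines:

```shell
#!/usr/bin/env bash
# Hypothetical sample data; the second element must NOT match uid=1.
C=("uid=1,10.10.10.1,server1,1" "guid=1,10.10.10.2,server2,1" "uid=3,10.10.10.3,server3,1")
D=("uid=1" "uid=3")

# Join the IDs with "|" to build an alternation such as uid=1|uid=3.
pattern=$(IFS='|'; printf '%s' "${D[*]}")

# One element per line; the ID must start the line and be followed by a comma.
matches=$(printf '%s\n' "${C[@]}" | grep -E "^(${pattern}),")

printf '%s\n' "$matches"
```

This prints the uid=1 and uid=3 records and skips the guid=1 one.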
