I want to have a function that reports all strings, across two arrays, that are identical to one another.

ARRAY_A=("table" "dog" "bird" "caterpillar")
ARRAY_B=("cup" "door" "table" "cat")

for VALUE_B in "${ARRAY_B[@]}"
do
  [[ "${ARRAY_A[@]}" =~ ${VALUE_B} ]] && echo "${VALUE_B} is in both arrays"
done

This returns the output:

table is in both arrays
cat is in both arrays

The code is finding 'cat' in 'caterpillar' and returning that 'cat is in both arrays'. I only want the code to return 'cat is in both arrays' if the string is identical in both arrays ('cat':'cat').

While I could add in a nested for loop to go through and compare each value of ARRAY_A to each value in ARRAY_B, this solution would be inefficient on a larger scale. I believe there is a more efficient solution to this problem that I haven't realized yet.

Additionally, I'd be willing to settle for a solution that checks whether any string in ARRAY_A ends with a value from ARRAY_B (bobcat:cat). This led me to try the following modification to the code above:

[[ "${ARRAY_A[@]}" =~ ${VALUE_B}$ ]] && echo $VALUE_B is in both arrays

However, it only compares the values of ARRAY_B to the end of ARRAY_A as a whole, not the individual values of ARRAY_A. Again, this could be solved with a nested for loop, but I believe there is a better solution that will work more effectively at scale.

Ken White
Voidheart

1 Answer

While I could add in a nested for loop to go through and compare each value of ARRAY_A to each value in ARRAY_B

Do that.

this could be solved with a nested for loop

So solve it.

this solution would be inefficient on a larger scale

The presented code improperly handles spaces in elements and depends on the value of IFS. In short, it is buggy.

It also mishandles elements that contain one another, such as A=(inside) B=(side). The common fix is to pad both sides with spaces: [[ " ${ARRAY_A[*]} " =~ " $VALUE_B " ]].
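As a minimal sketch of that space-padding idea, using the arrays from the question: the function name report_padded_matches is hypothetical, and the approach still assumes that no element contains a space, so it narrows the bug rather than removing it.

```bash
#!/usr/bin/env bash
ARRAY_A=("table" "dog" "bird" "caterpillar")
ARRAY_B=("cup" "door" "table" "cat")

# Hypothetical helper: prints each ARRAY_B value that is a whole element of ARRAY_A.
# Assumes no element contains a space, because a space is the delimiter here.
report_padded_matches() {
  local IFS=' '                     # "${ARRAY_A[*]}" joins on the first character of IFS
  local joined=" ${ARRAY_A[*]} "    # pad both ends so the first and last elements match too
  local value
  for value in "${ARRAY_B[@]}"; do
    # A quoted right-hand side of =~ is matched literally, not as a regex.
    if [[ $joined =~ " $value " ]]; then
      echo "$value is in both arrays"
    fi
  done
}

report_padded_matches
```

With these arrays only "table" is reported, because " cat " does not occur in " table dog bird caterpillar ".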

Regex matching with =~ many times in a loop, and allocating memory to store "${ARRAY_A[@]}" each time, are costly operations that at "larger scale" will consume a lot of memory and CPU. Follow the rules of optimization: don't do it, and profile before doing it.

I maintain that a double loop will definitely be clearer, safer, and less buggy, and even faster at "larger scale", since Bash will not have to parse a regex or allocate memory for the concatenated elements. It is the way to go.
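A sketch of the double loop, again with the question's arrays (the function name report_common is made up for illustration):

```bash
#!/usr/bin/env bash
ARRAY_A=("table" "dog" "bird" "caterpillar")
ARRAY_B=("cup" "door" "table" "cat")

# Hypothetical helper: prints each ARRAY_B value that equals some ARRAY_A element.
report_common() {
  local a b
  for b in "${ARRAY_B[@]}"; do
    for a in "${ARRAY_A[@]}"; do
      # Quoting the right-hand side forces a literal string comparison,
      # so elements containing spaces or glob characters compare correctly.
      if [[ $a == "$b" ]]; then
        echo "$b is in both arrays"
        break   # stop scanning ARRAY_A once this value has been found
      fi
    done
  done
}

report_common
```

For the "ends with" variant from the question, the comparison could be changed to [[ $a == *"$b" ]], which matches when $a ends with the literal value of $b.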

(For "larger scale" in the sense of gigabytes of data, I wouldn't use Bash arrays in the first place, since I wouldn't want to hold whole arrays in memory. I would work on files with GNU tools. Assuming the elements are stored in files, one per line: sort array1 > array1sorted; sort array2 > array2sorted; comm -12 array1sorted array2sorted)
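The file-based pipeline could look roughly like this. The temporary directory and the sample data are assumptions for illustration, and the sketch presumes one element per line with no embedded newlines.

```bash
#!/usr/bin/env bash
# Hypothetical sample files, one element per line.
workdir=$(mktemp -d)
printf '%s\n' table dog bird caterpillar > "$workdir/array1"
printf '%s\n' cup door table cat > "$workdir/array2"

# comm requires both inputs to be sorted in the same collation order,
# so the locale is pinned for sort and comm alike.
LC_ALL=C sort "$workdir/array1" > "$workdir/array1sorted"
LC_ALL=C sort "$workdir/array2" > "$workdir/array2sorted"

# -12 suppresses the lines unique to file 1 and to file 2,
# leaving only the lines present in both.
common=$(LC_ALL=C comm -12 "$workdir/array1sorted" "$workdir/array2sorted")
echo "$common"

rm -r "$workdir"
```

Here only "table" is printed, and neither file has to fit in a Bash array.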

KamilCuk