
There are 2 files with the same structure. I want to output the differences in the other columns for rows that share the same value in a specific column.

#!/bin/bash
set -e

result_dir='/home/folder1'
sep=' '   # assumed separator; a single space matches the output shown below

#2 test files
cat << EOF > "$result_dir/acl_folder_old"
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF
cat << EOF > "$result_dir/acl_folder_new"
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF

#loop
changed=()
while read -r -u 5 OWNER GROUP FOLDER; do
    temp=''
    while read -r -u 6 OWNER_NEW GROUP_NEW FOLDER_NEW; do
        #exist in both old & new
        if [[ "$FOLDER" == "$FOLDER_NEW" ]]; then
            temp+=$FOLDER
            if [[ $OWNER != $OWNER_NEW ]]; then
                temp+=$sep$OWNER_NEW
            else
                temp+=$sep
            fi
            if [[ $GROUP != $GROUP_NEW ]]; then
                temp+=$sep$GROUP_NEW
            else
                temp+=$sep
            fi
            #changed?
            if [[ "$(echo -e "${temp}" | sed -e 's/[[:space:]]*$//')" != $FOLDER ]]; then
                changed+=($temp)
            fi
            break
        fi
    done 6<"$result_dir/acl_folder_new"

#old loop
done 5<"$result_dir/acl_folder_old"

echo -e "${changed[@]}"

The output is as below:

/home/me 4
/home/me/file 2  f

Everything is OK, but it is too slow when the files contain more than 10,000 lines, as noted in other posts (post1, post2).

How can I compare the columns of the 2 files without nesting `while read` loops?

kittygirl
  • I haven't tried to understand the logic of how the output is generated but generally speaking ... the current process is slow because of two main issues: **1)** repeated reads of the inner file and **2)** the sheer volume of OS-level commands/calls/sub-processing; it seems (to me) you might use `join` to pull the 2 files together into one and from there process the results with a single `while` loop (a rough sketch follows these comments); another option would be to push all of this logic into a single `awk` call, though you'll want to watch memory usage (predominantly for `awk` array storage); having said that ... – markp-fuso Jul 11 '21 at 17:47
  • it would help if you could update the question to provide a description of what you're trying to accomplish – markp-fuso Jul 11 '21 at 17:49
  • @markp-fuso, I've already updated the question. Yes, I am looking for a single `awk` call to speed this up. – kittygirl Jul 11 '21 at 17:59
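
A rough sketch of the `join` idea from the first comment, assuming the folder column contains no spaces (which the sample /home/me/file 2 violates), both inputs sorted on that column, and the file names from the script above; matched rows come out in sorted order:

join -1 3 -2 3 <(sort -k3,3 "$result_dir/acl_folder_old") <(sort -k3,3 "$result_dir/acl_folder_new") |
    awk '{
        out = $1                          # folder (the join field)
        if ($2 != $4) out = out OFS $4    # owner differs: append new owner
        if ($3 != $5) out = out OFS $5    # group differs: append new group
        if (out != $1) print out
    }'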

4 Answers


The term "folder" is from Windows. In Unix the equivalent is a "directory". The following will accommodate spaces in your directory names (as you have in your sample input with /home/me/file 2 but that's not adequate to test that a given script accommodates it) and will work using any awk in any shell on every Unix box:

$ cat tst.sh
#!/usr/bin/env bash

result_dir='/home/directory1'
mkdir -p "$result_dir" || exit

#2 test files
cat << EOF > "$result_dir/old"
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF

cat << EOF > "$result_dir/new"
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF

awk '
{
    # split off the first two space-delimited fields; whatever follows
    # them is the directory name, which may itself contain spaces
    match($0,/^([^ ]+ ){2}/)
    dir = substr($0,RLENGTH+1)
    $0 = substr($0,1,RLENGTH-1)
}
NR==FNR {
    # first file (old): remember its owner/group fields keyed by directory
    olds[dir] = $0
    next
}
dir in olds {
    # second file (new): print the directory plus any field that changed
    split(olds[dir],old)
    for (i=1; i<=NF; i++) {
        if ($i != old[i]) {
            print dir, $i
        }
    }
}
' "$result_dir/old" "$result_dir/new"

$ ./tst.sh
/home/me 4
/home/me/file 2 f
Ed Morton

UPDATE: OP has recently commented that only mawk is available; I don't have access to mawk so not sure if the following is going to work ...

Assumptions:

  • two input files: old and new
  • both files have 3 fields (space delimited) we'll label user, group and folder
  • while sample data shows single character user and group values, will assume these could be multi-character
  • user and group fields do not contain white space
  • the folder field can contain white space (eg, /home/me/file 2)

Objectives:

  • if field #3 (folder) has a match in both files and ...
  • the user and/or group is different then ...
  • print the folder name and the field(s) that are different from the new file; format: folder [user(new)] [group(new)]

Sample data:

$ cat old
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
9 X /home/both/are/diff er ent

$ cat new
1 a /home
4 b /home/me                                 # user is different
6 f /home/me/file 2                          # group is different
5 c /home/oth/file
124 long_group /home/both/are/diff er ent    # user and group are different

NOTE: the comments in file new do not actually exist in the file; they are only added here to highlight what should be flagged as different

One awk idea:

awk '

FNR==NR { folder=""                             # file #1 processing
          for (i=3; i<NF; i++)
              folder=folder $(i) OFS
          folder=folder $(NF)

          user[folder]=$1
          group[folder]=$2
          next
        }

        { folder=""                             # file #2 processing
          for (i=3; i<NF; i++)
              folder=folder $(i) OFS
          folder=folder $(NF)

          output=folder

          if (folder in user) {
              if ( $1 != user[folder]  ) output=output OFS $1
              if ( $2 != group[folder] ) output=output OFS $2
          }

          if ( output != folder )        print output
        }
' old new

This generates:

/home/me 4
/home/me/file 2 f
/home/both/are/diff er ent 124 long_group
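
If only mawk is available, one way to run the same program (a sketch; compare.awk is just a placeholder name for a file holding the text between the single quotes) is:

mawk -f compare.awk old new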
markp-fuso
  • Regarding `I don't have access to mawk so not sure if the following is going to work` - that script will behave the same in any awk. – Ed Morton Jul 12 '21 at 13:32

This is why you want to avoid nested loops: for every line in acl_folder_old you read and process the entire file acl_folder_new. If both files have 10,000 lines, then you're reading 100,010,000 (= 10,000 + 10,000 * 10,000) lines in total -- plus you're launching a sed subprocess for every matched line. If you read each file only once, then you're reading a total of 20,000 lines. You're right to reach for an awk solution.

awk will be faster than bash, but here's a bash solution for comparison. This requires bash 4.0+ for associative arrays:

#!/usr/bin/env bash

declare -A owners
declare -A groups

while read -r owner group path; do
    owners[$path]=$owner
    groups[$path]=$group
done < "$result_dir/acl_folder_old"

while read -r owner group path; do
    new_owner=""; new_group=""
    if [[ -n ${owners[$path]} ]]; then
        [[ $owner != "${owners[$path]}" ]] && new_owner=$owner
        [[ $group != "${groups[$path]}" ]] && new_group=$group
        if [[ -n $new_owner || -n $new_group ]]; then
            # using semicolon as the sep char
            printf '%s;%s;%s\n' "$path" "$new_owner" "$new_group"
        fi
    fi
done < "$result_dir/acl_folder_new"

output

/home/me;4;
/home/me/file 2;;f
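
If a later step needs to pick those records apart again, splitting on the same separator works; a sketch, where changes.txt is a hypothetical file holding the output above:

while IFS=';' read -r path new_owner new_group; do
    [[ -n $new_owner ]] && echo "owner of '$path' changed to $new_owner"
    [[ -n $new_group ]] && echo "group of '$path' changed to $new_group"
done < changes.txt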
glenn jackman

You could just use GAWK instead:

BEGIN {
   while (getline < "old.txt") {
      owner = $1
      group = $2
      folder = $3
      old[folder]["owner"] = owner
      old[folder]["group"] = group
   }
   while (getline < "new.txt") {
      owner = $1
      group = $2
      folder = $3
      if (folder in old) {
         if (owner != old[folder]["owner"] || group != old[folder]["group"]) {
            print
         }
      }
   }
}

or PHP:

<?php

foreach (file('old.txt', FILE_IGNORE_NEW_LINES) as $r) {
   $c = explode(' ', $r);
   $folder = $c[2];
   $old[$folder]['owner'] = $c[0];
   $old[$folder]['group'] = $c[1];
}

foreach (file('new.txt', FILE_IGNORE_NEW_LINES) as $r) {
   $c = explode(' ', $r);
   $owner = $c[0];
   $group = $c[1];
   $folder = $c[2];
   if (key_exists($folder, $old)) {
      if ($owner != $old[$folder]['owner'] || $group != $old[$folder]['group']) {
         echo $r, "\n";
      }
   }
}
Zombo
  • `while (getline < "old.txt")` would spin off into an infinite loop if there was a problem opening `old.txt`. It's also the opposite of idiomatic awk to write while-read loops in the BEGIN section to read input and then not have a main body in the script, as reading input is what awk does by default in the main body of the script. See http://awk.freeshell.org/AllAboutGetline for when and how to call `getline`. – Ed Morton Jul 11 '21 at 18:50
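
For reference, a minimal sketch of the safer pattern that comment describes, testing getline's return value (illustrative only, not part of the answer above):

awk '
BEGIN {
    # getline returns 1 on success, 0 at end of file and -1 on error,
    # so testing for "> 0" avoids looping forever if old.txt cannot be opened
    while ((getline line < "old.txt") > 0) {
        n = split(line, f, " ")
        # f[1] = owner, f[2] = group, f[3] .. f[n] = the folder (may contain spaces)
    }
    close("old.txt")
}'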