How to find the difference between the values of two fields from two files and print only if the difference is >10 using shell

Question

Let's say I have two files, a.txt and b.txt. The content of a.txt and b.txt is as follows:

a.txt:

abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00

b.txt:

abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00

So let's say these files have various fields separated by "|" and can have any number of lines. Also, assume that both files are sorted, so that line N of one file corresponds to line N of the other. Now, I want to compare fields 8 and 9 of each row in a.txt with the same fields of the corresponding row in b.txt. If either difference is greater than 10, print the lines; otherwise remove the lines from the files.

That is, in the given example, for the first line I compute |10-11| = 1 for field 8 and |0-0| = 0 for field 9. Both differences are less than 10, so we delete this line from the files.

For the second line, the field 8 difference is |11-22| = 11, so we print this line. (There is no need to check 19-18: if the difference for either field 8 or field 9 is greater than 10, we print the line.)

So the output is

a.txt:

dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00

b.txt:

dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00

So you mean you calculate the absolute difference between a.txt's field 8 and b.txt's field 8, and also the absolute difference between a.txt's field 9 and b.txt's field 9, and if either difference exceeds 10 you print the lines, else you remove them? — Mark Setchell, Apr 07 '14 at 10:26

score 3 · Answer 1 · edited Nov 13 '15 at 08:59

You can write a bash shell script that does it:

while true; do
  read -r lineA <&3 || break
  read -r lineB <&4 || break

  vara_8=$(echo "$lineA" | cut -f8 -d "|")
  varb_8=$(echo "$lineB" | cut -f8 -d "|")
  vara_9=$(echo "$lineA" | cut -f9 -d "|")
  varb_9=$(echo "$lineB" | cut -f9 -d "|")

  if ((    vara_8-varb_8 > 10 || vara_8-varb_8 < -10
        || vara_9-varb_9 > 10 || vara_9-varb_9 < -10 )); then
    echo "$lineA" >> newA.txt
    echo "$lineB" >> newB.txt
  fi

done 3<a.txt 4<b.txt
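
A variant of the loop above, offered as a sketch rather than a drop-in replacement: bash's read can split each line on | into an array, which avoids starting four cut processes per pair of lines. The array names fa and fb and the scratch directory are illustrative assumptions; the sample data is taken from the question.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Sample data from the question.
cat > a.txt <<'EOF'
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
EOF
cat > b.txt <<'EOF'
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
EOF

# Read one line from each file per iteration; stop when either file ends.
while IFS= read -r lineA <&3 && IFS= read -r lineB <&4; do
  # Split on "|" into zero-based arrays: field 8 is index 7, field 9 is index 8.
  IFS='|' read -r -a fa <<< "$lineA"
  IFS='|' read -r -a fb <<< "$lineB"

  d8=$(( ${fa[7]} - ${fb[7]} ))
  d9=$(( ${fa[8]} - ${fb[8]} ))

  # Keep the pair if either difference is outside the range -10..10.
  if (( d8 > 10 || d8 < -10 || d9 > 10 || d9 < -10 )); then
    printf '%s\n' "$lineA" >> newA.txt
    printf '%s\n' "$lineB" >> newB.txt
  fi
done 3<a.txt 4<b.txt

# Show the kept lines.
cat newA.txt newB.txt
```

Note that bash arithmetic treats numbers with a leading zero as octal, so a field value such as 08 would fail here; for non-negative values, a 10# prefix (as in 10#${fa[7]}) forces base 10.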

score 3 · Answer 2 · answered Apr 07 '14 at 10:37

You can do this with awk:

awk -F\| 'FNR==NR{x[FNR]=$0;eight[FNR]=$8;nine[FNR]=$9;next} {d1=eight[FNR]-$8;d2=nine[FNR]-$9;if(d1>10||d1<-10||d2>10||d2<-10){print x[FNR] >> "newa";print $0 >> "newb"}}' a.txt b.txt

Explanation

The -F sets the field separator to the pipe symbol. The stuff in curly braces after FNR==NR applies only to the processing of a.txt. It says to save the whole line in array x[] indexed by line number (FNR) and also to save the eighth field in array eight[] also indexed by line number. Likewise field 9 is saved in array nine[].

The second set of curly braces applies to processing file b. It calculates the differences d1 and d2. If either exceeds 10, the line is printed to each of the files newa and newb.
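
As a quick check, the command can be tried on the two sample lines from the question. This is only a sketch: the scratch directory is my assumption, added so that the >> appends start from empty files.

```shell
# Work in a scratch directory so newa and newb start out empty.
cd "$(mktemp -d)" || exit 1

# Sample data from the question.
cat > a.txt <<'EOF'
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
EOF
cat > b.txt <<'EOF'
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
EOF

# The one-liner from the answer, unchanged.
awk -F\| 'FNR==NR{x[FNR]=$0;eight[FNR]=$8;nine[FNR]=$9;next} {d1=eight[FNR]-$8;d2=nine[FNR]-$9;if(d1>10||d1<-10||d2>10||d2<-10){print x[FNR] >> "newa";print $0 >> "newb"}}' a.txt b.txt

# Line 1: d1 = 10-11 = -1, d2 = 0-0 = 0   -> dropped
# Line 2: d1 = 11-22 = -11 (below -10)    -> kept in both output files
cat newa newb
```

Because the script appends with >>, running it a second time in the same directory adds the same lines again, so remove newa and newb before a rerun.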

score 0 · Answer 3 · edited May 23 '17 at 12:23

For short files

Use the method provided by Mark Setchell, shown below in an expanded and slightly modified version:

parse.awk

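# Split fields on the pipe symbol
BEGIN { FS = "|" }

# awk has no built-in abs(), so define one
function abs(v) { return v < 0 ? -v : v }

# First file (a.txt): remember the line and its fields 8 and 9
FNR==NR {
  x[FNR] = $0
  m[FNR] = $8
  n[FNR] = $9
  next
}

# Second file (b.txt): print both lines if either difference exceeds 10
{
  if(abs(m[FNR] - $8) > 10 || abs(n[FNR] - $9) > 10) {
    print x[FNR] >> "newa"
    print $0     >> "newb"
  }
}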

Run it like this:

awk -f parse.awk a.txt b.txt

For huge files

The method above reads a.txt into memory. If the file is very large, this becomes unfeasible and streamed parsing is called for.

It can be done in a single pass, but that requires careful handling of the multiplexed lines from a.txt and b.txt. A less error-prone approach is to identify the relevant line numbers and then extract those lines into new files. An example of the latter approach is shown below.

First you need to identify the matching lines:

# Extract fields 8 and 9 from a.txt and b.txt
paste <(awk -F'|' '{print $8, $9}' OFS='\t' a.txt) \
      <(awk -F'|' '{print $8, $9}' OFS='\t' b.txt) | 

# Check whether the fields match the criteria and print the line number
awk '$1 - $3 > n || $3 - $1 > n || $2 - $4 > n || $4 - $2 > n { print NR }' n=10 > linesfile

Now we are ready to extract the lines from a.txt and b.txt, and as the numbers are sorted, we can use the extract.awk script proposed here (repeated for convenience below):

extract.awk

BEGIN {
  getline n < linesfile
  if(length(ERRNO)) {
    print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
    exit
  }
}

NR == n { 
  print
  if(!(getline n < linesfile)) {
    if(length(ERRNO))
      print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
    exit
  }
}

Extract the lines (can be run in parallel):

awk -v linesfile=linesfile -f extract.awk a.txt > newa
awk -v linesfile=linesfile -f extract.awk b.txt > newb
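
For comparison, the single-pass alternative mentioned above can be sketched with paste, which joins line N of a.txt and line N of b.txt with a tab so that awk only ever holds one pair in memory. This is an illustration under two assumptions of mine: neither file contains tab characters, and both files have the same number of lines. The names a, b and abs are likewise illustrative.

```shell
# Work in a scratch directory with the sample data from the question.
cd "$(mktemp -d)" || exit 1
cat > a.txt <<'EOF'
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
EOF
cat > b.txt <<'EOF'
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
EOF

# Each record is "<line of a.txt><TAB><line of b.txt>".
paste a.txt b.txt |
awk -F'\t' -v n=10 '
  function abs(v) { return v < 0 ? -v : v }
  {
    # Split each half on "|" to reach fields 8 and 9.
    split($1, a, "|")
    split($2, b, "|")
    if (abs(a[8] - b[8]) > n || abs(a[9] - b[9]) > n) {
      print $1 > "newa"   # original line from a.txt
      print $2 > "newb"   # original line from b.txt
    }
  }'

cat newa newb
```

Inside awk, > truncates each output file once when it is first opened and then keeps writing to it, so reruns do not accumulate lines the way >> does.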
