3

I have a large dataset that looks like this:

5 6 5 6 3 5
2 5 3 7 1 6
4 8 1 8 6 9
1 5 2 9 4 5

For every line, I want to subtract the first field from the second, the third from the fourth, and so on, depending on the number of fields (always even). Then, I want to report those lines for which the difference for every pair exceeds a certain limit (say 2). I should also be able to report the next best lines, i.e., lines in which one pairwise comparison fails to meet the limit but all other pairs meet it.

From the above example, if I set the limit to 2, my output file should contain the best lines:

2 5 3 7 1 6    # because (5-2), (7-3), (6-1) are all > 2
4 8 1 8 6 9    # because (8-4), (8-1), (9-6) are all > 2 

next best line(s)

1 5 2 9 4 5    # because except (5-4), both (5-1) and (9-2) are > 2

My current approach is to read every line, save each field as a variable, do subtraction. But I don't know how to proceed further.

Thanks,

arnstrm
  • 3
    You say that the number of fields is always even, but the example has an odd number of fields. – user295691 Nov 09 '12 at 17:19
  • 1
    Could you elaborate more on *"I should also be able to report next best lines"*? And could you provide a representative input and output? – Shawn Chin Nov 09 '12 at 17:20
  • please edit your question to include workable sample input and expected sample output AND any code you have tried and error messages. Good luck. – shellter Nov 09 '12 at 17:21
  • Thanks for all the input, I have made necessary changes. – arnstrm Nov 09 '12 at 17:58

5 Answers

3

Here's a bash-way to do it:

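#!/bin/bash
# Usage: ./script.sh THRESHOLD FILE

threshold=$1
file=$2

# Read all whitespace-separated numbers into one flat array.
a=($(cat "$file"))
# Fields per line = total numbers / number of lines
# (assumes every line has the same number of fields).
b=$(( ${#a[@]} / $(wc -l < "$file") ))

for ((r=0; r<${#a[@]}/b; r++)); do
    br=$((b*r))
    for ((c=0; c<b; c+=2)); do

        # Numeric comparison: stop at the first pair whose difference
        # does not exceed the threshold.
        if (( ${a[br + c+1]} - ${a[br + c]} <= threshold )); then
            break; fi

        # Last pair reached without breaking: print the whole line.
        if (( c+2 == b )); then
            echo "${a[@]:$br:$b}"; fi

    done
done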

Usage:

$ ./script.sh 2 yourFile.txt
2 5 3 7 1 6
4 8 1 8 6 9

This output can then easily be redirected:

$ ./script.sh 2 yourFile.txt > output.txt

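NOTE: this does not work properly if the file has empty lines between the data lines, because the number of fields per line is computed by dividing the count of numbers by the line count. But I'm sure the above will get you well on your way.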
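To see why blank lines matter, here is a minimal Python sketch of the same arithmetic. The function name `fields_per_line` is made up for illustration, and counting newline characters is assumed to stand in for what `wc -l` reports:

```python
def fields_per_line(text):
    """Mimic the script's estimate: total numbers // number of newlines."""
    total_numbers = len(text.split())
    line_count = text.count("\n")  # assumed equivalent of `wc -l`
    return total_numbers // line_count


compact = "5 6 5 6 3 5\n2 5 3 7 1 6\n4 8 1 8 6 9\n1 5 2 9 4 5\n"
spaced = compact.replace("\n", "\n\n")  # a blank line after every data line

print(fields_per_line(compact))  # 24 numbers over 4 lines -> 6
print(fields_per_line(spaced))   # 24 numbers over 8 lines -> 3
```

With the blank lines present, the estimate drops from 6 to 3, so the script would slice the flat array into the wrong rows.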

Rody Oldenhuis
3

Prints "best" lines to the file "best", and prints "next best" lines to the file "nextbest"

awk '
{
        fail_count=0
        for (i=1; i<NF; i+=2){
                if ( ($(i+1) - $i) <= threshold )
                        fail_count++
        }
        if (fail_count == 0)
                print $0 > "best"
        else if (fail_count == 1)
                print $0 > "nextbest"
}
' threshold=2 inputfile

Pretty straightforward stuff.

  1. Loop through fields 2 at a time.
  2. If (next field - current field) does not exceed threshold, increment fail_count.
  3. If that line's fail_count is zero, it belongs to the "best" lines. Otherwise, if that line's fail_count is one, it belongs to the "next best" lines.
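For comparison, the same steps can be sketched in Python. This is only an illustration under assumptions: the names `count_failures` and `classify` are invented here, and the lines are passed in as a list instead of being read from a file or written to "best" and "nextbest".

```python
def count_failures(line, threshold):
    """Count pairs whose (second - first) does not exceed the threshold."""
    numbers = [int(n) for n in line.split()]
    pairs = zip(numbers[::2], numbers[1::2])
    return sum(1 for first, second in pairs if second - first <= threshold)


def classify(lines, threshold):
    """Split lines into (best, next_best) by their number of failing pairs."""
    best, next_best = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        failures = count_failures(line, threshold)
        if failures == 0:
            best.append(line)
        elif failures == 1:
            next_best.append(line)
    return best, next_best


sample = ["5 6 5 6 3 5", "2 5 3 7 1 6", "4 8 1 8 6 9", "1 5 2 9 4 5"]
best, next_best = classify(sample, threshold=2)
print(best)       # ['2 5 3 7 1 6', '4 8 1 8 6 9']
print(next_best)  # ['1 5 2 9 4 5']
```

Lines with two or more failing pairs, such as the first sample line, end up in neither list.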

doubleDown
  • Hi doubleDown, I tried this solution, but it returns me this following error. awk: compare_f2.sh:2: awk ' awk: compare_f2.sh:2: ^ invalid char ''' in expression – arnstrm Nov 12 '12 at 15:43
  • 1
    Beats me. I copied the codes to a script, ran against the sample input, and it works fine. Searching up on the error message would probably help you debug this. – doubleDown Nov 13 '12 at 08:40
  • Sorry, it works great. I was making some stupid mistakes, that's all. I have already up voted and selected it as the best answer. Thanks very much! – arnstrm Nov 13 '12 at 13:56
1

I probably wouldn't do that in bash. Personally, I'd do it in Python, which is generally good for those small quick-and-dirty scripts.

If you have your data in a text file, you can read here about how to get that data into Python as a list of lines. Then you can use a for-loop to process each line:

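threshold = 2
results = []
with open("data.txt") as f: # "data.txt" is a placeholder file name
    content = f.read().splitlines()
for line in content:
    numbers = [int(n) for n in line.split()] # Split it into a list of numbers
    if not numbers:
        continue # Skip empty lines
    pairs = zip(numbers[::2], numbers[1::2]) # Pair up the numbers two and two.
    result = [y - x for (x, y) in pairs] # Subtract the first number in each pair from the second.
    if all(d > threshold for d in result): # Keep the line only if every difference exceeds the threshold.
        results.append(numbers)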
Tayacan
1

Yet another bash version:

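First, a check function that returns nothing but a result code:

function getLimit() {
    # Usage: getLimit LIMIT WANTDIFF FIELD...
    # Succeeds when exactly WANTDIFF pairs fail to exceed LIMIT.
    local pairs=0 count=0 limit=$1 wantdiff=$2
    shift 2
    while [ "$1" ] ;do
        # Count the pairs whose difference exceeds the limit.
        [ $(( $2-$1 )) -gt $limit ] && : $((count++))
        : $((pairs++))
        shift 2
    done
    test $((pairs-count)) -eq $wantdiff
}

Then: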

while read line ;do getLimit 2 0 $line && echo $line;done <file
2 5 3 7 1 6
4 8 1 8 6 9

and

while read line ;do getLimit 2 1 $line && echo $line;done <file
1 5 2 9 4 5
F. Hauri - Give Up GitHub
  • I think I am doing something wrong. I am unable to get any output from this function (prompt just blinks till I enter Ctrl+c). I included the getLimit in my .bashrc file, restarted the terminal and typed the above command. – arnstrm Nov 12 '12 at 15:41
  • @asurarocks: I'm sorry, forgot `shift 2` when I copied my solution! – F. Hauri - Give Up GitHub Nov 12 '12 at 18:35
  • Yes, It works now! very elegant method, but awk was much faster! Thanks very much for the answer. – arnstrm Nov 12 '12 at 19:02
0

If you can use awk

$ cat del1
5 6 5 6 3 5
2 5 3 7 1 6
4 8 1 8 6 9
1 5 2 9 4 5
1 5 2 9 4 5 3 9

$ cat del1 | awk '{
> printf "%s _ ",$0; 
> for(i=1; i<=NF; i+=2){
>     printf "%d ",($(i+1)-$i)}; 
>     print NF 
> }' | awk '{
> upper=0; 
> for(i=1; i<=($NF/2); i++){ 
>     if($(NF-i)>threshold) upper++
> }; 
> printf "%d _ %s\n", upper, $0}' threshold=2 | sort -nr
3 _ 4 8 1 8 6 9 _ 4 7 3 6
3 _ 2 5 3 7 1 6 _ 3 4 5 6
3 _ 1 5 2 9 4 5 3 9 _ 4 7 1 6 8
2 _ 1 5 2 9 4 5 _ 4 7 1 6
0 _ 5 6 5 6 3 5 _ 1 1 2 6

You can process result further according to your needs. The result is sorted by ‘goodness’ order.
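As one possible follow-up, the ranking can be reproduced without the intermediate text columns. A rough Python sketch, assuming the lines are already in memory; `goodness` and `rank_lines` are names invented for this example:

```python
def goodness(line, threshold):
    """Number of pairs whose (second - first) exceeds the threshold."""
    numbers = [int(n) for n in line.split()]
    return sum(1 for a, b in zip(numbers[::2], numbers[1::2]) if b - a > threshold)


def rank_lines(lines, threshold):
    """Return (goodness, line) tuples, highest goodness first."""
    scored = [(goodness(line, threshold), line) for line in lines if line.strip()]
    # sorted() is stable, so lines with equal goodness keep their input order.
    return sorted(scored, key=lambda item: item[0], reverse=True)


lines = [
    "5 6 5 6 3 5",
    "2 5 3 7 1 6",
    "4 8 1 8 6 9",
    "1 5 2 9 4 5",
    "1 5 2 9 4 5 3 9",
]
ranked = rank_lines(lines, threshold=2)
for score, line in ranked:
    print(score, "_", line)
```

Unlike the `sort -nr` pipeline, ties here stay in input order, so the order among lines with equal scores may differ from the output shown above.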

plhn