11

EDIT

I read the question that this is supposed to be a duplicate of, and I don't agree. In that question the aim is to get the frequencies of individual numbers in the column. If I apply that solution to my problem, I'm still left with my initial problem of grouping the frequencies of the numbers in a particular range into the final histogram. That is, if that solution tells me that the frequency of 0.45 is 2 and that of 0.44 is 1 (for my input data), I still have to group those two frequencies into a total of 3 for the range 0.4-0.5.

END EDIT

QUESTION-

I have a long column of data with values between 0 and 1. This will be of the type-

0.34
0.45
0.44
0.12
0.45
0.98
.
.
.

A long column of decimal values with repetitions allowed.

I'm trying to change it into a histogram sort of output such as (for the input shown above)-

0.0-0.1  0
0.1-0.2  1
0.2-0.3  0
0.3-0.4  1 
0.4-0.5  3
0.5-0.6  0
0.6-0.7  0
0.7-0.8  0
0.8-0.9  0
0.9-1.0  1

Basically the first column has the lower and upper bounds of each range and the second column has the number of entries in that range.

I wrote it (badly) as-

for i in $(seq 0 0.1 0.9)
do 
    awk -v var=$i '{if ($1 > var && $1 < var+0.1 ) print $1}' input | wc -l; 
done

Which basically does a wc -l of the entries it finds in each range.

Output formatting is not a part of the problem. If I simply get the frequencies corresponding to the different bins, that will be good enough. Also please note that the bin size should be a variable, like in my proposed solution.

I already read this answer and want to avoid the loop. I'm sure there's a much much faster way in awk that bypasses the for loop. Can you help me out here?

Chem-man17

3 Answers

16

Following the same algorithm as in my previous answer, I wrote a script in awk which is extremely fast (look at the picture). [image]

The script is the following:

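#!/usr/bin/awk -f

BEGIN {
    bin_width = 0.1    # width of each bin; change this value to rebin
}
{
    # Index of the bin this value falls into. The small offset puts a
    # value lying exactly on a boundary (e.g. 0.5) into the lower bin.
    bin = int(($1 - 0.0001) / bin_width)
    hist[bin]++        # unset elements start at 0, so no "in" test is needed
}
END {
    # "for (h in hist)" visits the bins in unspecified order,
    # and bins without entries are not printed.
    for (h in hist)
        printf " * > %2.2f  ->  %i \n", h * bin_width, hist[h]
}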

bin_width is the width of each channel. To use the script, copy it into a file, make it executable (with chmod +x <namefile>) and run it with ./<namefile> <name_of_data_file>.
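In case it helps, here is a minimal sketch of the same one-pass idea with the bin width passed in from the command line and the bins printed in order, including the empty ones. It is not the script above: the function name hist, the variable w, the 1e-9 guard against floating-point round-off, and the assumption that all values lie between 0 and 1 are mine.

```shell
# Hypothetical helper (not part of the answer above).
# Usage: hist <bin_width> < data_file
# Assumes one value per line, all between 0 and 1.
hist() {
    awk -v w="$1" '
        BEGIN { n = int(1 / w + 0.5) }    # number of bins covering [0, 1]
        NF {                              # skip empty lines
            # Bin index = value / width, truncated. The tiny offset guards
            # against round-off such as 0.3 / 0.1 = 2.9999999999999996.
            b = int($1 / w + 1e-9)
            if (b >= n) b = n - 1         # the value 1.0 goes into the last bin
            count[b]++
        }
        END {
            # Walk the bins by index so they come out in order, zeros included
            for (b = 0; b < n; b++)
                printf "%.2f-%.2f %d\n", b * w, (b + 1) * w, count[b] + 0
        }'
}

# Sample values from the question
printf '%s\n' 0.34 0.45 0.44 0.12 0.45 0.98 | hist 0.1
```

With the six sample values and a width of 0.1 this gives the counts 0, 1, 0, 1, 3, 0, 0, 0, 0, 1 from the question. Calling hist 0.05 produces 20 bins instead. Unlike the script above, a value exactly on a boundary is counted in the upper bin.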

Riccardo Petraglia
13

For this specific problem, I would drop the last digit, then count occurrences of sorted data:

cut -b1-3 | sort | uniq -c

which gives, on the specified input set:

  1 0.1
  1 0.3
  3 0.4
  1 0.9

Output formatting can be done by piping through this awk command:

| awk 'BEGIN{b=0}
       {n=int($2*10+0.5); while(b<n){printf "%1.1f-%1.1f %3d\n",b/10,(b+1)/10,0;b++}
       printf "%1.1f-%1.1f %3d\n",n/10,(n+1)/10,$1; b=n+1}
       END{while(b<10){printf "%1.1f-%1.1f %3d\n",b/10,(b+1)/10,0;b++}}'
mouviciel
  • Haha, points for creativity but that's not a very good answer. With this logic of defining the bins, I'm stuck with having to do it at 0.1 intervals. If I want to do it at 0.05 intervals (having 20 bins instead of 10) then it won't be possible to change. – Chem-man17 Sep 22 '16 at 09:29
  • This is why I started my answer with _For this specific problem_. Bin size was not variable in the question. – mouviciel Sep 22 '16 at 09:34
  • Yup, you're right. Thanks for this answer which is very clever. But won't really help in the general case. – Chem-man17 Sep 22 '16 at 09:34
  • Doesn't work for floats, but a very useful answer for many other use cases, thanks! – David Parks Dec 14 '21 at 16:08
4

The only loop you will find in this algorithm is the one over the lines of the file.

This is an example of how to do what you asked in bash. Bash is probably not the best language for this, since it is slow at math. I use bc; you can use awk if you prefer.

How the algorithm works

Imagine you have many bins: each bin corresponds to an interval. Each bin is characterized by a width (CHANNEL_DIM) and a position. The bins, all together, must cover the entire interval over which your data are spread. Dividing the value of your number by bin_width gives the position of its bin, so you just have to add 1 to the count of that bin.
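To make the division step concrete, the following sketch prints the bin index for each of the sample values in the question, assuming a bin width of 0.1. The function name bin_index is made up for this illustration. It truncates the quotient, like the awk approach, whereas the bash script below rounds it to the nearest integer.

```shell
# Hypothetical illustration: print each value next to the index of its bin.
# Usage: bin_index <bin_width> < data_file
bin_index() {
    awk -v bin_width="$1" '{ printf "%s -> bin %d\n", $1, int($1 / bin_width) }'
}

# Sample values from the question, bin width 0.1
printf '%s\n' 0.34 0.45 0.44 0.12 0.45 0.98 | bin_index 0.1
```

Both 0.45 and 0.44 land in bin 4, so that bin's counter is incremented three times for the sample data.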

#!/bin/bash

# This is the input: you can use $1 and $2 to read input as cmd line arguments
FILE='bash_hist_test.dat'
CHANNEL_NUMBER=9  # They are actually 10: 0 is already a channel

# Check the max and the min to define the dimension of the channels:
MAX=$(sort -n "$FILE" | tail -n 1)
MIN=$(sort -n "$FILE" | head -n 1)

# Define the channel width
CHANNEL_DIM_LONG=$(echo "($MAX-$MIN)/($CHANNEL_NUMBER)" | bc -l)
CHANNEL_DIM=$(printf '%2.2f' "$CHANNEL_DIM_LONG")
# Probably printf is not the best function in this context because
#+the result could be system dependent.

# Determine the channel for a given number
# Usage: find_channel <number_to_histogram> <width_of_histogram_channel>
function find_channel(){
  local NUMBER=$1
  local CHANNEL_DIM=$2

  # The channel is found by dividing the value by the channel width and
  #+rounding the result to the nearest integer.
  local RESULT_LONG
  RESULT_LONG=$(echo "$NUMBER/$CHANNEL_DIM" | bc -l)
  printf '%.0f\n' "$RESULT_LONG"
}

# Read the file and do the computation
while IFS='' read -r line || [[ -n "$line" ]]; do

  CHANNEL=$(find_channel "$line" "$CHANNEL_DIM")

  # Expand the array element: testing the literal text HIST[...] is never empty
  [[ -z ${HIST[$CHANNEL]} ]] && HIST[$CHANNEL]=0
  (( HIST[CHANNEL] += 1 ))
done < "$FILE"

# HIST is sparse, so loop over the indices that are actually set.
# Counting the values instead would mislabel every channel after an empty one.
for counter in "${!HIST[@]}"; do
  CHANNEL_START=$(echo "$CHANNEL_DIM * $counter - .04" | bc -l)
  CHANNEL_END=$(echo "$CHANNEL_DIM * $counter + .05" | bc)
  printf '%+2.1f : %2.1f => %i\n' "$CHANNEL_START" "$CHANNEL_END" "${HIST[$counter]}"
done

Hope this helps. Comment if you have other questions.

Riccardo Petraglia
  • Thanks for your well commented script. I didn't understand what your variables `$NUMBER` and `$CHANNEL_DIM` are representing. – Chem-man17 Sep 22 '16 at 08:34
  • @VarunM I tried to explain it a little bit... but the best way to understand how it works is testing it and checking the output at different points. – Riccardo Petraglia Sep 22 '16 at 08:50
  • Your script works for the example case (short input) but takes **way** too long for my actual input which has about half a million entries. You're right maybe that bash is not the best language for this. Thanks for your answer though! +1 since it works for my example in the question. – Chem-man17 Sep 22 '16 at 09:27
  • @VarunM New script in awk. The only way to get something faster is using C, to my knowledge... ;) – Riccardo Petraglia Sep 22 '16 at 11:15