0

I have a file with two columns, X and Y positive, and non-gridded, data points (> 10^5 points).

1     0.9
0.9   1.1
0.5   1.25
2.6   0.9
3.1   2.6
2.9   2.55
4.1   0.9
1.2   6
5.5   2.5
6     4
4     7.2
.     .
.     .

I would like to generate an X-Y grid (of size binsize) in a selected range of those points. Besides, I would like to add a third column indicating the count of the original data points contained in a square area (binsize x binsize) of each of the vertices of the grid.

If binsize=5

2.5    2.5   7 
2.5    7.5   2
7.5    2.5   2
.       .    .
.       .    .

I would like to pass to the AWK program the range of data and the binsize.

I would appreciate your help very much.

EDIT:

The binsize is to determine the range of values in which I have to count the XY datapoints. The range input is to select the x and y values to count, for example, If I select x in [0,5] and y in [0,5] then I only will consider the binning of the first eight xy points. My real dataset is very big

Kaushik Nayak
  • 30,772
  • 5
  • 32
  • 45
pablo
  • 385
  • 2
  • 14
  • you did not post the input sample an expected result, why? – RomanPerekhrest Sep 11 '17 at 16:11
  • @RomanPerekhrest Sorry, I have edited the post to add an example. Thanks. – pablo Sep 11 '17 at 16:39
  • what is the significance of binsize here and how does it effect your result..? – Kaushik Nayak Sep 12 '17 at 15:21
  • @KaushikNayak If I reduce by 2 the binsize, for example to 2.5, we double the number of rows in the result file and the third column will have lower values as we are counting less number of points into the bin. – pablo Sep 12 '17 at 15:41
  • It is not clear. Can you post more examples for different binsizes and the output expected which i can relate with the data points in the source file?. Also why would the awk program need binsize as an argument. won't the range just be sufficient? – Kaushik Nayak Sep 12 '17 at 15:49
  • @KaushikNayak I apologize if it is not clear enough. The binsize is to determine the range of values in which I have to count the XY datapoints. The range input is to select the x and y values to count, for example, If I select x in [0,5] and y in [0,5] then I only will consider the binning of the first eight xy points. My real dataset is very big – pablo Sep 12 '17 at 16:16
  • @KaushikNayak In the post https://stackoverflow.com/questions/18089752/r-generate-2d-histogram-from-raw-data question is similar to this, but they solve the problem with R. I would like an AWK program because it is in the system and I can call it from Gnuplot. – pablo Sep 12 '17 at 16:26
  • @KaushikNayak Gnuplot does not have any way to make the count of a dataset and generate the image plot. Gnuplot needs an array of the values to plot or the XYZ. – pablo Sep 12 '17 at 16:49
  • Does the 1-d example in my answer here help? https://stackoverflow.com/a/46129480/2836621 – Mark Setchell Sep 13 '17 at 19:43
  • @MarkSetchell Is the way to bin 1-d but I can not figure out how to make apply this to 2D points. Thank you. – pablo Sep 13 '17 at 20:34
  • So you would need to pass to `awk` the following 5 parameters... `xmin, xmax, ymin, ymax, binsize`? I am unclear how you deduce you will only look at 8 points though? – Mark Setchell Sep 13 '17 at 20:45
  • If your x-range was [0,20] and your y-range [0,30] and your binsize was 5, how many rows of output would you expect? 24? What would the numbers be in the first couple of rows? Please edit your question and paste answer there rather than in comments. – Mark Setchell Sep 13 '17 at 20:53

1 Answers1

2

I think a solution could look something like this:

awk -v binsize=0.5 -v xmin=0 -v xmax=3 -v ymin=2 -v ymax=4 '
BEGIN {
   # Calculate number of x-bins and number of y-bins
   nx=int((xmax-xmin)/binsize)
   ny=int((ymax-ymin)/binsize)
   # Pre-zero all bins else empty entries will not show up in output
   for(x=0;x<nx;x++){
      for(y=0;y<ny;y++){
         output[x,y]=0
      }
   }
}

{
   # pick up x and y
   x=$1; y=$2

   # if this sample within x-range and y-range
   if(x>=xmin && x<=xmax && y>=ymin && y<=ymax){
      xindex=int((x-xmin)/binsize)
      yindex=int((y-ymin)/binsize)
      output[xindex,yindex]++;
      printf("DEBUG: x=%f, y=%f (line %d)\n",x,y,NR);
      printf("DEBUG: Incrementing bin [%d][%d]\n",xindex,yindex);
   }
}

END{
   # Print results
   for(x=0;x<nx;x++){
      for(y=0;y<ny;y++){
         printf("%d\t",output[x,y]);
      }
      printf("\n");
   }
} ' points.txt

And using this as input:

0.4   2.1
0.39  2.02
0.1   2.4
1     0.9
0.9   1.1
0.5   1.25
2.6   0.9
3.1   2.6
2.9   2.55

You get this as output:

DEBUG: x=0.400000, y=2.100000 (line 1)
DEBUG: Incrementing bin [0][0]
DEBUG: x=0.390000, y=2.020000 (line 2)
DEBUG: Incrementing bin [0][0]
DEBUG: x=0.100000, y=2.400000 (line 3)
DEBUG: Incrementing bin [0][0]
DEBUG: x=2.900000, y=2.550000 (line 9)
DEBUG: Incrementing bin [5][1]
3   0   0   0   
0   0   0   0   
0   0   0   0   
0   0   0   0   
0   0   0   0   
0   1   0   0
Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
  • Your solution is perfect, is the result I was looking. Sorry if my explanation of the problem was confusing. I am very grateful. – pablo Sep 14 '17 at 21:50
  • Gnuplot requires this type of matrix to plot intensity or heat maps. There was this missing part if one wants to plot from only the scattered dataset. R or Python already can make it but Gnuplot needs this calculation to be done through an external program. AWK is perfect. Regards. – pablo Sep 14 '17 at 21:58