
I have an unsorted two-column data set in which most of the points lie on the diagonal y=x, but some do not.

I would like to show that most of the points really do lie on that line. A plain point plot, however, draws coincident points on top of each other, so the over-represented points collapse into one. The viewer then gets the impression that the data points are scattered randomly, because the occurrence count carries no weight.

Is there a way to give more weight to points that occur more than once, maybe through point size? I couldn't find anything on this topic.

Thanks a lot!

John Titor

1 Answer


You don't show any data, so I assumed something based on your description. As @Christoph already mentioned, you could use jitter or transparency to indicate that many data points lie at more or less the same location. However, transparency is limited to 256 levels (effectively 255, because you won't see a fully transparent point). So, in the extreme case of more than 255 points on top of each other, you won't see any difference from exactly 255 points on top of each other.

Basically, you're asking how to display the density of points. This reminds me of this question: How to plot (x,y,z) points showing their density

In the example below, a "pseudo" 2D histogram is created. I'm not aware of 2D histograms in gnuplot, so you have to build it as a 1D histogram and map it onto 2D: you divide the plot area into fields (bins) and count the number of points in each field. You then use this count either to set the point color via a palette or to set a variable point size.
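To make the mapping from 2D to 1D concrete, here is a minimal sketch of the bin numbering on its own. The names `nx`, `ix`, `iy` and `binNo` are made up for this illustration and are not used in the script below; the grid origin and bin size are assumed to match the values in the script.

```gnuplot
# Sketch: number the cells of a 2D grid row by row, starting at 1.
# nx, ix, iy and binNo are hypothetical names used only for this illustration.
xmin = -4.0; ymin = -4.0     # lower-left corner of the grid (assumed)
dx = 0.1; dy = 0.1           # cell size in x and y (assumed)
nx = 81                      # number of cells per row (assumed)

ix(x) = int(floor((x-xmin)/dx))    # 0-based column index
iy(y) = int(floor((y-ymin)/dy))    # 0-based row index
binNo(x,y) = iy(y)*nx + ix(x) + 1  # 1-based cell number

print binNo(-4.0, -4.0)     # first cell of the first row
print binNo(-3.85, -4.0)    # second cell of the first row
print binNo(-4.0, -3.85)    # first cell of the second row
```

With these assumed values the three calls should print 1, 2 and 82. All points that fall into the same cell get the same number, so an ordinary 1D frequency count over that number gives the number of points per cell.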

The code example will generate 5 ways to plot the data:

  1. solid points
  2. empty points
  3. transparent points
  4. colored points
  5. sized points (your question)

I leave it up to you to judge which way is suitable. Certainly it will depend pretty much on the data and your special case.

Code:

### different ways to show density of datapoints 
reset session

# create some random test data
set print $Data
    do for [i=1:1000] {
        x=invnorm(rand(0))
        y=x+invnorm(rand(0))*0.05
        print sprintf("%g %g",x,y)
    }
    do for [i=1:1000] {
        x=rand(0)*8-4
        y=rand(0)*8-4
        print sprintf("%g %g",x,y)
    }
set print

Xmin=-4.0; Xmax=4.0
Ymin=-4.0; Ymax=4.0
BinXSize = 0.1
BinYSize = 0.1
BinXCount = int((Xmax-Xmin)/BinXSize)+1
BinYCount = int((Ymax-Ymin)/BinYSize)+1
BinXNo(x) = floor((x-Xmin)/BinXSize)
BinYNo(y) = floor((y-Ymin)/BinYSize)
myBinNo(x,y) = (_tmp =BinYNo(y)*BinXCount + BinXNo(x), \
                _tmp < 0 || _tmp > BinXCount*BinYCount-1 ?  NaN : int(_tmp+1))

# get data into 1D histogram
set table $Bins
    plot [*:*][*:*] $Data u (myBinNo($1,$2)):(1) smooth freq
unset table

# initialize array all values to 0
array BinArr[BinXCount*BinYCount]
do for [i=1:BinXCount*BinYCount] { BinArr[i] = 0 }
# get histogram values into array
set table $Dummy
    plot myMax=NaN $Bins u ($2<myMax?0:myMax=$2, BinArr[int($1)] = int($2)) w table
unset table

myBinValue(x,y) = (_tmp2 = myBinNo(x,y), _tmp2>0 && _tmp2<=BinXCount*BinYCount ? BinArr[_tmp2] : NaN)

# point size settings
myPtSizeMin = 0.0
myPtSizeMax = 2.0
# linear mapping: count 0 -> myPtSizeMin, count myMax -> myPtSizeMax
myPtSize(x,y) = myBinValue(x,y)*(myPtSizeMax-myPtSizeMin)/myMax + myPtSizeMin

set size ratio -1
set xrange [Xmin:Xmax]
set yrange [Ymin:Ymax]
set key top center out opaque box

set multiplot layout 2,3

    plot $Data u 1:2 w p pt 7 lc "red"           ti "solid points"
    plot $Data u 1:2 w p pt 6 lc "red"     ti "empty points"
    plot $Data u 1:2 w p pt 7 lc "0xeeff0000"    ti "transparent points"
    set multiplot next
    plot $Data u 1:2:(myBinValue($1,$2)) w p pt 7 ps 0.5 palette z  ti "colored points"
    plot $Data u 1:2:(myPtSize($1,$2)) w p pt 7 ps var lc "web-blue" ti "sized points"

unset multiplot
### end of code

Result:

[Image: multiplot with the five plot variants listed above]

theozh