I have a simple 9-column file. I want to compute certain statistics for each column and then plot them (using gnuplot).

1) This is how I compute statistics for every column excluding the first one.

stats 'data' every ::2 name "stats"

2) The output shows that the operation succeeded. Note that the number of columns/records is 8:

* FILE: 
  Records:      8
  Out of range: 0
  Invalid:      0
  Blank:        0
  Data Blocks:  1

* COLUMNS:
  Mean:          6.5000       491742.6625
  Std Dev:       2.2913          703.4865
  Sum:          52.0000       3.93394e+06
  Sum Sq.:     380.0000       1.93449e+12

  Minimum:       3.0000 [0]   490312.0000 [2]
  Maximum:      10.0000 [7]   492643.5000 [7]
  Quartile:      4.5000       491329.5000
  Median:        6.5000       491911.1500
  Quartile:      8.5000       492252.2500

  Linear Model: y = 121.8 x + 4.91e+05
  Correlation:  r = 0.3966
  Sum xy:       2.558e+07

3) Now I can access the statistics of the first 2 columns by appending _x and _y, like this:

print stats_median_x
print stats_median_y

My questions are:

  • How can I access statistics (let's say medians) for the remaining 6 columns?
  • How could I plot, let's say, a line over all medians against some X axis?

I know that I can simply add a python script to pre-compute all this, but I would prefer to avoid it if there is an easy way to do it using gnuplot itself.

Thanks!

kirbo

1 Answer

Short answer(s)

  • "How can I access the statistics of the other columns?"
    With stats 'data' using n you get the statistics of the nth column.
  • "How can I plot, for example, all medians?"
    Use set print together with a do for loop to write the statistics to a data file, then plot that file.

A working solution

    set print "StatDat.dat" 
    do for [i=2:9] { # Here you will use i for the column.
      stats  'data.dat' u i nooutput ; 
      print i, STATS_median, STATS_mean , STATS_stddev # ...
    } 
    set print
    plot "StatDat.dat" us 1:2 # or whatever column you want...
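To get a line over all medians, as the question asks, the final plot command can be given an explicit style. The following is only a sketch: it assumes StatDat.dat was written by the loop above (column 1 = column index, 2 = median, 3 = mean, 4 = standard deviation); the axis labels and titles are illustrative choices.

```gnuplot
# Assumes "StatDat.dat" was written by the loop above, with columns:
#   1 = column index, 2 = median, 3 = mean, 4 = stddev
set xlabel "column index"
set ylabel "value"

# Medians joined by a line; means drawn with +/- one stddev as error bars.
plot "StatDat.dat" using 1:2 with linespoints title "median", \
     "StatDat.dat" using 1:3:4 with yerrorbars title "mean +/- stddev"
```

If the columns live on very different scales, plotting each statistic in its own plot may read better.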

Some more words about it
Asking gnuplot itself with help stats reveals a lot of interesting things :-).

Syntax:
stats 'filename' [using N[:M]] [name 'prefix'] [[no]output]
This command prepares a statistical summary of the data in one or two columns of a file. The using specifier is interpreted in the same way as for plot commands. See plot for details on the index, every, and using directives.

  • The first sentence tells us that stats prepares statistics for one or at most two columns per call (a pity; let's see what the future brings).
  • The second sentence tells us that using follows the same syntax as in the plot command:
    so stats 'data' using 3 gives you the statistics of the 3rd column (treated as x),
    and stats 'data' using 4:5 those of the 4th and 5th columns as x and y.
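As a small illustration of the two forms, the sketch below assumes a whitespace-separated file named 'data' with at least 5 numeric columns; the prefixes C3 and P are arbitrary names chosen for this example. With a single column the variables carry no suffix, whereas with two columns _x and _y are appended.

```gnuplot
# Assumption: 'data' is a whitespace-separated file with >= 5 numeric columns.

# One column: variables are named <prefix>_median, <prefix>_mean, ...
stats 'data' using 3 name "C3" nooutput
print "median of column 3: ", C3_median

# Two columns: the same variables get an _x or _y suffix.
stats 'data' using 4:5 name "P" nooutput
print "median of column 4: ", P_median_x
print "median of column 5: ", P_median_y
```

Without the name option the default prefix is STATS, which is what the loop in the working solution relies on.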

Notes about your interpretations

  1. You said

    This is how I compute statistics for every column excluding the first one.
    stats 'data' every ::2 name "stats"

    Not really: this computes the statistics of the first two columns, excluding the first two lines. The every counter starts from 0, not from 1.

  2. As a consequence of the above interpretation, when we read

    Records: 8

    it means that 8 lines were processed: your file had 10 (usable) lines, you specified every ::2 and thereby skipped the first two, so 8 records were used for the statistics.
    This also clarifies what help stats means by

    STATS_records           # total number of in-range data records
    

    implying "used to compute this statistic".
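The effect of every ::2 on the record count can be checked with a throwaway file. This is a sketch only: every_demo.dat is a hypothetical file name, and the 10-line content is generated just for the test.

```gnuplot
# Write a hypothetical 10-line, 2-column test file.
set print "every_demo.dat"
do for [i=1:10] { print i, i*i }
set print

# All lines: every record is used.
stats "every_demo.dat" using 1 nooutput
print "records without every: ", STATS_records

# every ::2 starts at point number 2 (counting from 0),
# so the first two lines are skipped.
stats "every_demo.dat" every ::2 using 1 nooutput
print "records with every ::2: ", STATS_records
```

Following the explanation above, the second count should be two lower than the first.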

Tested on gnuplot 4.6 patchlevel 4
Working on gnuplot Version 5.0 patchlevel 1

Hastur
  • Thanks! I was under the wrong assumption that "stats" was able to process all the columns in a file in a single call. Work around is very helpful! – kirbo Nov 18 '15 at 06:27
  • You're welcome. It would be useful, but it seems it is still not available. By the way, if your files are longer than 10 lines (we may start with 10 lines and end up with 1 million), you may want to halve the execution time by using two columns in each pass and printing x and y on two subsequent lines... In the example I preferred to keep it plain. With another trick you can avoid dumping the statistical data into a temporary file... but for the sake of sanity it's better to keep it plain, especially if you will reuse the code after some time. Memory is limited, at least mine :-) – Hastur Nov 18 '15 at 07:46
  • @starfry ...but this is another question! `;-)` Since you work in a _Linux like_ environment you can use `firstrow = system('head -1 '.data)` then `print word(firstrow, i), STATS_median, STATS_mean , STATS_stddev # ...` in the `do` cycle... – Hastur Jul 28 '18 at 16:29
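
The two-columns-per-pass idea mentioned in the comments could look roughly like the sketch below. It assumes the same 'data.dat' and 'StatDat.dat' file names as the working solution and an even number of columns to process (2 to 9); the helper variable j is introduced only for this example. Whether it really halves the run time depends on the file and is not measured here.

```gnuplot
set print "StatDat.dat"
# Process columns in pairs: (2,3), (4,5), (6,7), (8,9).
do for [i=2:8:2] {
    j = i + 1
    stats 'data.dat' using i:j nooutput
    # With two columns the variables carry _x (column i) and _y (column j).
    print i, STATS_median_x, STATS_mean_x, STATS_stddev_x
    print j, STATS_median_y, STATS_mean_y, STATS_stddev_y
}
set print
plot "StatDat.dat" using 1:2 with linespoints title "median"
```

Note that in a two-column pass a line counts as a record only if both columns are valid on it, so the results can differ from single-column passes when the file has missing values.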