I have a CSV file and want to create a histogram from column 6. With Linux utilities this is simple:

└──> cut -f6 -d, data.csv | sort | uniq -c | sort -k2,2n
    563 0.0
     72 0.025
     35 0.05
     22 0.075
     14 0.1
     21 0.125
     14 0.15
     10 0.175
      5 0.2
      3 0.225
      7 0.25
      3 0.275
      6 0.3
      5 0.325
      3 0.35
      1 0.375
      3 0.4
      1 0.425
      3 0.45
      3 0.475
      5 0.5
      7 0.525
     11 0.55
      3 0.575
      4 0.6
      3 0.625
     11 0.65
      5 0.675
      9 0.7
      5 0.725
      7 0.75
      8 0.775
      5 0.8
      3 0.825
      3 0.85
      4 0.875
      2 0.9
      1 0.925
      1 0.975
    109 1.0

But I would like to plot it with gnuplot. My attempt was to modify a script that I found; this is my modified version:

#!/usr/bin/gnuplot -p
# http://psy.swansea.ac.uk/staff/carter/gnuplot/gnuplot_frequency.htm

clear
reset

set datafile separator ",";
# set term dumb

set key off
set border 3

# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
set yzeroaxis

# Each bar is half the (visual) width of its x-range.
set boxwidth 0.05 absolute
set style fill solid 1.0 noborder

bin_width = 0.1;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )

# MAKE BINS
# plot dataset_path using (rounded($6)):(6) smooth frequency with boxes

# DO NOT MAKE BINS
plot "data.csv" using 6:6 smooth frequency with boxes

This is the result:

[image: the resulting plot]

It shows something completely different from the Unix tools. In gnuplot I've seen various types of histograms: some follow a normal-distribution pattern, others are ordered by frequency (as if I replaced the last sort -k2,2n with sort -n), and others are ordered by the values the histogram was created from (my case). It would be nice if I could choose.

Glorfindel
Wakan Tanka

1 Answer

smooth frequency makes the data monotonic in x (i.e. the value given in the first using column, in your case the numerical value from column 6) and then sums up the y-values (the values given in the second using column) of all points that share the same x.

Here you also give the sixth column as y, so gnuplot sums the values themselves, which is wrong if you want to count the number of occurrences of each distinct value in the sixth column. Use using 6:(1) instead, i.e. the numerical value 1 as the second column, to count the actual number of occurrences of each value:

set style fill solid noborder
set boxwidth 0.8 relative
set datafile separator ','
plot 'nupic_out.csv' using 6:(1) smooth frequency with boxes notitle
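To see why 6:6 and 6:(1) give such different pictures, here is a rough Python model of the summing step. It is only a sketch of the behaviour, not gnuplot's implementation; the function name smooth_frequency and the sample values are made up for illustration.

```python
from collections import defaultdict


def smooth_frequency(points):
    """Sum the y-values of all points that share the same x, sorted by x."""
    totals = defaultdict(float)
    for x, y in points:
        totals[x] += y
    return sorted(totals.items())


# Hypothetical sample standing in for column 6 of the CSV file.
column6 = [0.0, 0.0, 0.0, 0.025, 0.025, 1.0]

# Like "using 6:6": y is the value itself, so each bar is value * count.
summed_values = smooth_frequency([(v, v) for v in column6])

# Like "using 6:(1)": y is the constant 1, so each bar is the count.
counted = smooth_frequency([(v, 1) for v in column6])

print(summed_values)
print(counted)
```

With y taken from the data, the bar at 0.0 stays at height 0 no matter how many rows contain 0.0, which is consistent with the plot disagreeing with uniq -c.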

To apply a logscale to the smoothed data, you must first save them to a temporary file with set table ...; plot and then plot this temporary file.

set datafile separator ','
set table 'tmp.dat'
plot 'nupic_out.csv' using 6:(1) smooth frequency with lines
unset table

Pay attention here: a bug in gnuplot adds a wrong last line to the output file, which you must skip. You can skip it either with a filter in the using statement, e.g.

plot 'tmp.dat' using (strcol(3) eq "i" ? $1 : 1/0):2 with boxes

which works fine here, or you could use head to cut the last two lines (a negative line count requires GNU head), like

plot '< head -n-2 tmp.dat' using 1:2 with boxes

Another point to note is that gnuplot always writes its data files with whitespace as the separator, so you must change the data file separator back to whitespace before plotting tmp.dat.
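As an illustration of the two points above (whitespace-separated fields and the filter on the third column), here is a small Python sketch. The layout of x, y and a one-letter flag is assumed from the strcol(3) eq "i" test; the sample text is made up and is not real gnuplot output, and the flag "x" on its last row is only a placeholder for "anything other than i".

```python
def read_table_rows(text):
    """Keep (x, y) from whitespace-separated rows whose third field is "i"."""
    rows = []
    for line in text.splitlines():
        fields = line.split()  # split on any run of whitespace
        if len(fields) < 3 or fields[0].startswith("#"):
            continue  # skip blank lines, short lines and comment lines
        if fields[2] != "i":
            continue  # same test as strcol(3) eq "i" in the using filter
        rows.append((float(fields[0]), float(fields[1])))
    return rows


# Hypothetical sample, NOT copied from a real tmp.dat.
sample = """\
# x y type
0 563 i
0.025 72 i
1 109 i
1 109 x
"""

rows = read_table_rows(sample)
print(rows)
```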

A full working script could be

set style fill solid noborder
set boxwidth 0.8 relative
set datafile separator ','

set table 'tmp.dat'
plot 'nupic_out.csv' using 6:(1) smooth frequency with lines notitle
unset table

set datafile separator whitespace
set logscale y
set yrange [0.8:*]
set autoscale xfix
plot '< head -n-2 tmp.dat' using 1:2 with boxes notitle

Now, to use a binning function for the values in the sixth column, replace the 6 in using 6:(1) with a function that operates on the value in the sixth column. The function call must be enclosed in (), and inside it you reference the current value of the sixth column as $6, like

plot 'nupic_out.csv' using (bin($6)):(1) smooth frequency with lines

Again, a full working script, using ChrisW's binning function, could be

set style fill solid noborder
set datafile separator ','

set boxwidth 0.09 absolute
Min = -0.05
Max = 1.05
n = 11.0
width = (Max-Min)/n
bin(x) = width*(floor((x-Min)/width)+0.5) + Min

set table 'tmp.dat'
plot 'nupic_out.csv' using (bin($6)):(1) smooth frequency with lines notitle
unset table

set datafile separator whitespace
set logscale y
set xrange [-0.05:1.05]
set tics nomirror out
plot '< head -n-2 tmp.dat' using 1:2 with boxes notitle
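To check what the binning function does to individual values, the same formula can be tried out in Python. This is a sketch with the settings from the script (Min = -0.05, Max = 1.05, n = 11); the name bin_center and the sample values are made up for illustration, and the values are deliberately chosen away from bin edges.

```python
import math
from collections import Counter

MIN, MAX, N = -0.05, 1.05, 11.0
WIDTH = (MAX - MIN) / N  # about 0.1 for these settings


def bin_center(x):
    """Return the centre of the bin of width WIDTH that contains x."""
    return WIDTH * (math.floor((x - MIN) / WIDTH) + 0.5) + MIN


# Hypothetical values standing in for column 6.
values = [0.0, 0.025, 0.5, 0.525, 0.975, 1.0]

# Round the centres so that tiny floating-point differences do not split a bin.
counts = Counter(round(bin_center(v), 6) for v in values)
for center, count in sorted(counts.items()):
    print(center, count)
```

Each bin is centred on a multiple of 0.1, so 0.0 and 0.025 end up in the same bar. Values that sit exactly on a bin edge can land on either side because of floating-point rounding, which is one of the details of a binning function worth checking.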

Community
Christoph
  • Thank you buddy, this worked. Just a few questions: 1. Can I use a log scale on y? I did `set logscale y 10` but the output was a mess (also with other numbers). 2. Can I reorder the x axis somehow (as I described in the OP)? 3. What is the purpose of enclosing the numbers inside ()? I've tried various versions with () here `seq 1 12 | xargs -n 2 echo | gnuplot -p -e "set style fill solid noborder; set boxwidth 0.3 relative; plot '-' using 2:1 smooth frequency with boxes notitle"` but I cannot figure it out. Thank you – Wakan Tanka May 19 '15 at 21:35
  • 1. No, you cannot directly use a log scale on smoothed data; I'll extend my answer to include a possible solution for this. 2. Yes of course: instead of `6` you can use any expression you want in the `using` statement, like `using (rounded($6)):(1)` or similar. But pay attention to the details of your binning function, see [this answer](http://stackoverflow.com/a/19596160/2604213). 3. `using 6:1` would use the sixth column for x and the first column for the y-value. Enclosing something in () makes gnuplot evaluate it as an expression: `(1)` gives the number 1, not the value in the first column – Christoph May 19 '15 at 21:55