My goal is to take a simple .dat file and plot both the actual data and the theoretical points of a perfect Zipf distribution, that is, a distribution where every item's value is proportional to 1/rank.

For instance, my data for most followed Instagram accounts is:

# List of most followed users on instagram
# By rank and millions of followers
# From Wikipedia
# https://en.wikipedia.org/wiki/List_of_most_followed_users_on_Instagram
# rank, millions of followers

1 222
2 120
3 105
4 101
5 101
6 100
7 99 
8 93 
9 86 
10 85
11 80
12 79
13 76
14 73
15 71
16 69
17 67
18 65
19 63
20 63

From another thread I learned that I can append a third column with the ideal Zipf value for each rank (in this case 222, 111, 74, 55.5, etc.) and then add a second plot with `'' using 1:3`. However, this requires doing the calculation by hand and appending it to the original file, which is the step I'm trying to avoid. Is it possible to compute these values at plot time instead? And how could I extend this to other distributions or calculations on the data?

Andycyca

1 Answer

Use stats to calculate the maximum value of the second column with

stats 'file.dat' u 2 nooutput
max = STATS_max
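
As a rough sketch of what `stats` provides here (assuming the data shown in the question is saved as `file.dat`), the command fills a set of `STATS_*` variables for the selected column, which can be inspected with `print`:

```gnuplot
# Sketch: inspect what stats computed for column 2 of file.dat
stats 'file.dat' u 2 nooutput      # nooutput suppresses the summary table
print "records: ", STATS_records   # number of data rows read
print "min:     ", STATS_min
print "max:     ", STATS_max       # 222 for the data listed in the question
```

Lines starting with `#` in the data file are treated as comments and skipped, so the header does not need to be removed first.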

Then compute the ideal Zipf value for each rank inside the `using` expression as `(max/$1)`, where `$1` is the rank read from the first column:

plot 'file.dat' u 1:2 pt 7 t 'data',\
     '' u 1:(max/$1) w l t 'ideal Zipf'
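
The same pattern should extend to other distributions or calculations, since any expression in parentheses inside `using` is evaluated per row. A minimal sketch, assuming the same `file.dat`; the function names `zipf` and `expdecay`, the variable `ymax`, and the parameter values are illustrative choices, not gnuplot built-ins:

```gnuplot
# Sketch: generalise the reference curve to any function of the rank
stats 'file.dat' u 2 nooutput
ymax = STATS_max                            # value of the top-ranked item

# Hypothetical helper functions (illustrative names, not built-ins)
zipf(r, s)     = ymax / r**s                # s = 1 gives the classic 1/rank law
expdecay(r, k) = ymax * exp(-k * (r - 1))   # exponential decay anchored at rank 1

plot 'file.dat' u 1:2 pt 7 t 'data', \
     '' u 1:(zipf($1, 1.0))     w l t 'Zipf, s = 1', \
     '' u 1:(zipf($1, 0.5))     w l t 'power law, s = 0.5', \
     '' u 1:(expdecay($1, 0.1)) w l t 'exponential, k = 0.1'
```

Defining the curves as functions keeps the data file untouched and lets you change the model by editing one line.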
Christoph
  • I'm still a newbie. What exactly are you doing with `stats`? If I understand the Zipf distribution (which I may not), shouldn't the ideal distribution here be *below* the data? For instance, the second data point is 120, but the ideal value would be 111, wouldn't it? Unless I'm missing something. Sorry for the follow-up :S – Andycyca May 12 '17 at 20:51
  • I use stats to calculate the maximum value of the second column. Regarding the data you are right, the first part should simply be `u 1:2` – Christoph May 12 '17 at 20:58