I have a list of paragraphs, and I want to build a Zipf plot of the token frequencies over their combined text.

My code is below:

from itertools import *
from pylab import *
from collections import Counter
import matplotlib.pyplot as plt


paragraphs = " ".join(targeted_paragraphs)
for paragraph in paragraphs:
   frequency = Counter(paragraph.split())
counts = array(frequency.values())
tokens = frequency.keys()

ranks = arange(1, len(counts)+1)
indices = argsort(-counts)
frequencies = counts[indices]
loglog(ranks, frequencies, marker=".")
title("Zipf plot for Combined Article Paragraphs")
xlabel("Frequency Rank of Token")
ylabel("Absolute Frequency of Token")
grid(True)
for n in list(logspace(-0.5, log10(len(counts)-1), 20).astype(int)):
    dummy = text(ranks[n], frequencies[n], " " + tokens[indices[n]],
    verticalalignment="bottom",
    horizontalalignment="left")

When I first ran it, I got the following error and do not know why:

IndexError: index 1 is out of bounds for axis 0 with size 1

My goal is to draw a fitted line on this graph and assign its value to a variable, but I do not know how to add it. Any help with both of these issues would be much appreciated.

AlpU
  • it's no longer clear why the answer below is related to this question; please return the code in the post to its original state so that future readers can see the original problem and the solution – shortorian Aug 24 '16 at 04:54

1 Answer

I don't know what targeted_paragraphs looks like, but I got your error using:

targeted_paragraphs = ['a', 'b', 'c']

Based on that, the problem appears to be in how you set up the for loop. You index ranks and frequencies with values generated from the length of counts, but ranks, frequencies, and counts should all have the same length (as far as I can tell), so the largest generated index is one past the last valid position: an off-by-one error. Change the loop's upper bound to len(counts)-1, as below:

for n in list(logspace(-0.5, log10(len(counts)-1), 20).astype(int)):
    dummy = text(ranks[n], frequencies[n], " " + tokens[indices[n]],
                 verticalalignment="bottom",
                 horizontalalignment="left")
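
As a side note, the "size 1" in the error message may also point at the counting step rather than only at the loop bound. The sketch below assumes targeted_paragraphs is a list of strings (the sample data here is made up, since the real list is not shown). Iterating over the joined string walks it character by character, so the Counter is rebuilt each time and only the last one survives; counting once over the whole string gives one entry per distinct token. The list() calls are only needed on Python 3, where keys() and values() return views.

```python
from collections import Counter

import numpy as np

# Hypothetical stand-in for the asker's data (the real list is not shown).
targeted_paragraphs = ["the cat sat on the mat", "the dog sat"]

text = " ".join(targeted_paragraphs)

# Iterating over a string yields single characters, so a loop like the one
# in the question rebuilds the Counter per character and keeps only the last.
last_char_counter = None
for ch in text:
    last_char_counter = Counter(ch.split())

# Counting once over the whole joined string gives one entry per token.
frequency = Counter(text.split())
tokens = list(frequency.keys())               # list() makes tokens indexable
counts = np.array(list(frequency.values()))   # list() needed on Python 3

indices = np.argsort(-counts)                 # positions sorted by descending count
frequencies = counts[indices]
ranks = np.arange(1, len(counts) + 1)
```

With counts built this way, ranks, frequencies, and indices all have one entry per distinct token, which is what the plotting code expects.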
shortorian
  • I just edited the question. I managed to solve the IndexError by constructing the loop, and adding what you offered. Now, I need to add the fitted-line. @dshort – AlpU Aug 24 '16 at 03:53
  • It might seem redundant but the rules of the site state that there should be only one topic per question, so please open a new question for the fit. – shortorian Aug 24 '16 at 03:55
  • Just created a new question for that. Link: http://stackoverflow.com/questions/39114402/constructing-zipf-distribution-with-matplotlib-fitted-line @dshort – AlpU Aug 24 '16 at 04:24
  • I just tried the code again and got the following error: IndexError: index -9223372036854775808 is out of bounds for axis 0 with size 1 @dshort – AlpU Aug 24 '16 at 04:32
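
A possible explanation for that last error, offered as a guess: if counts holds a single entry, len(counts)-1 is 0 and log10(0) is -inf, so logspace produces non-finite values, and casting NaN to an integer typically yields exactly that huge negative number. The sketch below shows one way to pick label positions that always stay in range, plus a least-squares line fitted in log-log space for the "fitted line" part of the question. The helper names label_indices and fit_zipf are hypothetical, and the frequencies are made-up data that follow an exact power law.

```python
import numpy as np


def label_indices(n_tokens, n_labels=20):
    """Return up to n_labels log-spaced, valid 0-based indices (hypothetical helper)."""
    if n_tokens < 1:
        return np.array([], dtype=int)
    # Log-spaced ranks from 1 to n_tokens, then shift to 0-based positions.
    label_ranks = np.geomspace(1, n_tokens, num=n_labels).astype(int)
    return np.unique(label_ranks) - 1


def fit_zipf(ranks, frequencies):
    """Least-squares line in log-log space: log10(f) ~ slope * log10(r) + intercept."""
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)
    return slope, intercept


# Made-up frequencies that follow an exact power law, for illustration only.
ranks = np.arange(1, 101)
frequencies = 1000.0 / ranks

slope, intercept = fit_zipf(ranks, frequencies)
fitted = 10 ** intercept * ranks ** slope  # fitted line evaluated at each rank
```

The slope and intercept are then ordinary variables, and calling loglog(ranks, fitted) after the original loglog call should draw the line on the same axes. Looping over label_indices(len(counts)) instead of the logspace expression avoids both the out-of-range index and the NaN cast.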