
I am using the popular word cloud library with source: https://github.com/jasondavies/d3-cloud

I am using a clone of this block: http://bl.ocks.org/blockspring/847a40e23f68d6d7e8b5

For my data, I would like to set the maximum number of words the word cloud displays. The cloud has built-in functions for rotation, font size, spiral method, etc. However, there does not appear to be any built-in means of capping the number of words displayed.

I think it would be more computationally efficient to simply feed it a subset of the original word counts. I didn't see any .sort calls, so I'm not sure whether the word_count object is already sorted by frequency before it goes to cloud.js.

If cloud.js sorts the word_count object it receives by frequency, tf-idf, or whatever it uses, then I would have to wait until after it has built the list to return the top k words, which implies it still iterates through my whole text file.

I still think that if I display only the top k words (top as in most frequent, excluding the grammar words found in common_words), let's say 20, I will at least speed up the rendering (I'm not sure about speeding up the actual algorithm).

If that was not clear, here is a visual way to put it. The more frequently a word appears, the bigger its font size seems to be, which I think is an intuitive way to grasp cloud.js. So the top k words will be the k words with the largest font size.

So can someone with experience in this kind of visualization tell me where to tweak the code for returning top k words and how?

Note: I originally posted this question on the GitHub page, but it was marked as off-topic and I was advised to post here. My initial fear was that it would be marked as too vague for Stack Overflow, so I have since tried to make the question less abstract and provide as much information as I could. Please bear this in mind.

Thank you

Arash Howaida
  • I added a direct, terse programming question with regard to d3 at the end, as well as a visual approach to understanding my problem. – Arash Howaida Jan 15 '17 at 16:32

1 Answer


Perhaps

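// text_string, word_count, and common are defined elsewhere in the block
var words = text_string.split(/[ '\-\(\)\*":;\[\]|{},.!?]+/),
  limit = 5; // minimum number of occurrences a word needs to be kept
if (words.length === 1) {
  word_count[words[0]] = 1;
} else {
  words.forEach(function(word) {
    word = word.toLowerCase(); // reassign the parameter instead of redeclaring it
    if (word !== "" && common.indexOf(word) === -1 && word.length > 1) {
      if (word_count[word]) {
        word_count[word]++;
      } else {
        word_count[word] = 1;
      }
    }
  });
  // Drop every word that occurs fewer than `limit` times
  for (var word in word_count) {
    if (word_count[word] < limit) delete word_count[word];
  }
}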

You might want to add a counter: if too many words remain, raise `limit` and filter again until Object.keys(word_count).length < 20000.
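
One way to sketch this adjustment: the threshold is a minimum count, so the sketch below raises it step by step until few enough words remain. `pruneByFrequency` and `maxWords` are hypothetical names made up for illustration; they are not part of d3-cloud or the original block.

```javascript
// Hypothetical helper (not part of d3-cloud or the original block).
// Raises the minimum-frequency threshold until at most maxWords remain.
function pruneByFrequency(wordCount, maxWords) {
  var target = Math.max(0, maxWords);       // guard against negative input
  var kept = Object.assign({}, wordCount);  // work on a copy, leave the input alone
  var limit = 1;                            // every counted word occurs at least once
  while (Object.keys(kept).length > target) {
    limit++;
    Object.keys(kept).forEach(function (word) {
      if (kept[word] < limit) delete kept[word];
    });
  }
  return { words: kept, limit: limit };
}

// Made-up sample counts, for illustration only
var sample = { cloud: 9, word: 7, font: 4, size: 4, spiral: 2, rotate: 1 };
var result = pruneByFrequency(sample, 3);
console.log(result.limit, Object.keys(result.words));
```

Because words with equal counts are dropped together, the result can end up with fewer than `maxWords` entries (here "font" and "size" both go at the same step), so this gives an upper bound rather than an exact word count.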

mplungjan
  • Looks great, I'm experimenting with it now. Thank you for looking into it! – Arash Howaida Jan 15 '17 at 16:38
  • It's a great solution, just what I needed! I have tried it with large documents and it does really well. However, on smaller documents it tends to remove everything. So I will continue to experiment with different metrics for `limit`, maybe using standard deviations or some ratio that relates the word count to document length. – Arash Howaida Jan 15 '17 at 16:51
  • I am curious, why do you handle the single element case separately? – pintxo Jan 15 '17 at 16:57
  • I did not write that part but processing one word may end up creating an empty object – mplungjan Jan 15 '17 at 16:59
  • Just out of curiosity, is there a way to create a subset of the word_count object? In Python there is the slightly complicated comprehension method: `def get_range(dictionary, begin, end): return {k: v for k, v in dictionary.items() if begin <= k <= end}` We'd have to know where the important words are in word_count for it to work as well, does javascript have an easier means of accessing a subset of such an object? – Arash Howaida Jan 16 '17 at 02:35
  • My limit did exactly that (from 0 to 5) if I am not mistaken. Change to `if (word_count[word] > end || word_count[word] < begin)` or add the matches to a new object. Alternatively, sort: http://stackoverflow.com/questions/1069666/sorting-javascript-object-by-property-value – mplungjan Jan 16 '17 at 05:08
  • Ok, interesting, I will see if I can use that approach here. I was wondering because at some point I might be required to create a new object where I specify k (k being the total number of unique words). Although the limit approach effectively trims down many of the uncommon words, the length of the resulting word_count object varies between documents. So if there is a way to select only, let's say, 20 words, that might work universally (for big and small documents), whereas now special treatment is needed for each. – Arash Howaida Jan 16 '17 at 08:52
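
For the fixed-k case raised in the comments, a minimal sketch is to sort the keys of the count object by frequency and copy the first k into a new object. `topK` is a hypothetical helper name, and the sample counts are made up; the sketch assumes word_count is a plain object mapping each word to its count, as in the answer's code.

```javascript
// Hypothetical helper: returns a new object holding the k most frequent words.
// The input object is not modified.
function topK(wordCount, k) {
  return Object.keys(wordCount)
    .sort(function (a, b) { return wordCount[b] - wordCount[a]; }) // highest count first
    .slice(0, k)
    .reduce(function (subset, word) {
      subset[word] = wordCount[word];
      return subset;
    }, {});
}

// Made-up sample counts, for illustration only
var counts = { cloud: 9, word: 7, font: 4, size: 3, spiral: 2 };
var top3 = topK(counts, 3);
console.log(top3); // { cloud: 9, word: 7, font: 4 }
```

This always yields at most k words regardless of document size, so it should behave the same for short and long texts. The sort still touches every distinct word once, so the saving is in layout and rendering rather than in counting.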