
I have an open source dictionary / thesaurus and I want to find out the following about each word in it:

  1. The frequency of each word and its synonyms in any available open corpus. I could find some open corpora, for example on the Stanford NLP page, but none that provide word frequencies. Is there an open source word frequency corpus already available? If not, I am looking for pointers on how to build one.

  2. Is there any algorithm / heuristic that classifies words into different difficulty levels (e.g. very hard, difficult, medium, easy)? Although this is subjective, features such as rarity / frequency of use, ambiguity of meaning (i.e. usage in different senses), difficulty of spelling, and the number of letters in the word could be used. I am looking for an open source package I can use to extract these features, especially word frequency, and to build a corpus that classifies words by difficulty level. A rough sketch of the kind of heuristic I have in mind follows below.
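To make this concrete, here is the sort of rough Python sketch I have in mind; the corpus path and the frequency thresholds are just placeholders, not anything I have settled on:

    # Count word frequencies in a plain-text corpus and bucket each word into a
    # rough difficulty level. The thresholds are arbitrary and would need tuning.
    import re
    from collections import Counter

    def build_frequency_table(corpus_path):
        """Tokenize a plain-text file on letter runs and count occurrences."""
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                counts.update(re.findall(r"[a-z]+", line.lower()))
        return counts

    def difficulty(word, counts, total):
        """Very rough heuristic: rarer and longer words count as harder."""
        rel_freq = counts.get(word, 0) / total if total else 0.0
        if rel_freq > 1e-4 and len(word) <= 6:
            return "easy"
        if rel_freq > 1e-5:
            return "medium"
        if rel_freq > 1e-6:
            return "difficult"
        return "very hard"

    counts = build_frequency_table("corpus.txt")   # placeholder path
    total = sum(counts.values())
    for w in ["the", "cat", "ubiquitous", "sesquipedalian"]:
        print(w, counts[w], difficulty(w, counts, total))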

Shan
  • Here's one: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html It has frequencies for each word in each cluster. – Yasen Apr 11 '14 at 13:25
  • First question is a request for a resource, which is off-topic. The second question is rather broad. – Adrian McCarthy Nov 25 '15 at 21:31

1 Answer


1) The British National Corpus (BNC) is not open source, but you can find frequency lists here: http://www.kilgarriff.co.uk/bnc-readme.html
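If you use one of those lists, here is a minimal Python sketch of loading it into a lookup table. The column layout and the file name below are assumptions, so check the readme on that page for the list you actually download:

    # Load a whitespace-separated frequency list, assuming the first column is a
    # count and the second column is the word form (verify against the readme).
    def load_frequency_list(path):
        freqs = {}
        with open(path, encoding="latin-1") as f:  # the lists predate widespread UTF-8
            for line in f:
                parts = line.split()
                if len(parts) < 2 or not parts[0].isdigit():
                    continue  # skip headers or malformed lines
                count, word = int(parts[0]), parts[1]
                # keep the largest count if a word form appears under several PoS tags
                freqs[word] = max(count, freqs.get(word, 0))
        return freqs

    freqs = load_frequency_list("bnc_frequency_list.txt")  # hypothetical file name
    print(freqs.get("dictionary", 0))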

2) I don't know whether such a package exists. It looks like a supervised machine learning task to me. Just to give you a couple of ideas, you could use the following features:

- syllable count (see for example Detecting syllables in a word)
- lemmata count: more entries indicate ambiguity
- PoS candidate count (probably weaker than lemmata count)

An easy-to-use annotation and machine learning environment can be found here (Gate): https://gate.ac.uk/sale/tao/splitch19.html#x24-46100019.2
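To illustrate the supervised angle, here is a rough Python sketch using NLTK's WordNet and scikit-learn (my tool choice for the example, not something Gate requires). The training labels are made up; you would need a properly hand-labelled word list:

    # Classify words into difficulty levels from a few shallow features.
    import re
    from nltk.corpus import wordnet              # requires nltk.download('wordnet')
    from sklearn.tree import DecisionTreeClassifier

    def syllable_estimate(word):
        """Approximate syllable count by counting vowel groups."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def features(word):
        return [
            len(word),                           # spelling length
            syllable_estimate(word),             # syllable count
            len(wordnet.synsets(word)),          # sense count as an ambiguity proxy
        ]

    # Hypothetical hand-labelled training data.
    train = [("cat", "easy"), ("house", "easy"),
             ("frequency", "medium"), ("ambiguous", "medium"),
             ("sesquipedalian", "very hard"), ("perspicacious", "very hard")]

    clf = DecisionTreeClassifier().fit([features(w) for w, _ in train],
                                       [label for _, label in train])
    print(clf.predict([features("ubiquitous")]))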

jvdbogae