
I have an open source dictionary / thesaurus and I want to find out the following about each word in it:

  1. The frequency of each word and its synonyms in any available open corpus. I could find some open corpora, for example on the Stanford NLP page, but none that provide word frequencies. Is there an open source word frequency corpus already available? If not, I am looking for pointers on how to build one.

  2. Is there any algorithm / heuristic that classifies words into different difficulty levels (e.g. very hard, difficult, medium, easy)? Although this is subjective, features such as rarity / frequency of use, ambiguity of meaning (i.e. usage in different senses), difficulty of spelling, and the number of letters in the word could be used. I am looking for an open source package I can use to extract these features, especially word frequency, and to build a corpus that classifies words by difficulty level. A rough sketch of the kind of heuristic I have in mind follows below.
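To make this concrete, here is the sort of rough Python sketch I have in mind; the corpus path and the frequency thresholds are just placeholders, not anything I have settled on:

    # Count word frequencies in a plain-text corpus and bucket each word into a
    # rough difficulty level. The thresholds are arbitrary and would need tuning.
    import re
    from collections import Counter

    def build_frequency_table(corpus_path):
        """Tokenize a plain-text file on letter runs and count occurrences."""
        counts = Counter()
        with open(corpus_path, encoding="utf-8") as f:
            for line in f:
                counts.update(re.findall(r"[a-z]+", line.lower()))
        return counts

    def difficulty(word, counts, total):
        """Very rough heuristic: rarer and longer words count as harder."""
        rel_freq = counts.get(word, 0) / total if total else 0.0
        if rel_freq > 1e-4 and len(word) <= 6:
            return "easy"
        if rel_freq > 1e-5:
            return "medium"
        if rel_freq > 1e-6:
            return "difficult"
        return "very hard"

    counts = build_frequency_table("corpus.txt")   # placeholder path
    total = sum(counts.values())
    for w in ["the", "cat", "ubiquitous", "sesquipedalian"]:
        print(w, counts[w], difficulty(w, counts, total))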

Shan
  • Here's one: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html It has frequencies for each word in each cluster. – Yasen Apr 11 '14 at 13:25
  • First question is a request for a resource, which is off-topic. The second question is rather broad. – Adrian McCarthy Nov 25 '15 at 21:31

1 Answer


1) The British National Corpus (BNC) is not open source, but you can find frequency lists here: http://www.kilgarriff.co.uk/bnc-readme.html
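If you use one of those lists, here is a minimal Python sketch of loading it into a lookup table. The column layout and the file name below are assumptions, so check the readme on that page for the list you actually download:

    # Load a whitespace-separated frequency list, assuming the first column is a
    # count and the second column is the word form (verify against the readme).
    def load_frequency_list(path):
        freqs = {}
        with open(path, encoding="latin-1") as f:  # the lists predate widespread UTF-8
            for line in f:
                parts = line.split()
                if len(parts) < 2 or not parts[0].isdigit():
                    continue  # skip headers or malformed lines
                count, word = int(parts[0]), parts[1]
                # keep the largest count if a word form appears under several PoS tags
                freqs[word] = max(count, freqs.get(word, 0))
        return freqs

    freqs = load_frequency_list("bnc_frequency_list.txt")  # hypothetical file name
    print(freqs.get("dictionary", 0))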

2) I don't know whether such a package exists. It looks like a supervised machine learning task to me. Just to give you a couple of ideas, you could use the following features:

- syllable count (see for example Detecting syllables in a word)
- lemmata count: more entries indicate ambiguity
- PoS candidate count (probably weaker than lemmata count)

An easy-to-use annotation and machine learning environment can be found here (Gate): https://gate.ac.uk/sale/tao/splitch19.html#x24-46100019.2
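To illustrate the supervised angle, here is a rough Python sketch using NLTK's WordNet and scikit-learn (my tool choice for the example, not something Gate requires). The training labels are made up; you would need a properly hand-labelled word list:

    # Classify words into difficulty levels from a few shallow features.
    import re
    from nltk.corpus import wordnet              # requires nltk.download('wordnet')
    from sklearn.tree import DecisionTreeClassifier

    def syllable_estimate(word):
        """Approximate syllable count by counting vowel groups."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def features(word):
        return [
            len(word),                           # spelling length
            syllable_estimate(word),             # syllable count
            len(wordnet.synsets(word)),          # sense count as an ambiguity proxy
        ]

    # Hypothetical hand-labelled training data.
    train = [("cat", "easy"), ("house", "easy"),
             ("frequency", "medium"), ("ambiguous", "medium"),
             ("sesquipedalian", "very hard"), ("perspicacious", "very hard")]

    clf = DecisionTreeClassifier().fit([features(w) for w, _ in train],
                                       [label for _, label in train])
    print(clf.predict([features("ubiquitous")]))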

jvdbogae