I have an open-source dictionary/thesaurus and I want to find out the following about each word in it:
The frequency of the word and its synonyms in any available open corpus. I could find some open corpora, for example on the Stanford NLP page, but none of them is a word frequency corpus. Is there an open-source word frequency corpus already available? If not, I am looking for pointers on how to build one; a rough sketch of what I have in mind is below.
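For illustration, this is roughly how I imagine deriving a frequency table from an open corpus myself, using NLTK's Brown corpus purely as a stand-in (the choice of corpus is an assumption; any open corpus would do):

```python
from collections import Counter

import nltk
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# Count lowercase alphabetic tokens across the whole Brown corpus (~1M words).
freq = Counter(w.lower() for w in brown.words() if w.isalpha())
total = sum(freq.values())

def relative_frequency(word: str) -> float:
    """Occurrences of `word` per million corpus tokens."""
    return freq[word.lower()] / total * 1_000_000

print(relative_frequency("house"))       # common word -> high frequency
print(relative_frequency("obsequious"))  # rare word -> low or zero frequency
```

The Brown corpus is small by modern standards, so I assume the same counting approach would need a much larger corpus to give stable frequencies for rare words.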
Is there an algorithm or heuristic that classifies words into difficulty levels (e.g. very hard, hard, medium, easy)? This is subjective, but features such as rarity/frequency of use, ambiguity of meaning (i.e. usage in different senses), difficulty of spelling, and the number of letters in the word could be used to classify them. I am looking for an open-source package I can use to extract these features, especially word frequency, and build a corpus that classifies words by difficulty level.
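To make the second question concrete, here is a toy sketch of the kind of heuristic I have in mind, combining rarity, word length, and sense ambiguity. The feature thresholds and weights are made up, and counting WordNet senses via NLTK is just one possible proxy for ambiguity:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def difficulty(word: str, freq_per_million: float) -> str:
    """Toy heuristic: score a word on rarity, length, and number of senses,
    then bucket the score into four levels. All thresholds are arbitrary."""
    score = 0
    # Rarity: rarer words are presumably harder.
    if freq_per_million < 1:
        score += 3
    elif freq_per_million < 10:
        score += 2
    elif freq_per_million < 100:
        score += 1
    # Length: longer words tend to be harder to spell.
    if len(word) > 10:
        score += 2
    elif len(word) > 6:
        score += 1
    # Ambiguity: many WordNet senses suggest more ways to misread the word.
    if len(wn.synsets(word)) > 5:
        score += 1
    levels = ["easy", "medium", "hard", "very hard"]
    return levels[min(score // 2, 3)]

print(difficulty("house", 500.0))      # frequent, short -> "easy"
print(difficulty("obsequious", 0.5))   # rare, longish   -> "hard"
```

If a package already computes features like these (or a validated difficulty scale exists), I would much rather use that than my ad hoc scoring.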