I have a large English corpus named SubIMDB, and I want to build a list of all its words with their frequencies, that is, how many times each word appears in the whole corpus. This frequency list should have the following characteristics:
- Inflected forms of the same word, such as boy and boys or get and getting, should be counted under one lemma: if there are 3 occurrences of boy and 2 of boys, the list should show boy 5. However, this should not apply to irregular forms such as go and went (or foot and feet).
- I want to use this frequency list as a kind of dictionary: whenever I encounter a word in another part of the program, I want to check its frequency in this list. So it would be better if a word can be looked up without scanning the whole list.
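
To make the two requirements concrete, here is a minimal sketch of the behaviour I am after. It is only an illustration, not a real solution: `normalize` is a hypothetical toy helper that lowercases a word and strips a regular plural -s, so it does not handle cases like get and getting, and the sample text stands in for the corpus.

```python
import re
from collections import Counter


def normalize(word):
    """Hypothetical toy rule: lowercase and strip a regular plural -s."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_frequencies(text):
    """Count normalized word forms in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(normalize(token) for token in tokens)


# Stand-in for the corpus: 3 x "boy" and 2 x "boys", plus "go" and "went".
sample = "The boy saw two boys. A boy and the boys met another boy. I go, he went."
freq = build_frequencies(sample)

print(freq["boy"])      # boy and boys counted together
print(freq["go"])       # irregular forms stay separate
print(freq["went"])
print(freq["unknown"])  # a Counter returns 0 for words it has not seen
```

A `Counter` is a dict subclass, so checking one word is a direct key lookup rather than a scan of the whole list, which is the kind of access I need.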
My questions are:
- For the first problem, what should I do? Lemmatization or stemming? Or is there another way to get this?
- For the second, what data type should I use? A dictionary, a list, or something else?
- Is it best to save the list as CSV?
- Is there a ready-made toolkit for Python that does all of this?
Thank you so much.