
I have a large English corpus named SubIMDB and I want to make a list of all the words with their frequencies, i.e., how many times each one appears in the whole corpus. This frequency list should have the following characteristics:

  1. Words like boy and boys, or grammatical variants such as get and getting, should count as the same word (lemma): if there are 3 occurrences of boy and 2 of boys, the list should show Boy 5. However, this should not apply to irregular forms such as go and went (or foot and feet).
  2. I want to use this frequency list as a kind of dictionary, so whenever I see a word in another part of the program I can check its frequency in this list. So it would be better if it is searchable without scanning the whole thing.

My questions are:

  1. For the first requirement, what should I do: lemmatize, stem, or something else?
  2. For the second, what data type should I store it in: a dictionary, a list, or something else?
  3. Is it best to save it as CSV?
  4. Is there any ready-made Python toolkit that does all of this?

Thank you so much.

    Questions that ask "where do I start?" are typically too broad and are not a good fit for this site. People have their own method for approaching the problem and because of this there cannot be a _correct_ answer. Give a good read over [Where to Start](https://softwareengineering.meta.stackexchange.com/questions/6366/where-to-start/6367#6367) and [edit] your post. – Patrick Artner Jan 13 '19 at 17:37
  • asking for library recommendations is off-topic, and asking multiple and/or unspecific questions in one question is also discouraged. What did you try, where did you research, and what is your problem with the code you used to solve this? Googling `python lemma stemmer` naturally leads to NLTK and to duplicates on this very site, e.g. this one [how-do-i-do-word-stemming-or-lemmatization](https://stackoverflow.com/questions/771918/how-do-i-do-word-stemming-or-lemmatization) – Patrick Artner Jan 13 '19 at 17:39
  • @PatrickArtner well, I know where to start, I just have those questions and I want to know the opinion of others about it. I am a little confused about lemmatizing or stemming them, and also about what the best way to make that frequency list is in their opinion. Where do you think I should ask this? Linguistics on Stack Exchange? – Alireza M. Kamelabad Jan 13 '19 at 17:43
  • @PatrickArtner I know the NLTK lemmatizer and stemmer and I am able to work with them. I am asking which one is better here, considering that I just need to get rid of the grammatical part. I also searched GitHub and the whole web but could not find a good Python toolkit that automatically makes frequency lists from a .txt file. I wanted to ask people here if they know any. – Alireza M. Kamelabad Jan 13 '19 at 17:47

1 Answer


As pointed out above, the question is opinion-based and vague, but here are some directions:

  1. Both will work for your case. Stemming is usually simpler and faster; I suggest starting with nltk's PorterStemmer. If you need sophisticated lemmatization, take a look at spaCy, which IMO is the industry standard.
  2. You need a dictionary, which gives you amortized O(1) lookup once you have your stem/lemma. collections.Counter may also be useful.
  3. Depends on your use case. CSV is more "portable"; pickle may be easier to use.
  4. There are a lot of "building blocks" in nltk and spaCy; building your pipeline/models is up to you. Minimal sketches of points 1-3 follow below.
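
Here is a rough sketch of points 1 and 2, assuming your corpus is a plain-text file; the function name and file path are just placeholders, and you may need to download nltk's punkt tokenizer data first:

```python
from collections import Counter

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def build_frequency_list(path):
    """Count stem frequencies in a plain-text corpus file."""
    stemmer = PorterStemmer()
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            for token in word_tokenize(line.lower()):
                if token.isalpha():                  # skip punctuation and numbers
                    counts[stemmer.stem(token)] += 1
    return counts

# Placeholder path -- adjust to wherever your SubIMDB text lives.
# freqs = build_frequency_list("subimdb.txt")
# freqs["boy"]  -> combined count of "boy" and "boys"
# freqs["get"]  -> combined count of "get", "gets", "getting"
```

To look up an arbitrary word later, stem it first (`freqs[stemmer.stem(word)]`) so it matches the keys. Note that Porter stemming leaves irregular forms like "went" and "feet" as separate entries, which matches your requirement.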
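
And a sketch of point 3: saving the Counter to CSV and loading it back into a plain dict (paths are again placeholders):

```python
import csv

def save_csv(counts, path):
    """Write the frequency list as word,count rows, most frequent first."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(counts.most_common())

def load_csv(path):
    """Read the CSV back into a plain dict for O(1) lookups."""
    with open(path, encoding="utf-8") as f:
        return {word: int(count) for word, count in csv.reader(f)}

# save_csv(freqs, "frequencies.csv")
# freqs = load_csv("frequencies.csv")
# freqs.get("boy", 0)
```

If you only ever reload it from Python, `pickle.dump`/`pickle.load` on the Counter works just as well and preserves the object as-is.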
– Slam