
I have a list of strings in Python.

documents = ["Sentence1. Sentence2...", "Sentence1. Sentence2...", ...]

I want to remove stop words and count the occurrences of each word across all the strings combined. Is there a simple way to do it?

I am currently thinking of using CountVectorizer() from scikit-learn and then iterating over each word and combining the results.

coder hacker

2 Answers


If you don't mind installing a new Python library, I suggest you use gensim. The first tutorial does exactly what you ask:

# documents: your list of strings
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

You will then need to create the dictionary for your corpus of documents and create the bag-of-words.

from gensim import corpora

dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary for future use
print(dictionary)

You can then weight the results using tf-idf and run LDA quite easily afterwards.

Have a look at the first gensim tutorial.

Kirell
  • 9,228
  • 4
  • 46
  • 61

You haven't fully explained what you have in mind, but this may be what you're looking for:

import collections

counts = collections.Counter(' '.join(your_list).split())
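Since the question also asks for stop-word removal, the one-liner can be extended with a filter step. A sketch; the input list and stoplist contents here are assumptions for illustration:

```python
import collections

your_list = ["the cat sat", "the dog sat"]        # example input
stoplist = {'the', 'a', 'of', 'and', 'to', 'in'}  # illustrative stop words
words = [w for w in ' '.join(your_list).lower().split() if w not in stoplist]
counts = collections.Counter(words)
# e.g. counts['sat'] == 2, and 'the' is absent (filtered by the stoplist)
```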
Malik Brahimi