1

I have a huge amount of sentences (just a bit over 100,000). Each one contains on average 10 words. I am trying to put them together into one big list so I can us Counter from the collections library to show me the frequency each word occurs. What I'm doing currently is this:

from collections import Counter
words = []
for sentence in sentenceList:
    words = words + sentence.split()
counts = Counter(words)

I was wondering if there is a way to do the same thing more efficiently. I've been waiting almost an hour now for this code to finish executing. I would think the concatenating is what is making this take so long since if I replace the line words = words + sentence.split() with print(sentence.split()) it finishes executing in seconds. Any help would be much appreciated.

user3204121
  • 59
  • 2
  • 7
  • Yes, you are using a quadratic time algorithm, because on each iteration, you rebuild a whole new list. Instead, `.extend` or `.append` to words, which efficiently modifies the list in-place. – juanpa.arrivillaga Apr 05 '19 at 22:59
  • 1
    Each time you do `words + sentence.split()`, you are creating a new list with shallow copies of the items in the list. This is going to hurt your performance. – user3483203 Apr 05 '19 at 22:59
  • @user3483203 it does not make shallow copies of the items in the list, it does not make copies of the items in the list at all. – juanpa.arrivillaga Apr 05 '19 at 23:00
  • if frequency of word is require then why not use dictionary ? – sahasrara62 Apr 05 '19 at 23:02
  • 3
    @prashantrana `Counter` is a dictionary. – iz_ Apr 05 '19 at 23:03
  • @juanpa.arrivillaga I found where I picked up that idea from. Is [this](https://stackoverflow.com/questions/1720421/how-do-i-concatenate-two-lists-in-python#comment13144572_1720432) comment incorrect? Do you have a link to the documentation that clarifies? – user3483203 Apr 05 '19 at 23:07
  • @user3483203 in that context, I understand what they are say, but it is simply not using terminology correctly. The shallow/deep copy distinction applies to the *object being copied*, i.e. a list is either shallow or deep copied. In a deep copy, the items in the list are themselves copied recursively, in a shallow copy, the items in a list **are not copied**, that is what makes it a shallow copy. – juanpa.arrivillaga Apr 05 '19 at 23:09

2 Answers2

3

Don't build a big, memory-hogging list if all you want to do is to count the elements. Keep updating the Counter object with new iterables instead:

counts = Counter()
for sentence in sentenceList:
    counts.update(sentence.split())
blhsing
  • 91,368
  • 6
  • 71
  • 106
2

You can use extend:

from collections import Counter
words = []
for sentence in sentenceList:
    words.extend(sentence.split())
counts = Counter(words)

Or, a list comprehension like so:

words = [word for sentence in sentenceList for word in sentence.split()]

If you don't need words later, you can pass a generator into Counter:

counts = Counter(word for sentence in sentenceList for word in sentence.split())
iz_
  • 15,923
  • 3
  • 25
  • 40