
I have a dataset of approximately 5,000 different words across 5,000 lines:

Example of 2 lines

data = [["I", "am", "John"], ["Where", "is", "John","?"]]

What I want to do is count how many times each word appears:

result = {"I": 1, "am": 1, "John": 2, "Where":1, ...}

But I have no idea how to do it efficiently.

Any suggestions?

3 Answers


You can use a list comprehension (or a generator expression) with collections.Counter:

from collections import Counter
Counter([word for sentence in data for word in sentence])
# or even
Counter(word for sentence in data for word in sentence)
# so you don't create the list containing every word
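
For the two example sentences in the question, either form yields Counter({'John': 2, 'I': 1, 'am': 1, 'Where': 1, 'is': 1, '?': 1}), which you can use like a regular dict.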
ygorg

I'll give you a high-level algorithm. Let me know if you want the actual code.

  1. Create a dictionary called counts.
  2. Iterate over data.
  3. For each element in data, iterate over each string.
  4. For each string, check if that word is in counts. If it is, increment the count. Otherwise, set counts[word]=1.
  5. At the end, counts will have what you're looking for.

This takes O(n) time, since you visit each word only once, which is as efficient as this task can be done (see the sketch below).
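
A minimal sketch of this algorithm in Python, using the data list from the question:

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]

counts = {}                      # step 1: create a dictionary called counts
for sentence in data:            # step 2: iterate over data
    for word in sentence:        # step 3: iterate over each string
        if word in counts:       # step 4: increment an existing count...
            counts[word] += 1
        else:                    # ...or initialise it to 1
            counts[word] = 1

# step 5: counts now maps each word to its frequency
# {'I': 1, 'am': 1, 'John': 2, 'Where': 1, 'is': 1, '?': 1}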

Raphael Koh

The good news is that there are a lot of convenience tools in the Python standard library.

import itertools
from collections import Counter

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]
result = Counter(itertools.chain(*data))
# result: Counter({'John': 2, 'I': 1, 'am': 1, 'Where': 1, 'is': 1, '?': 1})

The asterisk (*data) is syntax for unpacking an iterable into individual arguments. An example makes this clearer:

data = [1, 2, 3, 4, 5]
print(*data)
print(data[0], data[1], data[2], data[3], data[4])

The 2nd and 3rd lines are equivalent.
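
As a small aside (not part of the original answer), itertools.chain.from_iterable does the same flattening lazily, without unpacking data into separate arguments:

import itertools
from collections import Counter

data = [["I", "am", "John"], ["Where", "is", "John", "?"]]
result = Counter(itertools.chain.from_iterable(data))
# result: Counter({'John': 2, 'I': 1, 'am': 1, 'Where': 1, 'is': 1, '?': 1})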

Bi Ao