I have a paragraph of text. I want to count every two-word combination in it, where the two words have to be next to each other. For example:

"I have 2 laptops, I have 2 chargers"

The result should be:

"I have": 2
"have 2": 2
"2 laptops": 1
"laptops, I": (don't count)
"2 chargers": 1

I tried a regex, but the problem is that it never uses the same word in two matches.

I used: \b[a-z]{1,20}\b \b[a-z]{1,20}\b

Text: cold chain, energy storage device, industrial cooling system

It almost works, but it doesn't include pairs such as "storage device" and "cooling system", because their first words were already consumed by "energy storage" and "industrial cooling".

Appreciate your advice

ComplicatedPhenomenon
    Possible duplicate of [Iterate a list as pair (current, next) in Python](https://stackoverflow.com/questions/5434891/iterate-a-list-as-pair-current-next-in-python) and [How to count the frequency of the elements in a list?](https://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-a-list) – wwii Jul 21 '19 at 14:10

1 Answer

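You can use zip to pair each word with the word that follows it, then use Counter to count how often each pair occurs. The filter in the comprehension drops any pair whose first word ends with a comma.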

>>> from collections import Counter
>>> text = "I have 2 laptops, I have 2 chargers"
>>> words = text.split()

>>> # Count adjacent pairs; skip any pair whose first word ends with a comma
>>> d = {' '.join(pair): n for pair, n in Counter(zip(words, words[1:])).items() if not pair[0].endswith(',')}
>>> print(d)
{'I have': 2, 'have 2': 2, '2 laptops,': 1, '2 chargers': 1}

>>> import json
>>> print(json.dumps(d, indent=4))
{
    "I have": 2,
    "have 2": 2,
    "2 laptops,": 1,
    "2 chargers": 1
}
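
Note that the result above still carries the trailing comma in "2 laptops,". One possible variant, sketched here as an assumption rather than as part of the original answer, is to split the text on commas first and count pairs inside each piece, so that no pair can span a comma and no key keeps one. The function name count_adjacent_pairs is hypothetical.

```python
from collections import Counter

def count_adjacent_pairs(text):
    """Count two-word phrases, never pairing words across a comma."""
    counts = Counter()
    # Each comma-separated piece is handled on its own
    for segment in text.split(','):
        words = segment.split()
        # zip(words, words[1:]) yields every (current, next) word pair
        counts.update(' '.join(pair) for pair in zip(words, words[1:]))
    return counts

pairs = count_adjacent_pairs("I have 2 laptops, I have 2 chargers")
print(dict(pairs))
# {'I have': 2, 'have 2': 2, '2 laptops': 1, '2 chargers': 1}
```

This sketch only treats commas as separators; other punctuation, such as full stops, would need the same handling.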
Sunitha
  • Hi, thanks for the reply. I tried your approach. It works, but how can we exclude pairs like "laptops, I" that contain a comma? I have to go through thousands of lines to compile statistics on 2-word combinations, and it's important to exclude those with a comma in between – Duy Nguyen Jul 21 '19 at 14:36
  • By the way, a sample text from my work looks like: " old chain, energy storage device, industrial cooling system, industrial iot, intelligent storage, managed service platform provider, power management, refrigerant, thermal energy storage". So if I use your approach, a lot of 2-word phrases with a comma in between or at the end of a word will appear, and I want to exclude those results – Duy Nguyen Jul 21 '19 at 14:39
  • I have updated the answer to exclude words with a comma – Sunitha Jul 21 '19 at 14:53
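
The regex from the question misses "storage device" because regex matches do not overlap: once "energy storage" is matched, "storage" is no longer available. A lookahead is one way around this, since it matches without consuming any text. The following is a minimal sketch under the question's own assumptions (lowercase words of 1 to 20 letters separated by a single space); the name PAIR is hypothetical.

```python
import re
from collections import Counter

# The lookahead (?=...) consumes nothing, so the next search can start
# at the very next word and overlapping pairs are all reported.
PAIR = re.compile(r'\b(?=([a-z]{1,20} [a-z]{1,20})\b)')

text = "cold chain, energy storage device, industrial cooling system"

# With one capture group, findall returns the captured pair for each match
pair_counts = Counter(PAIR.findall(text))
print(pair_counts)
```

Because the pattern requires a single space between the two words, a pair such as "chain, energy" is never matched. Digits and uppercase letters are not matched either, so the character class would need widening for text like "I have 2 laptops".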