from collections import Counter
input = 'file.txt'

CounterWords = {}
words = {}
with open(input,'r', encoding='utf-8-sig') as fh:
  for line in fh:
    word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
    for word in word_list:
      if len(word) < 6:
          continue
      elif word not in CounterWords:
          CounterWords[word] = 1
      else:
          CounterWords[word] = CounterWords[word] + 1
N = 50

top_words = Counter(CounterWords).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))

At the moment I am able to select the most frequent words with more than X characters.

The program should also screen the text and count word pairs and phrases like:

"climate finance" "market failure" "Paris 2015"

The minimum number of characters per single string should still be enforced, to prevent results such as "I and".
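
For concreteness, here is a minimal sketch of the kind of pair counting I have in mind (the file name, the threshold of 6, and applying the length filter to the combined pair rather than to each word are only illustrative assumptions):

from collections import Counter

MIN_LEN = 6  # combined minimum length: drops "I and" but keeps "Paris 2015"

with open('file.txt', 'r', encoding='utf-8-sig') as fh:
    text = fh.read().replace(',', '').replace('\'', '').replace('.', '').lower()

words = text.split()
# adjacent word pairs, e.g. "climate finance"
pairs = [' '.join(p) for p in zip(words, words[1:]) if len(' '.join(p)) >= MIN_LEN]

for pair, frequency in Counter(pairs).most_common(50):
    print("%s %d" % (pair, frequency))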

MichaelKo
  • get `word_list[i:i+1]` and work with this. – furas Oct 17 '19 at 07:18
  • Maybe first you should remove all short words, so-called "stopwords". You can use the stopwords from the [NLTK](http://www.nltk.org/book/) module. See: [Stopword removal with NLTK](https://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) – furas Oct 17 '19 at 07:23
  • "get word_list[i:i+1]" looks easy. Where do I have to put it? Within the loop or at the beginning? – MichaelKo Oct 17 '19 at 08:05
  • create new for-loop `for i in range(len(word_list)-1): word_list[i:i+1]` – furas Oct 17 '19 at 08:20
  • I do: `for word in range(len(word_list)-1): word_list[word:word+1]` (instead of `for word in word_list:` from the original code above, line 9). I got an error for the next line `if len(word) < 6:` -> **TypeError: object of type 'int' has no len()** – MichaelKo Oct 17 '19 at 08:43
  • You are wrong. I used the name `i` because it keeps an index into `word_list`, not a word. You have two words in `first, second = word_list[i:i+1]` – furas Oct 17 '19 at 08:47
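
What the comments describe is iterating over adjacent words by index. Note that the slice has to span two elements, i.e. `word_list[i:i+2]`, otherwise unpacking into two names fails. A minimal sketch with made-up data:

word_list = ['climate', 'finance', 'and', 'market', 'failure']

for i in range(len(word_list) - 1):
    first, second = word_list[i:i + 2]  # two adjacent words
    print(' '.join((first, second)))    # climate finance, finance and, ...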

1 Answer


You can simply use `your_file_content.count(your_string)`:

from collections import Counter
import itertools

input = 'D:\\file.txt'

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

CounterWords = {}
CounterPairs = {}
with open(input, 'r', encoding='utf-8-sig', errors='ignore') as fh:
    file_content = fh.read().replace('\n', ' ')

# clean and lowercase once, so that .count() matches the lowercased words
# (note: .count() matches substrings, so "climate" also matches "climates")
clean_content = file_content.replace(',', '').replace('\'', '').replace('.', '').lower()
word_list = clean_content.split()

# build the pairs before removing duplicates, so only truly adjacent words are paired
word_pairs_list = list(dict.fromkeys(pairwise(word_list)))
word_list = list(dict.fromkeys(word_list))  # remove duplicate words

for word in word_list:
    if len(word) < 6:
        continue
    CounterWords[word] = clean_content.count(word)

for pair in word_pairs_list:
    CounterPairs[pair] = clean_content.count(' '.join(pair))

N = 50

# for all single words:
top_words = Counter(CounterWords).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))

# for all pairs:
top_pairs = Counter(CounterPairs).most_common(N)
for pair, frequency in top_pairs:
    print("%s %d" % (' '.join(pair), frequency))

# for a specific pair:
print("\n%s %d" % ('climate finance', CounterPairs.get(('climate', 'finance'), 0)))

The `pairwise` function is taken from: Iterate a list as pair (current, next) in Python
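
For illustration, a quick check of what `pairwise` yields on a small list (the example words are arbitrary):

print(list(pairwise(['climate', 'finance', 'market', 'failure'])))
# [('climate', 'finance'), ('finance', 'market'), ('market', 'failure')]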

pymym213
  • Thx, I am interested in a generic code. At the moment I do not know which terms appear most heavily. Do you have an idea? – MichaelKo Oct 17 '19 at 07:53
  • What do you mean by terms? Is "Paris 2015" a term? You called them words in your question. Please clarify. – pymym213 Oct 17 '19 at 08:16
  • I screen a document with 100,000 words and I want to know: first, which words (e.g. "climate") and, second, which word pairs (e.g. "climate finance") are the most common in the text. – MichaelKo Oct 17 '19 at 08:21
  • First, use nltk as @furas suggested. Second, make sure to properly generate single and pair words. Finally, use `.count()` and `Counter` to analyse all this. – pymym213 Oct 17 '19 at 08:46
  • Some help for generating pair words from a list of words: https://stackoverflow.com/a/5434936/4374588 – pymym213 Oct 17 '19 at 08:48
  • What I get is **(current item, next item)**, but I want to see the frequency of specific word pairs that belong together (e.g. "climate finance"). – MichaelKo Oct 17 '19 at 13:34
  • Example: when I run the loop `for word in word_list: if len(word) < 6: ...`, it should count word pairs that appear multiple times and merge the two words into one string. Does someone know a trick? – MichaelKo Oct 17 '19 at 13:38
  • You just do the same thing you did for single words, using `' '.join((current_item, next_item))`, and append it to `sentences_to_be_counted`. – pymym213 Oct 17 '19 at 13:40
  • Now the code in my (edited) answer works fine for singles and pairs. Don't forget to correctly remove stopwords and such from your `word_list` instead of testing if the length is < 6. – pymym213 Oct 17 '19 at 14:22
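
Following the advice in the comments, here is a minimal sketch of the stopword-based variant (the file name is assumed, and `nltk.download('stopwords')` must have been run once beforehand):

from collections import Counter
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

with open('file.txt', 'r', encoding='utf-8-sig') as fh:
    text = fh.read().replace(',', '').replace('\'', '').replace('.', '').lower()

# drop stopwords; note that a pair may then span a removed word
words = [w for w in text.split() if w not in stop]

top_words = Counter(words).most_common(50)
top_pairs = Counter(' '.join(p) for p in zip(words, words[1:])).most_common(50)

print(top_words[:5])
print(top_pairs[:5])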