from collections import Counter
input = 'file.txt'

CounterWords = {}
words = {}
with open(input,'r', encoding='utf-8-sig') as fh:
  for line in fh:
    word_list = line.replace(',','').replace('\'','').replace('.','').lower().split()
    for word in word_list:
      if len(word) < 6:
          continue
      elif word not in CounterWords:
          CounterWords[word] = 1
      else:
          CounterWords[word] = CounterWords[word] + 1
N = 50

top_words = Counter(CounterWords).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))

At the moment I am able to select the most frequent words with more than X characters.

The program should also screen the text and count word pairs and phrases like:

"climate finance" "market failure" "Paris 2015"

The minimum number of characters per single string should still be enforced, to prevent results such as "I and".
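
For concreteness, here is a minimal sketch of the kind of pair counting I have in mind (the file name, the threshold of 6, and applying the length filter to the combined pair rather than to each word are only illustrative assumptions):

from collections import Counter

MIN_LEN = 6  # combined minimum length: drops "I and" but keeps "Paris 2015"

with open('file.txt', 'r', encoding='utf-8-sig') as fh:
    text = fh.read().replace(',', '').replace('\'', '').replace('.', '').lower()

words = text.split()
# adjacent word pairs, e.g. "climate finance"
pairs = [' '.join(p) for p in zip(words, words[1:]) if len(' '.join(p)) >= MIN_LEN]

for pair, frequency in Counter(pairs).most_common(50):
    print("%s %d" % (pair, frequency))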

MichaelKo
  • get `word_list[i:i+1]` and work with this. – furas Oct 17 '19 at 07:18
  • Maybe first you should remove all short words, so-called "stopwords". You can use the stopwords from the [NLTK](http://www.nltk.org/book/) module. See: [Stopword removal with NLTK](https://stackoverflow.com/questions/19130512/stopword-removal-with-nltk) – furas Oct 17 '19 at 07:23
  • "get word_list[i:i+1]" looks easy. Where do I have to put it? Within the loop or at the beginning? – MichaelKo Oct 17 '19 at 08:05
  • create new for-loop `for i in range(len(word_list)-1): word_list[i:i+1]` – furas Oct 17 '19 at 08:20
  • I do: `for word in range(len(word_list)-1): word_list[word:word+1]` (instead of `for word in word_list:` from the original code above, line 9). I got an error for the next line `if len(word) < 6:` -> **TypeError: object of type 'int' has no len()** – MichaelKo Oct 17 '19 at 08:43
  • You are wrong. I used the name `i` because it keeps an index into `word_list`, not a word. You have two words in `first, second = word_list[i:i+1]` – furas Oct 17 '19 at 08:47
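
What the comments describe is iterating over adjacent words by index. Note that the slice has to span two elements, i.e. `word_list[i:i+2]`, otherwise unpacking into two names fails. A minimal sketch with made-up data:

word_list = ['climate', 'finance', 'and', 'market', 'failure']

for i in range(len(word_list) - 1):
    first, second = word_list[i:i + 2]  # two adjacent words
    print(' '.join((first, second)))    # climate finance, finance and, ...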

1 Answer


You can simply use `your_file_content.count(your_string)`:

from collections import Counter
import itertools

input = 'D:\\file.txt'

def pairwise(iterable):
    # s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

CounterWords = {}
CounterPairs = {}
with open(input, 'r', encoding='utf-8-sig', errors='ignore') as fh:
    file_content = fh.read().replace('\n', ' ')

# clean and lowercase once, so that .count() matches the lowercased words
# (note: .count() matches substrings, so "climate" also matches "climates")
clean_content = file_content.replace(',', '').replace('\'', '').replace('.', '').lower()
word_list = clean_content.split()

# build the pairs before removing duplicates, so only truly adjacent words are paired
word_pairs_list = list(dict.fromkeys(pairwise(word_list)))
word_list = list(dict.fromkeys(word_list))  # remove duplicate words

for word in word_list:
    if len(word) < 6:
        continue
    CounterWords[word] = clean_content.count(word)

for pair in word_pairs_list:
    CounterPairs[pair] = clean_content.count(' '.join(pair))

N = 50

# for all single words:
top_words = Counter(CounterWords).most_common(N)
for word, frequency in top_words:
    print("%s %d" % (word, frequency))

# for all pairs:
top_pairs = Counter(CounterPairs).most_common(N)
for pair, frequency in top_pairs:
    print("%s %d" % (' '.join(pair), frequency))

# for a specific pair:
print("\n%s %d" % ('climate finance', CounterPairs.get(('climate', 'finance'), 0)))

The `pairwise` function is taken from: Iterate a list as pair (current, next) in Python
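
For illustration, a quick check of what `pairwise` yields on a small list (the example words are arbitrary):

print(list(pairwise(['climate', 'finance', 'market', 'failure'])))
# [('climate', 'finance'), ('finance', 'market'), ('market', 'failure')]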

pymym213
  • Thx, I am interested in a generic code. At the moment I do not know which terms appear most heavily. Do you have an idea? – MichaelKo Oct 17 '19 at 07:53
  • What do you mean by terms? Is "Paris 2015" a term? You called them words in your question. Please clarify. – pymym213 Oct 17 '19 at 08:16
  • I screen a document with 100,000 words and I want to know: first, which words (e.g. "climate") and, second, which word pairs (e.g. "climate finance") are the most common in the text. – MichaelKo Oct 17 '19 at 08:21
  • First, use nltk as @furas suggested. Second, make sure to properly generate single and pair words. Finally, use `.count()` and `Counter` to analyse all this. – pymym213 Oct 17 '19 at 08:46
  • Some help for generating pair words from a list of words: https://stackoverflow.com/a/5434936/4374588 – pymym213 Oct 17 '19 at 08:48
  • What I get is **(current item, next item)**, but I want to see the frequency of specific word pairs that belong together (e.g. "climate finance"). – MichaelKo Oct 17 '19 at 13:34
  • Example: when I run the loop `for word in word_list: if len(word) < 6: ...`, it should count word pairs that appear multiple times and merge the two words into one string. Does someone know a trick? – MichaelKo Oct 17 '19 at 13:38
  • You just do the same thing you did for single words, using `' '.join((current_item, next_item))`, and append it to `sentences_to_be_counted`. – pymym213 Oct 17 '19 at 13:40
  • Now the code in my (edited) answer works fine for singles and pairs. Don't forget to correctly remove stopwords and such from your `word_list` instead of testing if the length is < 6. – pymym213 Oct 17 '19 at 14:22
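
Following the advice in the comments, here is a minimal sketch of the stopword-based variant (the file name is assumed, and `nltk.download('stopwords')` must have been run once beforehand):

from collections import Counter
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

with open('file.txt', 'r', encoding='utf-8-sig') as fh:
    text = fh.read().replace(',', '').replace('\'', '').replace('.', '').lower()

# drop stopwords; note that a pair may then span a removed word
words = [w for w in text.split() if w not in stop]

top_words = Counter(words).most_common(50)
top_pairs = Counter(' '.join(p) for p in zip(words, words[1:])).most_common(50)

print(top_words[:5])
print(top_pairs[:5])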