I have two text files. One contains the whole text (text1) and the other contains the unique words in text1. I need to calculate a monogram, i.e. the frequency of each unique word in the text, and then write it to a file. I've already tried this:

def countwords(mytext):
    import codecs
    file = codecs.open(mytext, 'r', 'utf_8')
    count = 0
    mytext = file.readlines()
    for line in mytext:
        words = line.split()
        for word in words:
            count = count + 1
    file.close()
    return(count)

def CalculateMonoGram(path, lex):
    fid = open(path, 'r', encoding='utf_8')
    mypath = fid.read().split()
    fid1 = open(lex, 'r', encoding='utf_8')
    mylex = fid1.read().split()
    for word1 in mylex:
        if word1 in mypath:
            x = dict((word1, mypath.count(word1)) for word1 in mylex)
        for value in x:
            monogram = '\t' + str(value / countwords(lex))
            table.write(monogram)
rypel
m.khodakarami
  • Hi, welcome to Stack Overflow! Could you edit your question so that all the code is inside the code block? That would make it cleaner. Could you also describe the problem with the code you already have, e.g. is it giving an error or a wrong result? What are the expected results? – koukouviou May 08 '15 at 05:13
  • Hi, thanks. I need to count the number of times a unique word is repeated. – m.khodakarami May 08 '15 at 05:21
  • I think [collections.Counter](https://docs.python.org/2/library/collections.html#collections.Counter) is what you are looking for; check this answer: http://stackoverflow.com/a/5829377/3045022 – koukouviou May 08 '15 at 05:24
  • Hi, thanks. I need to count the number of times each unique word is repeated in the whole text, and then divide that number by len(uniquewords) each time. The problem with my code is that the count is 1 all the time. I think I should use a while statement, but I don't know how. – m.khodakarami May 08 '15 at 05:28
  • What do your files look like? Do they have a single line in them? Are the words separated by commas or spaces? You need to provide more information (and fix the indentation of your code) if you want to stop getting downvotes and get some help. – koukouviou May 08 '15 at 05:33
  • Can you help me more, please? – m.khodakarami May 08 '15 at 06:55
  • The file is a formal text, separated by spaces, and it is a corpus, not a single line. I am afraid I corrected the indentation. – m.khodakarami May 08 '15 at 07:55
  • You made my week (: I could finally correct it. I'll see the results in an hour and show you the code. – m.khodakarami May 08 '15 at 09:29
  • If you find some time, please take a look at the [Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/), known as PEP 8. Spaces around operators greatly improve readability, and indentation is syntactically relevant. – rypel May 08 '15 at 10:16

1 Answer

You can use collections.Counter and re.sub:

import re
import collections

with open("input.txt") as f1, open("sub_input.txt") as f2:
    pattern = "[^a-zA-Z]"  # matches every non-letter character
    # Strip non-letters from each word in the text, then count occurrences
    frequencies = collections.Counter(
        re.sub(pattern, "", word) for line in f1 for word in line.split()
    )
    # Look up the count of each word listed in the second file
    print([frequencies[word] for line in f2 for word in line.split()])

The above prints [4, 2] for input.txt:

asd,
asd. lkj lkj  sdf
sdf .asd  wqe qwe kl
dsf asd,. wqe

and sub_input.txt:

asd sdf

Breaking it down in case the code is unclear:

  • collections.Counter(iterable) constructs an unordered collection with elements from the iterable as dictionary keys and the number of times they occur as dictionary values.
  • The regex pattern [^a-zA-Z] matches any character that is not in the range a-z or A-Z. re.sub(pattern, substitute, string) replaces every substring of string matched by pattern with substitute. In this case, it replaces all non-letter characters with the empty string.
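
If you also need the division step described in the question, the Counter approach can be extended along these lines. This is only a minimal sketch: it assumes Python 3 and whitespace-separated words, and it divides each count by the total number of words in the corpus. The names monogram and write_monogram and the path parameters are hypothetical, and the denominator can be swapped for the number of unique words if that is what you actually want.

```python
import collections


def monogram(corpus_words, lexicon_words):
    """Return {word: count / total} for every word in lexicon_words.

    Assumption: the denominator is the total number of words in the corpus.
    """
    counts = collections.Counter(corpus_words)  # one pass over the corpus
    total = len(corpus_words)
    if total == 0:
        return {}
    # Counter returns 0 for words that never occur, so no KeyError here
    return {word: counts[word] / total for word in lexicon_words}


def write_monogram(corpus_path, lexicon_path, out_path):
    """Hypothetical helper: read both files and write 'word<TAB>frequency' lines."""
    with open(corpus_path, encoding="utf_8") as f:
        corpus_words = f.read().split()
    with open(lexicon_path, encoding="utf_8") as f:
        lexicon_words = f.read().split()
    freqs = monogram(corpus_words, lexicon_words)
    with open(out_path, "w", encoding="utf_8") as out:
        for word, freq in freqs.items():
            out.write(f"{word}\t{freq}\n")
```

Counting once with Counter avoids calling list.count for every unique word, which rescans the whole corpus each time.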
EvenLisle
  • Excuse me, but I don't need to replace anything with the empty string. – m.khodakarami May 08 '15 at 07:56
  • @user4555103 it's just a precaution, as you have not specified how the input file is formatted nor provided a sample. – EvenLisle May 08 '15 at 08:05
  • mypath = # a corpus containing 20000 words; mylex = # all unique words in mypath (about 5000 words). I want to count the frequency of each unique word in mypath (the corpus), one by one. Then I have to divide that frequency by len(corpus). – m.khodakarami May 08 '15 at 08:21