
I have this code that is supposed to compare a positive corpus of words to a subject text. It was doing fine until I discovered that repeated words in the text are not counted.

Text: this is a very good movie, it is so good

Positive List: good, better, etc.

The script only counted "good" once in the following implementation:

readFile = open('test.txt','r').read()
readFileList = readFile.split('\n')

counter = 0


for eachNeg in negWords:
    if eachNeg in readFile:
        counter -= 1
        print eachNeg
print counter


for eachPos in posWords:
    if eachPos in readFile:
        counter += 1
        print eachPos
print counter
ham-sandwich
Godwin

3 Answers


The code does exactly what you describe. You told Python to add 1 to the counter if the word occurs anywhere in the text, and a membership test returns True only once no matter how many times the word appears:

>>> 'a' in 'aaaabbbbccc'
True

You need another for loop to count every word:

for eachPos in posWords:
    # split() turns the text into a list of words
    for word in readFile.split():
        if eachPos == word:
            counter += 1
            print eachPos
print counter

Iterating over readFile directly would walk through it character by character, so split it into a list of words first with readFile.split(), as Bartlomiej Lewandowski mentioned. This is a really naive way of doing it.

I think there is another approach where you count the words first and then look up whether they are in your list. For that, look into Counter from the collections module; it is great for this kind of project.

https://stackoverflow.com/a/5829377/3863636
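A minimal sketch of that count-first idea, written in Python 3 syntax. The word lists and the sample text are made up for illustration, and stripping punctuation after splitting on whitespace is an assumption about how the file should be tokenized:

```python
from collections import Counter
import string

# Hypothetical word lists and text, for illustration only.
pos_words = {"good", "better"}
neg_words = {"bad", "worse"}
text = "this is a very good movie, it is so good"

# Lowercase, split on whitespace, and strip surrounding punctuation.
words = [w.strip(string.punctuation) for w in text.lower().split()]
word_counts = Counter(words)

# A Counter returns 0 for missing keys, so no membership check is needed.
pos_total = sum(word_counts[w] for w in pos_words)
neg_total = sum(word_counts[w] for w in neg_words)
score = pos_total - neg_total

print(pos_total, neg_total, score)
```

Counting once and then looking up each list word avoids rescanning the text for every entry in the lists.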

Maximilian Kindshofer

You could achieve this with a nested for loop; however, this isn't a great solution to a simple problem:

for posWord in posWords:
    for test in readFile.split():
        if posWord == test:
            counter += 1
            print(posWord)
print(counter)

This isn't an effective approach to analysing sentiment; you are just checking whether a context-free positive word exists in the text, which doesn't tell you much. The approach ignores constructions that are common in everyday language, such as negation and double negatives ("not good" would still count as positive). Also, it doesn't look like you are filtering out stop words from the text or stemming words. See Stemming Algorithms.
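To make the point about negation and stop words concrete, here is a rough sketch in Python 3. The stop-word, positive-word, and negation lists are hypothetical, and the rule "a negation flips the next non-stop word" is a simplifying assumption, not a recommended method. Stemming is left out entirely.

```python
import re

# Hypothetical lists, for illustration only.
STOP_WORDS = {"this", "is", "a", "it", "so", "the"}
POS_WORDS = {"good", "better"}
NEGATIONS = {"not", "never", "no"}

def tokenize(text):
    """Lowercase the text and return runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def score(text):
    """Count positive words, flipping the sign right after a negation."""
    total = 0
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True
            continue
        if token in STOP_WORDS:
            continue  # stop words do not clear a pending negation
        if token in POS_WORDS:
            total += -1 if negate else 1
        negate = False  # the negation only reaches the next non-stop word
    return total

print(score("this is a very good movie, it is so good"))
print(score("this is not a good movie"))
```

Even this small rule is fragile: "not very good" would still score as positive here, which is part of why word counting alone says little about sentiment.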

Sentiment analysis should be grounded in statistics. Structure-based approaches tend to be less useful than semantic ones, although this is probably up for debate. Further, it is usually framed as a supervised learning problem: classifying text (binary or multiclass) into predefined categories such as positive or negative. A typical approach to sentiment analysis is a Naive Bayes classifier, although more effective / powerful methods have been proposed (SVM, Hidden Markov Models, and so on). See notable resource 2.
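For a feel of what the Naive Bayes approach involves, here is a toy multinomial Naive Bayes classifier with add-one smoothing in plain Python 3. It deliberately avoids NLTK so it stays self-contained; the function names and the four training sentences are invented for illustration, and a real classifier would need far more labelled data.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs. Returns the counts used by predict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior: share of training texts carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, for illustration only.
training = [
    ("a very good movie", "pos"),
    ("good acting and a better plot", "pos"),
    ("a bad movie", "neg"),
    ("terrible plot and bad acting", "neg"),
]
model = train(training)
print(predict("so good", model))
print(predict("bad plot", model))
```

Unlike a fixed word list, the weights here come from labelled examples, so the classifier learns which words signal each category.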

Final Notes

Although I don't really work with sentiment analysis unless I'm trying to make my life easier or complement something I'm already doing, I do research a couple of topics in Natural Language Processing. I strongly believe that the academic domain has far surpassed the efforts of the commercial arena; in fact, some of the results / conclusions / prices companies are generating are laughable, and I have yet to come across a decent implementation. If you would like to learn more about this area, I recommend reading academic journals published by IEEE & ACM.

Notable Resources:

  1. Python NLTK - Natural Language Tool Kit
  2. Twitter Sentiment Analysis using Python & NLTK
  3. Sentiment Analysis & Opinion Mining
ham-sandwich

You are checking whether each word in posWords and negWords is contained anywhere in the whole file. That is why you only get 1 for each distinct word, no matter how often it appears.

What you want to do is to loop through all the words in your file and see if they are included in the good/bad list.

To get a list of words from the file you can use split() without any parameters.

So for the negative words it would look like this:

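# Read the whole file and split it into words on any whitespace
readFile = open('test.txt', 'r').read()
readFileList = readFile.split()

counter = 0

# Check every word in the file, so repeated words are counted each time
for word in readFileList:
    if word in negWords:
        counter -= 1
        print word
print counter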
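The same loop can cover both lists in a single pass. The sketch below uses Python 3 syntax; the function name `sentiment_count`, the sample word lists, and the punctuation stripping are assumptions added for illustration, since plain split() would leave a word like "good," with its comma attached.

```python
# Hypothetical word lists; in the question they are loaded elsewhere.
pos_words = {"good", "better"}
neg_words = {"bad", "worse"}

def sentiment_count(text, pos_words, neg_words):
    """Add 1 per positive word and subtract 1 per negative word."""
    counter = 0
    for word in text.split():
        # Drop surrounding punctuation and ignore case before the lookup.
        word = word.strip(".,!?;:").lower()
        if word in pos_words:
            counter += 1
        elif word in neg_words:
            counter -= 1
    return counter

print(sentiment_count("this is a very good movie, it is so good",
                      pos_words, neg_words))
```

Storing the word lists as sets keeps each membership test cheap even when the lists are long.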
Bartlomiej Lewandowski