I want to create a unigram and bigram count matrix for a text file, along with a class variable, and write it to a CSV file using Python. The text file contains two columns, which look like this:

Text                                                  Class
I love the movie                                      Pos
I hate the movie                                      Neg

I want the unigram and bigram counts for the text column, and the output should be written to a CSV file:

I     hate      love        movie   the        class
1     0         1           1       1          Pos
1     1         0           1       1          Neg

Bigram

I love     love the     the movie     I hate    hate the         class
1            1              1         0          0               Pos
0            0              1         1          1               Neg

Can anybody help me improve the code below so that it produces the output format shown above?

import nltk
from collections import Counter

fo = open("text.txt")
fo1 = fo.readlines()
for line in fo1:
    bigm = list(nltk.bigrams(line.split()))
    bigmC = Counter(bigm)
    for key, value in bigmC.items():
        print(key, value)

('love', 'the') 1
('the', 'movie') 1
('I', 'love') 1
('I', 'hate') 1
('hate', 'the') 1
('the', 'movie') 1
Steffi Keran Rani J
Ashok Kumar Jayaraman

1 Answer

I have made your input file a little more detailed so that you can see that the solution works:

I love the movie movie
I hate the movie
The movie was rubbish
The movie was fantastic

The first line contains a word twice, because otherwise you can't tell that the counter is actually counting properly.

The solution:

import csv
from collections import Counter

import nltk

# Read all lines up front; they are needed for two passes.
with open("text.txt") as fo:
    fo1 = fo.readlines()

# First pass: collect the whole 'population' of words and bigrams in your document.
counter_sum = Counter()
for line in fo1:
    tokens = nltk.word_tokenize(line)
    bigrams = list(nltk.bigrams(line.split()))
    counter_sum += Counter(bigrams) + Counter(tokens)

# Second pass: now that we have the population, we can write the csv.
with open('unigrams_and_bigrams.csv', 'w', newline='') as csvfile:
    # Sorting on the type name groups unigrams (str) ahead of bigrams (tuple).
    header = sorted(counter_sum, key=lambda x: str(type(x)))
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    for line in fo1:
        tokens = nltk.word_tokenize(line)
        bigrams = list(nltk.bigrams(line.split()))
        both_counters = Counter(bigrams) + Counter(tokens)
        # A Counter returns 0 for keys it has not seen, so absent n-grams get 0.
        row = {element: both_counters[element] for element in header}
        writer.writerow(row)

So, I built on your initial approach. You did not say whether you wanted the bigrams and unigrams in separate CSVs, so I assumed you wanted them together; it would not be too hard to reprogram otherwise. Accumulating a population in this way is probably better done with tools already built into the NLP libraries, but it is interesting to see that it can be done at a lower level. I'm using Python 3, by the way; you may need to change some things, such as the use of list, to make it work in Python 2.
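If you do want the unigrams and bigrams in separate CSVs, something along these lines might work. This is only a sketch: the helpers `ngrams`, `count_matrix` and `write_matrix` are names I made up, it tokenizes with plain `str.split` instead of `nltk.word_tokenize` so that it runs without NLTK, and it writes to an in-memory buffer where you would open a real file.

```python
import csv
import io
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_matrix(lines, n):
    """Return (header, rows) of n-gram counts, one row per input line."""
    per_line = [Counter(ngrams(line.split(), n)) for line in lines]
    # The header is the sorted population of n-grams over all lines.
    header = sorted(set().union(*per_line))
    rows = [[counts[gram] for gram in header] for counts in per_line]
    return header, rows


def write_matrix(csvfile, header, rows):
    """Write the matrix to an open file, joining each n-gram with spaces."""
    writer = csv.writer(csvfile)
    writer.writerow([" ".join(gram) for gram in header])
    writer.writerows(rows)


lines = ["I love the movie movie", "I hate the movie"]
unigram_header, unigram_rows = count_matrix(lines, 1)
bigram_header, bigram_rows = count_matrix(lines, 2)

buffer = io.StringIO()  # stands in for open('bigrams.csv', 'w', newline='')
write_matrix(buffer, bigram_header, bigram_rows)
print(buffer.getvalue())
```

Calling `write_matrix` once per matrix, each time with a different output file, gives one CSV per n-gram order.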

Some interesting references I used: this one on summing counters, which was new to me. Also, I had to ask a question to get your bigrams and unigrams grouped at separate ends of the CSV.
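To show those two tricks in isolation, here is a small sketch with made-up counts (the variable names are just for illustration): adding two Counter objects sums their counts, and sorting the keys on their type name keeps the unigrams and bigrams apart.

```python
from collections import Counter

# Made-up per-line counts mixing unigram (str) and bigram (tuple) keys.
line_one = Counter({"movie": 2, ("the", "movie"): 1})
line_two = Counter({"movie": 1, ("I", "hate"): 1})

# Adding Counters sums the counts key by key.
total = line_one + line_two

# "<class 'str'>" sorts before "<class 'tuple'>", so unigrams come first.
# sorted() is stable, so keys of the same type keep their existing order.
header = sorted(total, key=lambda key: str(type(key)))

print(total["movie"])
print(header)
```

Note that this sort only groups by type; it does not put the words themselves in alphabetical order.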

I know the code looks repetitive, but you need to run through all your lines first to get the headers for the csv before you can start writing it.

Here is the output in LibreOffice:

[image of csv output]

Your CSV is going to get very wide as it collects all the unigrams and bigrams. If you really want the bigrams without brackets and commas in the headers, you can write a small function to do that. It is probably better to leave them as tuples, though, in case you need to parse them back into Python at some point, and they are just as readable.
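If you do go for plain headers, a function like the following could do it. It is a minimal sketch, and `column_label` is a name I chose for illustration.

```python
def column_label(key):
    """Return a plain-text column label for a unigram (str) or bigram (tuple) key."""
    if isinstance(key, tuple):
        return " ".join(key)
    return key


header = ["I", "love", ("I", "love"), ("love", "the")]
labels = [column_label(key) for key in header]
print(labels)
```

To use it with csv.DictWriter you would pass the labels as fieldnames and also key each row dict by label rather than by the original tuple.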

You didn't include the code that generates the class column. Assuming you have it, you can append the string 'Class' to header before the header is written to the CSV to create that column. To populate it, add

row['Class'] = sentiment

on the second-to-last line, just before the row is written.
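Put together, that could look roughly like the sketch below. I am assuming that each input line ends with its class label as the last whitespace-separated token, which may not match your file; `split_label` and `build_table` are hypothetical helpers, the tokenizing is plain `str.split` rather than NLTK, and an in-memory buffer stands in for the output file.

```python
import csv
import io
from collections import Counter


def split_label(line):
    """Split a line into (tokens, label), assuming the label is the last token."""
    *tokens, label = line.split()
    return tokens, label


def build_table(lines):
    """Return (header, rows): unigram and bigram counts plus a 'Class' column."""
    counted = []
    for line in lines:
        tokens, label = split_label(line)
        counts = Counter(tokens) + Counter(zip(tokens, tokens[1:]))
        counted.append((counts, label))
    population = set().union(*(counts for counts, _ in counted))
    # Unigrams (str) first, then bigrams (tuple), alphabetical within each group.
    header = sorted(population, key=lambda key: (isinstance(key, tuple), key))
    rows = []
    for counts, label in counted:
        row = {key: counts[key] for key in header}
        row["Class"] = label
        rows.append(row)
    return header + ["Class"], rows


lines = ["I love the movie Pos", "I hate the movie Neg"]
header, rows = build_table(lines)

buffer = io.StringIO()  # stands in for open('unigrams_and_bigrams.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

A blank line, or a header line such as "Text Class", would need to be skipped before calling `build_table`, since `split_label` expects at least one token per line.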

cardamom