I have a list of strings with repeating values. I want to build a dictionary in which each key is a word and its value is that word's frequency count, and then write the words and their counts to a CSV file.

This is my approach so far:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()

#print tknzr.tokenize(s0)

with open("dispn.csv","r") as file1,\
     open("dispn_tokenized.csv","w") as file2,\
     open("dispn_tokenized_count.csv","w") as file3:

     mycsv = list(csv.reader(file1))

     words = []
     words_set = []
     tokenize_count = {}
     for row in mycsv:
         
         lst = tknzr.tokenize(row[2])
         for l in lst:
             file2.write("\""+str(row[2])+"\""+","+"\""+str(l.encode('utf-8'))+"\""+"\n")
             l = l.lower()
             words.append(l)
     words_set = list(set(words))
     print "len of words_set : " + str(len(words_set))
     # Start every count at zero so each occurrence is counted exactly once
     for word in words_set:
        tokenize_count[word] = 0

     for word in words:
        tokenize_count[word] = tokenize_count[word] + 1

     print "len of tokenized words_set : " + str(len(tokenize_count))

     #print "Tokenized_words count : "
     #print tokenize_count
     #print "================================================================="
                         
     # Number the rows so the first CSV column is a running index
     for i, wrd in enumerate(words_set):
       print "i : " + str(i)
       file3.write("\""+str(i)+"\""+","+"\""+str(wrd.encode('utf-8'))+"\""+","+"\""+str(tokenize_count[wrd])+"\""+"\n")

But in the CSV I still found some repeating values such as 1, 5, 4, 7, 9.

Some info of the approach:

  • dispn.csv contains the usernames, which I tokenize with the nltk module.
  • After tokenizing, I store the words in the list 'words' and write each word alongside its username to a CSV file.
  • I then build a set from 'words' to get the unique values and store it in 'words_set'.
  • Finally, I create the dictionary 'tokenize_count' with each word as the key and its frequency count as the value, and write it to a CSV file.

Why am I getting only some of the numerical values repeated? Is there a better way to do this?

halfer
POOJA GUPTA
  • [`from collections import Counter`](https://docs.python.org/2/library/collections.html#collections.Counter) – R Nar Nov 19 '15 at 17:28
  • Possible duplicate of [How to count the frequency of the elements in a list?](http://stackoverflow.com/questions/2161752/how-to-count-the-frequency-of-the-elements-in-a-list) – Nir Alfasi Nov 19 '15 at 17:33
  • @RNar : can you pls post your comment as answer so that I will accept it ? thanks it solved my problem – POOJA GUPTA Nov 19 '15 at 17:37
  • @RNar : thanks .. :) but my approach was correct too.. i just saw that when i opened csv with excel, it was incorrectly interpreting the data because of which i felt my approach was incorrect. nevermind, i learned the new technique today as mine is naive. Thanks for making me learn that :) – POOJA GUPTA Nov 19 '15 at 17:56
  • I'm sure it was; this is just way easier. Why reinvent the wheel :) – R Nar Nov 19 '15 at 17:57

1 Answer

`from collections import Counter`

`Counter` can be called on a list of strings and returns a dict-like object whose keys are the words and whose values are their frequencies.
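As a minimal sketch of how this could replace the manual counting loop, the snippet below counts a small list and writes the result with `csv.writer`. The sample tokens are made up for illustration, and an in-memory buffer stands in for the output file; neither comes from the question's data. Using `csv.writer` rather than hand-built strings is a substitution on my part, chosen because it takes care of quoting.

```python
import csv
import io
from collections import Counter

# Hypothetical sample tokens standing in for the lowercased `words` list
words = ["john", "doe", "john", "smith", "doe", "john"]

# Counter tallies every distinct word in a single pass
tokenize_count = Counter(words)

# An in-memory buffer stands in for the output CSV file (assumption);
# QUOTE_ALL wraps every field in double quotes, like the original output
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)

# most_common() yields (word, count) pairs, highest count first
for i, (word, count) in enumerate(tokenize_count.most_common()):
    writer.writerow([i, word, count])

csv_text = buffer.getvalue()
```

With real data, the buffer would be swapped for a file opened for writing. If the file is later opened in Excel, keep in mind that Excel may reinterpret some fields, which is what made the original output look wrong.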

R Nar