I have a list of strings containing repeated values, and I want to build a dictionary where each key is a word and its value is that word's frequency count, and then write these words and their counts to a CSV file.
The following has been my approach:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import csv
from nltk.tokenize import TweetTokenizer
import numpy as np

tknzr = TweetTokenizer()

with open("dispn.csv", "r") as file1, \
     open("dispn_tokenized.csv", "w") as file2, \
     open("dispn_tokenized_count.csv", "w") as file3:
    mycsv = list(csv.reader(file1))
    words = []
    words_set = []
    tokenize_count = {}
    for row in mycsv:
        lst = tknzr.tokenize(row[2])
        for l in lst:
            file2.write("\"" + str(row[2]) + "\"" + "," + "\"" + str(l.encode('utf-8')) + "\"" + "\n")
            l = l.lower()
            words.append(l)
    words_set = list(set(words))
    print "len of words_set : " + str(len(words_set))
    for word in words_set:
        tokenize_count[word] = 1
    for word in words:
        tokenize_count[word] = tokenize_count[word] + 1
    print "len of tokenized words_set : " + str(len(tokenize_count))
    i = 0
    for wrd in words_set:
        #i = i+1
        print "i : " + str(i)
        file3.write("\"" + str(i) + "\"" + "," + "\"" + str(wrd.encode('utf-8')) + "\"" + "," + "\"" + str(tokenize_count[wrd]) + "\"" + "\n")
But in the output CSV I still found some repeating values, such as 1, 5, 4, 7, 9.
Some info on the approach:
- dispn.csv contains the usernames of the users, which I am tokenizing with the help of the NLTK module.
- After tokenizing them, I store the words in the list 'words' and write each word alongside its username to a CSV file.
- I then create a set from it to get the unique values out of the list 'words', stored in 'words_set'.
- Finally, I build the dictionary 'tokenize_count' with each word as key and its frequency count as value, and write the same to a CSV file.
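The counting step described above can also be sketched with `collections.Counter` from the standard library, which avoids the two-pass set-then-increment logic entirely (the nested-list input here is a hypothetical stand-in for the tokenizer's output):

```python
from collections import Counter

def count_tokens(token_lists):
    """Count lowercased token frequencies across all tokenized rows."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t.lower() for t in tokens)
    return counts

# e.g. two tokenized usernames (made-up sample data)
counts = count_tokens([["John", "Doe"], ["john", "smith"]])
# counts["john"] == 2, counts["doe"] == 1, counts["smith"] == 1
```

A `Counter` starts every key at an implicit 0, so each occurrence contributes exactly 1 to the final count.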
Why are only some of the numerical values repeated? Is there a better way to do this?
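On the "better way" part of my question: instead of concatenating escaped quotes by hand, `csv.writer` can handle the quoting, and `enumerate` would give a running index (the `i` in my loop is never incremented since the `i = i+1` line is commented out). A minimal Python 3 sketch, assuming a flat list of already-tokenized words:

```python
import csv
from collections import Counter

# hypothetical flat token list standing in for the tokenized usernames
words = ["john", "doe", "john", "smith", "doe", "john"]
counts = Counter(words)

with open("dispn_tokenized_count.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    # sorted() only to make the row order deterministic
    for i, (word, n) in enumerate(sorted(counts.items())):
        writer.writerow([i, word, n])
```

`QUOTE_ALL` reproduces the quote-everything style of the original output, and the writer escapes any quotes inside the words itself.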