I am working on creating a bag of words. I referred to this link https://pythonprogramminglanguage.com/bag-of-words/#respond

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv('Twidb11.csv', error_bad_lines=False, sep='delimiter', engine='python')
# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df.Text)
# Reuse the fitted matrix instead of fitting a second time
print X_train_counts.todense()
#X_train_counts.shape
print count_vect.vocabulary_

It gives me the words and their frequency, but the words are not in alphabetical order and a u' prefix appears before each word, as shown below. How can I get rid of it?

Output : { u'binance': 28, u'they': 139, u'just': 83, u'global': 67, u'alternatives': 11, u'zcash': 168, u'years': 165, u'talks': 133, u'japan': 82, u'yes': 166, u'25': 1, u'chinese': 37, u'6000': 5, u'zzzpositive': 170, u'winner': 162, u'28': 2, u'actually':12 ....}

kavya sharma

1 Answer

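The u prefix marks a unicode string in Python 2. It is part of how the string is displayed inside the dict (its repr), not part of the word itself. If you don't want it, convert each key to a plain string using str(). Note that str() raises UnicodeEncodeError in Python 2 if a word contains non-ASCII characters.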
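To illustrate the point about display versus content, here is a minimal sketch. The variable names are made up for the example, and it is written so that it runs on both Python 2 and Python 3.

```python
# Hypothetical example word; any unicode string behaves the same way.
word = u'binance'

# repr() is what you see when the string is printed as part of a dict or list.
# On Python 2 this is u'binance'; on Python 3 it is 'binance'.
shown_in_dict = repr(word)

# Formatting or printing the string itself never includes the prefix.
shown_by_print = "%s" % word

print(shown_in_dict)
print(shown_by_print)
```

So if the goal is only a clean printout, printing the words one by one is enough and no conversion is needed.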

1) To convert the unicode keys into plain strings:

>>> my_dict = {str(i): j for i, j in my_dict.items()}
>>> print my_dict
{'binance': 28, 'global': 67, 'chinese': 37, 'just': 83, '25': 1, 'zzzpositive': 170, 'alternatives': 11, '6000': 5, 'winner': 162, '28': 2, 'zcash': 168, 'actually': 12, 'they': 139, 'talks': 133, 'japan': 82, 'yes': 166, 'years': 165}

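2) To sort my_dict alphabetically by word:

A plain dict in Python 2 does not keep any order, so passing the sorted pairs back to dict() throws the ordering away. Either keep the sorted list of pairs, or store it in a collections.OrderedDict. itemgetter(0) sorts by the word; use itemgetter(1) if you want to sort by the number instead.

>>> from operator import itemgetter
>>> from collections import OrderedDict

>>> OrderedDict(sorted(my_dict.items(), key=itemgetter(0))) # my_dict already has str keys
OrderedDict([('25', 1), ('28', 2), ('6000', 5), ('actually', 12), ('alternatives', 11), ('binance', 28), ('chinese', 37), ('global', 67), ('japan', 82), ('just', 83), ('talks', 133), ('they', 139), ('winner', 162), ('years', 165), ('yes', 166), ('zcash', 168), ('zzzpositive', 170)])

In one line, starting from the original dict with unicode keys:

>>> OrderedDict(sorted((str(i), j) for i, j in my_dict.items()))
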
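The two steps can also be combined into a small helper. This is only a sketch under a few assumptions: the sample dict is a hand-picked subset of the vocabulary shown in the question, the function name sorted_vocabulary is invented for the example, and the code is written to run on both Python 2 and Python 3.

```python
from collections import OrderedDict

# Hypothetical subset of the vocabulary printed in the question.
vocabulary = {u'binance': 28, u'they': 139, u'just': 83, u'25': 1, u'zcash': 168}


def sorted_vocabulary(vocab):
    """Return the entries as an OrderedDict sorted alphabetically by word."""
    # Sorting (word, number) pairs orders them by word first.
    return OrderedDict(sorted(vocab.items()))


ordered = sorted_vocabulary(vocabulary)

# Formatting each word with %s avoids the u'...' repr, so no str() call is needed.
lines = ["%s: %d" % (word, number) for word, number in ordered.items()]
print("\n".join(lines))
```

Printing one entry per line sidesteps the u prefix without converting the keys, so it also works for words with non-ASCII characters. One more thing worth knowing: the numbers in CountVectorizer's vocabulary_ are the column indices of each word in the count matrix, not word frequencies.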
Mohideen bin Mohammed