Your code contains a number of bugs and inefficiencies.
Before you proceed, perhaps spend a moment figuring out how to get your program to tell you when one of your assumptions might not be correct. A good place to start is to add this after the assignment of ex:
print('ex is a {0}: {1!r}'.format(type(ex), ex))
which prints out the type of the variable, and its value. With that in place, you will easily spot the problem:
ex is a <class 'list'>: ['hi how are you']
A slightly more advanced technique is to use logging, which allows you to easily disable the diagnostic messages when your code is working, and later enable them again if you want to make changes to your code and see that it still does what it's supposed to do.
import logging
logging.basicConfig(level=logging.DEBUG)
# ...
logging.debug('ex is a {0}: {1!r}'.format(type(ex), ex))
When you are done debugging, simply change the logging.basicConfig() call to say level=logging.WARNING, which will disable the display of all logging.debug() and logging.info() output. See the logging documentation for details.
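For example, a minimal sketch of that switch (same setup as above):
import logging
logging.basicConfig(level=logging.WARNING)
# ...
logging.debug('this is no longer shown')
logging.info('neither is this')
logging.warning('but this still is')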
Another useful debugging aid is assert:
assert isinstance(ex, str), 'ex is not a str: {0} ({1!r})'.format(type(ex), ex)
See the Python Wiki for some guidance. Notice that assert statements can be disabled, e.g. when you enable optimization of your Python code, so you should perhaps add explicit checks instead, or as well, in your code:
if not isinstance(ex, str):
    raise TypeError('ex must be a str, not {0} ({1!r})'.format(type(ex), ex))
Now, with that out of the way, here is a refactored version of your script doing what I think you were trying to do.
#!/usr/bin/env python
import numpy as np
import logging
logging.basicConfig(level=logging.DEBUG, format='%(module)s:%(asctime)s:%(message)s')
words_to_index={'hi': 0, 'you': 1, 'me': 2, 'are': 3}
ex = "hi how are you" # single string, not list of strings
#print('ex is {0} (type {1})'.format(ex, type(ex)))
logging.debug('ex is {0} (type {1})'.format(ex, type(ex)))
assert isinstance(ex, str), 'ex should be a string (is {0} {1!r})'.format(type(ex), ex)
Z=ex.split(" ") # maybe choose a more descriptive variable name
#ans=[[1,1,0,1]] # never used, commented out
res=np.zeros(40)
#for i in range(0,len(ex)+1): # Looping over the wrong thing
for word in Z:
    logging.debug('word is {0}'.format(word))
    if word in words_to_index: # words_to_index is a hash; no need to loop
        logging.debug('{0} found in {1}'.format(word, Z))
        res[words_to_index[word]] += 1 # notice increment syntax
        logging.debug('res[{0}] = {1}'.format(words_to_index[word], res[words_to_index[word]]))
print(res)
Of course, this is not using NLTK at all; the NLTK library contains a somewhat more advanced set of functions which already do some of this for you, starting with proper NLP tokenization etc., but it does not actually contain a TF component. Perhaps start with Does NLTK have TF-IDF implemented? which has pointers to some existing implementations.
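If you do want to go down that route, here is a minimal sketch (assuming NLTK is installed and the 'punkt' tokenizer data has been downloaded) that replaces the split() call above with NLTK's tokenizer:
import nltk
# nltk.download('punkt')  # uncomment on first use to fetch the tokenizer data
ex = "hi how are you"
Z = nltk.word_tokenize(ex)  # handles punctuation etc. better than ex.split(" ")
print(Z)  # ['hi', 'how', 'are', 'you']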