The expression str(s).split(',') creates a list of strings in which every word except the first one starts with a whitespace character (assuming str(s) worked as expected). When you then call wordnet.synsets(w), you look up a w that begins with a space. WordNet has no such entry, so every synset list will have length 0.
E.g. len(wordnet.synsets(' october')) will be zero.
I recommend debugging to
- check that the str(s) really creates a proper string and
- make sure your 'w's are actually the words (e.g. do not start with whitespace). A simple solution could be to call the .strip() method on each word if the only issue is the whitespace
If you provide a df and a screenshot of your output for that df, it would be easier to pinpoint the issue.
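To illustrate the whitespace issue, here is a minimal sketch. The input string s is a made-up example (I don't know what your actual data looks like), and it avoids WordNet entirely so you can run it as is:

```python
# Hypothetical input; your real `s` comes from the DataFrame.
s = "september, october, november"

# Splitting on ',' keeps the space that follows each comma.
raw_words = str(s).split(',')
print(raw_words)  # ['september', ' october', ' november']

# Strip surrounding whitespace and drop empty entries before any lookup.
words = [w.strip() for w in raw_words if w.strip()]
print(words)  # ['september', 'october', 'november']
```

After this, passing each element of words to wordnet.synsets should no longer fail because of a leading space.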
Update: additional points based on your comments above:
Thank you, Fernanda. I've read your comments above (in the main thread). Here are a few more items you might find relevant:
- wordnet contains only a few adverbs, so in your approach, you might be losing some adverbs
- the synset counting is a bit slow. Instead of taking the length of the result, I'd test it directly with the
if wordnet.synsets(word):
syntax. Maybe it will be faster
- be careful with the idea of counting word occurrences, as a large proportion of totally valid words are rare (they appear only once, even in large corpora). This is related to Zipf's law.
- consider a regular-expression-based method to filter out words that contain unusual characters
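The last two points could be sketched roughly like this. The pattern, the helper names (looks_like_word, hapax_fraction) and the sample tokens are all my own assumptions for illustration; adjust the pattern to whatever counts as a "usual" word in your data:

```python
import re
from collections import Counter

# Assumption: a "usual" word is made of ASCII letters, optionally joined
# by single internal hyphens or apostrophes (e.g. well-known, don't).
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def looks_like_word(token):
    """Return True if the whole token matches the word pattern."""
    return WORD_RE.fullmatch(token) is not None


def hapax_fraction(tokens):
    """Fraction of distinct tokens that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Hypothetical tokens, not taken from your corpus.
tokens = ["october", "well-known", "don't", "x9#q", "h3llo", "october"]
clean = [t for t in tokens if looks_like_word(t)]
print(clean)
print(hapax_fraction(clean))
```

If hapax_fraction is high on your real corpus, a frequency threshold would throw away many legitimate words, which is the Zipf's law concern mentioned above.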