I am reading lines from a CSV file and applying the LDA algorithm to find the most common topics. After processing the data, every word in doc_processed is prefixed with 'u'. Why does this happen, and how can I remove the 'u' from doc_processed? My code (Python 2.7) is:

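import re
import string

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

data = [line.strip() for line in open("/home/dnandini/test/h.csv", 'r')]

stop = set(stopwords.words('english'))  # stop words
exclude = set(string.punctuation)  # to remove the punctuation
lemma = WordNetLemmatizer()  # reduces each word to its base (dictionary) form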

def clean(doc):
    # drop stop words
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    # drop punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # lemmatize each remaining word
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    # drop words of one or two characters
    shortword = re.compile(r'\W*\b\w{1,2}\b')
    output = shortword.sub('', normalized)
    return output

doc_processed = [clean(doc) for doc in data]

Output of doc_processed:

 [u'amount', u'ze69heneicosadien11one', u'trap', u'containing', u'little', u'microg', u'zz69ket', u'attracted', u'male', u'low', u'population', u'level', u'indicating', u'potent', u'sex', u'attractant', u'trap', u'baited', u'z6ket', u'attracted', u'male', u'windtunnel', u'bioassay', u'least', u'100fold', u'attractive', u'male', u'zz69ket', u'improvement', u'trap', u'catch', u'occurred', u'addition', u'z6ket', u'various', u'binary', u'mixture', u'zz69ket', u'including', u'female', u'ratio', u'ternary', u'mixture', u'zz69ket']

1 Answer

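The u'some string' format means the value is a unicode string. The u prefix is only part of how Python 2 displays the string (its repr, which is what you see when you print a list); it is not part of the string's content. See this question for more details on unicode strings themselves, but the easiest way to fix this is likely to encode the result (with its encode method) before returning it from clean.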

def clean(doc):
    # all as before until
    output = shortword.sub('', normalized).encode()
    return output
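
To illustrate the point about the prefix being a display detail, here is a minimal sketch. The word list is a hand-picked sample taken from the question's output, and the explicit "ascii" codec is an assumption that the tokens contain only ASCII characters.

```python
# Sample tokens copied from the question's output (assumed to be pure ASCII).
words = [u"amount", u"trap", u"male"]

# Printing the list shows each element's repr(); in Python 2 that repr
# carries the u prefix for unicode strings.
print(words)

# Joining (or printing) the elements shows only the text itself.
joined = u", ".join(words)
print(joined)

# Encoding each element yields byte strings, so in Python 2 the list
# repr no longer shows the u prefix.
encoded = [w.encode("ascii") for w in words]
print(encoded)
```

If the words are only ever printed individually or joined into a string, no encoding step is needed at all.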

Note that attempting to encode Unicode code points that don't translate directly to the default encoding will raise a UnicodeEncodeError here. The default appears to be ASCII; see sys.getdefaultencoding() on your system to check. You can handle the error in various ways by passing the errors kwarg to encode.

s.encode(errors="ignore")  # omit the code point that fails to encode
s.encode(errors="replace")  # replace the failing code point with '?'
s.encode(errors="xmlcharrefreplace")  # replace the failing code point with an XML character reference such as '&#233;'
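
The handlers above can be compared side by side. This is a small sketch, not part of the original answer: the sample string and the helper name to_ascii are hypothetical, and the codec is named explicitly as "ascii" rather than relying on the system default.

```python
# Hypothetical sample: U+00E9 (e with acute accent) has no ASCII equivalent.
text = u"caf\u00e9"

def to_ascii(value, errors="strict"):
    """Encode a unicode string to ASCII bytes using the given error handler."""
    return value.encode("ascii", errors)

ignored = to_ascii(text, "ignore")              # drops the failing code point
replaced = to_ascii(text, "replace")            # substitutes '?'
xml_ref = to_ascii(text, "xmlcharrefreplace")   # substitutes '&#233;'

# The default handler ("strict") raises instead of substituting anything.
try:
    to_ascii(text)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored)
print(replaced)
print(xml_ref)
print(strict_failed)
```

Which handler to pick depends on whether losing or altering the non-ASCII characters is acceptable for the later LDA step.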
Adam Smith