I am trying to strip all non-word characters from a .csv file before fitting an LDA model. However, after I clean it using:
words = [re.sub(r'\W+','', st) for st in words]
I am left with some 'junk' tokens that may affect the model.
I tried doing this:
words = [re.sub(r'\W+',',', st) for st in words]
but that doesn't solve the issue. Is there a way to also delete the characters before and after these non-word characters, so that the whole token is dropped?
If I run the code without the re.sub line, I get:
>>>'set', 'editorial//a/aeaf-e', '-bd-frd/afac,,', 'photo', 'ab-ddf,', 'recording', 'record', 'belief', 'institution', 'change'
After running it with the re.sub line, I get:
>>>'set', 'editorialaaeafe', 'bdfrdafac', 'photo', 'abddf', 'recording', 'record', 'belief', 'institution', 'change'
What I want to get is:
>>>'set', 'photo', 'recording', 'record', 'belief', 'institution', 'change'
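One possible approach is to filter instead of substitute: re.sub only removes the matched characters and leaves the rest of the token behind, whereas a filter can discard any token that contains a non-word character at all. The following is a minimal sketch, assuming `words` is a flat list of strings like the output shown above; the name `cleaned` is introduced here only for illustration.

```python
import re

# Sample tokens copied from the pre-cleaning output in the question.
words = ['set', 'editorial//a/aeaf-e', '-bd-frd/afac,,', 'photo', 'ab-ddf,',
         'recording', 'record', 'belief', 'institution', 'change']

# Keep a token only if no non-word character (\W) occurs anywhere in it.
# `cleaned` is a hypothetical name, not part of the original code.
cleaned = [w for w in words if not re.search(r'\W', w)]

print(cleaned)
# ['set', 'photo', 'recording', 'record', 'belief', 'institution', 'change']
```

Note that \w also matches digits and underscores, so tokens such as 'abc_1' would survive this filter. If only purely alphabetic tokens should be kept, `w.isalpha()` may be a better test.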