
I am trying to clean a .csv file of all non-word characters for an LDA model; however, after I clean it using:

words = [re.sub(r'\W+','', st) for st in words]

I get some 'junk' leftover that may affect the model.

I tried doing this:

words = [re.sub(r'\W+',',', st) for st in words]

but it doesn't seem to solve the issue. Is there a way to also delete the characters that come before or after these non-word characters?

If I run the code without the re.sub line, what I get is:

>>>'set', 'editorial//a/aeaf-e', '-bd-frd/afac,,', 'photo', 'ab-ddf,', 'recording', 'record', 'belief', 'institution', 'change'

After running it with the re.sub line I get this:

>>>'set', 'editorialaaeafe', 'bdfrdafac', 'photo', 'abddf', 'recording', 'record', 'belief', 'institution', 'change'

What I want to get is:

>>>'set', 'photo', 'recording', 'record', 'belief', 'institution', 'change'
    Please provide a [mcve] including sample input and sample output to show the "junk" you are saying is left behind – G. Anderson Feb 13 '20 at 22:21
  • If you need to filter non-word characters from a **CSV** file, then it is probably not a Comma Separated Values file... So I suppose that is not what you want to do, but without more information I cannot guess what you are trying to achieve... – Serge Ballesta Feb 13 '20 at 22:29
  • @G.Anderson so for instance using my code I get from this - site///usscihoneybees~ to this - siteusscihoneybees and it appears as one of the tokens in my LDA model and I don't want it to. Does it make sense? – darkknight555 Feb 13 '20 at 22:48
  • @SergeBallesta it is a csv file, it just contains a lot of urls and numbers and for some reason when I try to tokenize it for lda, it takes in all of that as tokens. – darkknight555 Feb 13 '20 at 22:51
  • What I meant is that what you want to clean is probably not the file itself but the extracted fields. But you will get little or even no help at all if you do not show some input data and the expected output, with some explanation of the rationale for the change. Not that we do not want to help you, but without that information we just cannot. – Serge Ballesta Feb 13 '20 at 22:55
  • @SergeBallesta ok, I understand, thank you for pointing that out. I edited the post, I added the input and expected output, I hope it can clear some things out. – darkknight555 Feb 13 '20 at 23:14
  • It looks like you only want to keep the current items that contain just letters. Something like `words = [st for st in words if st.isalpha()]` – Jongware Feb 13 '20 at 23:45
  • Does this answer your question? [How to check if a word is an English word with Python?](https://stackoverflow.com/questions/3788870/how-to-check-if-a-word-is-an-english-word-with-python) – G. Anderson Feb 14 '20 at 16:14

1 Answer


You have to test each word of the list against the regex. As the expression will be used more than once, it is better to compile it first:

import re

reject = re.compile(r'\W+')
[w for w in words if not reject.search(w)]

You could also use a positive version:

clean = re.compile(r'\w+$')
[w for w in words if clean.match(w)]

From your sample input, both snippets give the expected result:

['set', 'photo', 'recording', 'record', 'belief', 'institution', 'change']
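A minimal, self-contained sketch of both approaches, using the sample tokens from the question (note that `search` looks for a non-word character anywhere in the token, while the anchored `\w+$` with `match` requires the whole token to be word characters):

```python
import re

# Sample tokens from the question, including the "junk" entries
words = ['set', 'editorial//a/aeaf-e', '-bd-frd/afac,,', 'photo',
         'ab-ddf,', 'recording', 'record', 'belief', 'institution', 'change']

# Negative version: reject any token containing a non-word character
reject = re.compile(r'\W+')
kept = [w for w in words if not reject.search(w)]

# Positive version: keep only tokens made entirely of word characters
clean = re.compile(r'\w+$')
kept_positive = [w for w in words if clean.match(w)]

print(kept)  # ['set', 'photo', 'recording', 'record', 'belief', 'institution', 'change']
assert kept == kept_positive
```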
Serge Ballesta