Remove different meaningless tokens from text in Python

Question

I am new to topic modeling. After tokenizing with NLTK, I have, for example, the following tokens:

'1-in', '1-joerg', '1-justine', '1-lleyton', '1-million', '1-nil', '1of', '00pm-ish', '01.41', '01.57', '0-40', '0-40f'

I believe they are meaningless and cannot help me in the rest of my process. Is that correct? If so, does anyone have an idea for a regular expression (or some other approach) that could remove these tokens from my token list? They are so varied that I could not come up with a single regexp for this purpose.

You can create a regex for each pattern within the list of tokens and then join them with `|`. If you have a great many patterns, then whitelisting the good tokens may help you more than blacklisting these. — a_guest, Oct 06 '18 at 23:29
These are just examples! I can't write a regexp for each of them one by one. I asked this question to see if anyone knows a way to find a general regexp for similar examples, something like removing strings which length of numerical values is higher than that of non-numerical. — user3665906, Oct 07 '18 at 05:29
*"something like removing strings which length of numerical values is higher than that of non-numerical"* This can be done, for example. You just need to be explicit about all the different patterns that can occur, just like the one with numerical / non-numerical values. Once you have identified all possible patterns, you can create the corresponding (joint) regex. If blacklisting is too difficult, then whitelisting might be an option as well. — a_guest, Oct 07 '18 at 21:53
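
The numeric versus non-numeric idea discussed in these comments can be sketched roughly as follows. This is only an illustration: the function name `mostly_digits` and the sample list are made up for the example, and the rule (strictly more digits than other characters) is just one reading of the comment.

```python
def mostly_digits(token):
    """Return True when the token has more digit characters than other characters."""
    digits = sum(ch.isdigit() for ch in token)
    others = len(token) - digits
    return digits > others

# Sample tokens taken from the question.
tokens = ['1-in', '1-joerg', '1-justine', '1-lleyton', '1-million', '1-nil',
          '1of', '00pm-ish', '01.41', '01.57', '0-40', '0-40f']

# Keep only the tokens that the heuristic does not flag.
kept = [t for t in tokens if not mostly_digits(t)]
print(kept)
```

With this rule only '01.41', '01.57', '0-40' and '0-40f' are dropped; tokens such as '1-in' or '1of' survive because they contain more non-digit characters than digits, so a stricter rule may be needed for them.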

john smith · Answer 1 · 2018-10-14T07:07:15.910

I've found the easiest way to get rid of words I don't want in a string is to replace them with an empty string, using a compiled regex and a lookup dictionary.

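    import re

    def word_replace(text, replace_dict):
        # Matches identifier-like words only; tokens that start with a
        # digit (e.g. '1-in') are not matched by this pattern.
        rc = re.compile(r"[A-Za-z_]\w*")

        def translate(match):
            word = match.group(0)
            # Look up case-insensitively; words not in the dict are kept unchanged.
            return replace_dict.get(word.lower(), word)

        return rc.sub(translate, text)

    # Read the text that contains the unwanted strings.
    with open('C:/the_file_with_this_string') as f:
        old_text = f.read()

    # Map each unwanted (lowercase) string to its replacement.
    replace_dict = {
        "unwanted_string1": '',
        "unwanted_string2": '',
        "unwanted_string3": '',
        "unwanted_string4": '',
        "unwanted_string5": '',
        "unwanted_string6": '',
        "unwanted_string7": '',
        "unwanted_string8": '',
        "unwanted_string9": '',
        "unwanted_string10": '',
    }

    output = word_replace(old_text, replace_dict)

    # Overwrite the original file with the cleaned text.
    with open('C:/the_file_with_this_string', 'w') as f:
        f.write(output)
    print(output)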

Replace 'C:/the_file_with_this_string' with the path to the file that contains the text.

Replace each unwanted_string(#) with a string you want to get rid of.

I can't do that one by one because there are too many examples. — user3665906, Oct 07 '18 at 05:38
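
When listing unwanted strings one by one is not feasible, the whitelisting idea raised in the comments on the question is one alternative: keep only tokens that look like ordinary words and drop everything else. The following is a minimal sketch under assumptions, not a definitive rule. The pattern, the name `keep_wordlike` and the sample list are invented for illustration, and the pattern only accepts ASCII letters with optional internal hyphens or apostrophes.

```python
import re

# Letters, optionally joined by single hyphens or apostrophes (e.g. "well-known", "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def keep_wordlike(tokens):
    """Keep tokens that consist entirely of a word-like pattern."""
    return [t for t in tokens if WORD_RE.fullmatch(t)]

sample = ['1-in', '1-million', '1of', '00pm-ish', '01.41', '0-40f',
          'tennis', 'well-known', "don't"]
print(keep_wordlike(sample))
```

Every example token from the question contains a digit, so all of them are filtered out here, while 'tennis', 'well-known' and "don't" are kept. Whether that is too aggressive (it would also drop tokens like '2018') depends on the corpus.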
