I have a dataset of tokenized text. My pre-processing steps were, first, tokenizing the words, and then normalizing slang words. However, a normalized slang term can expand into a phrase containing whitespace. I'm trying to do another round of tokenization to split those phrases, but I can't figure out how. Here's an example of my data:
   firstTokenization  normalized                       secondTokenization
0  [yes, no, cs]      [yes, no, customer service]      [yes, no, customer, service]
1  [nlp]              [natural language processing]    [natural, language, processing]
2  [no, yes]          [no, yes]                        [no, yes]
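For a reproducible setup, a toy frame matching the example can be built like this (the 'content' column name comes from my code below; the raw strings are my guess at input that would produce the firstTokenization column):

import pandas as pd

# toy data reproducing the example rows above
df = pd.DataFrame({'content': ['yes no cs', 'nlp', 'no yes']})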
I am trying to figure out how to generate the secondTokenization column. Here's the code I'm currently working with:
import pandas as pd
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer()

def tokenization(text):
    # split on whitespace, then let MWETokenizer merge any registered multi-word expressions
    return tokenizer.tokenize(text.split())
df['firstTokenization'] = df['content'].apply(lambda x: tokenization(x.lower()))
# build a slang -> normalized-form lookup from the first two columns of the spreadsheet
normalized_word = pd.read_excel('normalisasi.xlsx')

normalized_word_dict = {}
for index, row in normalized_word.iterrows():
    if row.iloc[0] not in normalized_word_dict:
        normalized_word_dict[row.iloc[0]] = row.iloc[1]
def normalized_term(document):
    # replace each token with its normalized form, leaving unknown tokens unchanged
    return [normalized_word_dict.get(term, term) for term in document]

df['normalized'] = df['firstTokenization'].apply(normalized_term)
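What I think I need for secondTokenization is to re-split every normalized term on whitespace and flatten the pieces back into one list per row. Here is a minimal sketch of that idea (retokenize is my own hypothetical helper; itertools.chain is just one way to do the flattening):

from itertools import chain

def retokenize(document):
    # split each normalized term on whitespace and flatten the result,
    # so 'customer service' becomes ['customer', 'service']
    return list(chain.from_iterable(term.split() for term in document))

df['secondTokenization'] = df['normalized'].apply(retokenize)

On the example data this would turn [yes, no, customer service] into [yes, no, customer, service], which matches the secondTokenization column above, but I'm not sure it's the right approach for my full dataset.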