I have some text from which I have already removed stop words, links, emoticons, and so on. After tokenizing my dataframe, the result does not look good: many stray punctuation marks are treated as separate tokens and end up in the processed text.
For tokenization I use the following command:

import nltk

Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].apply(nltk.word_tokenize)
As you can see, there are many characters such as dashes, colons, and so on. The obvious question is: why not remove the punctuation before tokenization? The problem is that the text contains decimal values that I need, and removing punctuation marks before tokenization splits each of them into two words, which is not correct.
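To illustrate why I want to tokenize first: as far as I understand, nltk.word_tokenize keeps a decimal number as a single token while still splitting surrounding punctuation into its own tokens (the sentence below is just a made-up illustration, not from my data):

import nltk
# nltk.download('punkt')  # tokenizer models, uncomment if not downloaded yet

print(nltk.word_tokenize("The price rose by 3.5 percent - quite a jump, right?"))
# Expected output (roughly):
# ['The', 'price', 'rose', 'by', '3.5', 'percent', '-', 'quite', 'a', 'jump', ',', 'right', '?']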
An example of what happens when you remove punctuation marks before tokenization:
import texthero as hero
from texthero import preprocessing

custom_pipeline2 = [preprocessing.remove_punctuation]
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].pipe(hero.clean, custom_pipeline2)
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(nltk.word_tokenize)
I have found a couple of examples of how to solve this punctuation problem, but they work on a plain string rather than a dataframe. Is there a way to customize NLTK's tokenization? Or should I use some kind of regular expression to post-process the resulting token lists?
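To make the second idea concrete, here is a rough sketch of the kind of post-processing I have in mind (filter_tokens is a hypothetical helper of mine, not an existing function; it simply drops tokens that contain no letters or digits, so standalone dashes and colons disappear while values like '3.5' survive):

def filter_tokens(tokens):
    # Keep a token only if it contains at least one letter or digit,
    # so pure punctuation tokens like '-' or ':' are removed
    # while decimal values such as '3.5' stay intact.
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(filter_tokens)

But I am not sure whether filtering after the fact like this is the right approach, or whether the tokenizer itself can be customized to do it.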