
I have some text that I have already cleaned of stop words, links, emoticons, etc. After tokenizing my dataframe, the result is not great: a lot of extra punctuation marks are identified as separate words and end up in the processed text. (screenshot of the tokenized column)

For this I use the following command:

Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].apply(nltk.word_tokenize)
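To make the problem concrete, here is a small standalone illustration (the sample sentence is made up) of how nltk.word_tokenize handles punctuation and decimal numbers:

import nltk
nltk.download('punkt', quiet=True)  # tokenizer models; newer NLTK versions may also need 'punkt_tab'

sample = "Revenue grew 3.5% - analysts expected less: a surprise."
print(nltk.word_tokenize(sample))
# ['Revenue', 'grew', '3.5', '%', '-', 'analysts', 'expected', 'less', ':', 'a', 'surprise', '.']

The decimal 3.5 stays intact, but every dash, colon and sentence-final period becomes a token of its own.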

As you can see, there are many characters like dashes, colons, etc. The obvious question is: why not remove the punctuation before tokenization? The problem is that the text contains decimal values that I need to keep, and removing punctuation marks before tokenization splits them into two separate words, which is not correct.

An example of what happens when you remove punctuation marks before tokenization:

custom_pipeline2 = [preprocessing.remove_punctuation]
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].pipe(hero.clean, custom_pipeline2)
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(nltk.word_tokenize)

(screenshot: decimal numbers split into separate tokens)
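A minimal sketch of why the decimals break (the punctuation removal is simulated here with a plain regex; texthero's remove_punctuation replaces each punctuation character with a space, which has the same effect):

import re
import nltk

sample = "Revenue grew 3.5%"
no_punct = re.sub(r"[^\w\s]", " ", sample)  # strip punctuation *before* tokenizing
print(nltk.word_tokenize(no_punct))
# ['Revenue', 'grew', '3', '5']  <- 3.5 has been torn into two tokens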

I have found a couple of examples of how to solve this punctuation problem, but only for cases where the data is a plain string rather than a dataframe. Can nltk's tokenization be customized somehow? Or can some kind of regular expression be used to post-process the resulting list?
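One possible direction for the first option (a sketch only, not tested on the data above; column names follow the earlier snippets) is to swap word_tokenize for nltk's RegexpTokenizer with a pattern that keeps decimal numbers as single tokens and never emits punctuation-only tokens:

from nltk.tokenize import RegexpTokenizer

# match a decimal number as one token, otherwise runs of word characters; punctuation is skipped entirely
tokenizer = RegexpTokenizer(r"\d+(?:\.\d+)?|\w+")
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].apply(tokenizer.tokenize)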

kostya ivanov
  • Please refer to this: https://stackoverflow.com/questions/15547409/how-to-get-rid-of-punctuation-using-nltk-tokenizer – Gedas Miksenas Nov 04 '21 at 10:15
  • @Gedas Miksenas, yes, I found this post, but it doesn't solve my problem with decimal numbers. They are also split into two separate numbers. – kostya ivanov Nov 04 '21 at 10:22
  • @Gedas Miksenas, most likely it would be more correct to use a regular expression after tokenization, but I have no idea how to write it correctly – kostya ivanov Nov 04 '21 at 10:28

1 Answer

import re

# remove single-character punctuation tokens from the stringified token list
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(lambda x: re.sub(r"(, '[\W\.]')", r"", str(x)))
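In short: str(x) turns each token list into its string representation, and the regex deletes every ", 'X'" fragment where X is a single punctuation character, so stray dashes and colons disappear while multi-character tokens such as 3.5 survive. Note that the column then holds strings rather than lists of tokens. If the lists should stay lists, one possible alternative (same column name as above, reusing the re module already imported) is to filter the tokens directly:

# keep every token containing at least one letter or digit, so '-' and ':' are dropped but '3.5' is kept
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(lambda tokens: [t for t in tokens if re.search(r"\w", t)])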
kostya ivanov
  • While this code may answer the question, it would be better to include some _context_, explaining _how_ it works and _when_ to use it. Code-only answers are not useful in the long run. – PCM Nov 04 '21 at 12:28