I have some text from which I have already removed stop words, links, emoticons, and so on. After tokenizing my dataframe, the result does not look good: many stray punctuation marks are treated as separate tokens and end up in the processed text.
For tokenization I use the following command:

import nltk

Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].apply(nltk.word_tokenize)
As you can see, there are many characters such as dashes, colons, and so on. The obvious question is: why not remove the punctuation before tokenization? The problem is that the text contains decimal values that I need, and removing punctuation marks before tokenization splits each of them into two words, which is not correct.
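To illustrate why I want to tokenize first: as far as I understand, nltk.word_tokenize keeps a decimal number as a single token while still splitting surrounding punctuation into its own tokens (the sentence below is just a made-up illustration, not from my data):

import nltk
# nltk.download('punkt')  # tokenizer models, uncomment if not downloaded yet

print(nltk.word_tokenize("The price rose by 3.5 percent - quite a jump, right?"))
# Expected output (roughly):
# ['The', 'price', 'rose', 'by', '3.5', 'percent', '-', 'quite', 'a', 'jump', ',', 'right', '?']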
An example of what happens when you remove punctuation marks before tokenization:
import texthero as hero
from texthero import preprocessing

custom_pipeline2 = [preprocessing.remove_punctuation]
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['tweet_without_stopwords'].pipe(hero.clean, custom_pipeline2)
Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(nltk.word_tokenize)
I have found a couple of examples of how to solve this punctuation problem, but they work on a plain string rather than a dataframe. Is there a way to customize NLTK's tokenization? Or should I use some kind of regular expression to post-process the resulting token lists?
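To make the second idea concrete, here is a rough sketch of the kind of post-processing I have in mind (filter_tokens is a hypothetical helper of mine, not an existing function; it simply drops tokens that contain no letters or digits, so standalone dashes and colons disappear while values like '3.5' survive):

def filter_tokens(tokens):
    # Keep a token only if it contains at least one letter or digit,
    # so pure punctuation tokens like '-' or ':' are removed
    # while decimal values such as '3.5' stay intact.
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

Data_preprocessing['clean_custom_content_tokenize'] = Data_preprocessing['clean_custom_content_tokenize'].apply(filter_tokens)

But I am not sure whether filtering after the fact like this is the right approach, or whether the tokenizer itself can be customized to do it.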