Only remove punctuation for words, not numbers

Question

I am using the tm package in R to remove punctuation.

TextDoc <- tm_map(TextDoc, removePunctuation)

Is there a way to remove punctuation only when it is attached to a letter/word, and not when it is part of a number?

E.g.

I want performance. --> performance, but I want 3.14 --> 3.14

Example of how I want the function to work:

wall, --> wall
expression. --> expression
ef. --> ef
A. --> A
name: --> name
:ok --> ok

91.8.10 --> 91.8.10

EDIT:

TextDoc is of the form:

How exactly do you define "if it has to do with a letter/word"? It would be helpful if you could include more test cases in a [reproducible data format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that can be used for testing. — MrFlick, Jun 11 '21 at 06:51

AnilGoyal · Answer 1 · 2021-06-13T07:29:05.483

You may also try gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = TRUE), where text is your text vector. Explanation of the regex:

(?<!\\d) negative lookbehind: the match must not be preceded by a digit
[[:punct:]] matches a single punctuation character
(?=\\D) positive lookahead for a non-digit character
? makes that lookahead optional (zero or one time), so in practice only the lookbehind restricts the match

The original answer also links to a regex demo.

text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")

gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = TRUE)
#> [1] "wall 88.1"  "expression" "ef"         "ok"         "A"         
#> [6] "3.14"       "91.8.10"


long_text <- "wall, 88.1 expression. ef. :ok A. 3.14 91.8.10"

gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', long_text, perl = TRUE)
#> [1] "wall 88.1 expression ef ok A 3.14 91.8.10"

Created on 2021-06-13 by the reprex package (v2.0.0)
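Because the lookahead is optional, the pattern behaves like the lookbehind alone. The sketch below checks that on the answer's sample vector and probes a few edge cases. The `edge` vector and the stricter `strict` pattern are illustrations added here, not part of the original answer; the stricter pattern assumes you want to keep punctuation only when it sits between two digits.

```r
# Sample vector from the answer
text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")

original   <- gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = TRUE)
simplified <- gsub('(?<!\\d)[[:punct:]]', '', text, perl = TRUE)

# An optional zero-width lookahead can never make a match fail,
# so both patterns give the same result
same <- identical(original, simplified)

# Edge cases (hypothetical inputs): a sentence ending in a number,
# a leading decimal point, and an ordinary decimal
edge <- c("I have 5.", ".5", "3.14")

# Lookbehind only: the trailing "." after 5 survives, the leading "." is dropped
lookbehind_only <- gsub('(?<!\\d)[[:punct:]]', '', edge, perl = TRUE)

# Stricter variant (an assumption about the desired behaviour):
# remove punctuation unless it has a digit on both sides
strict <- gsub('(?<!\\d)[[:punct:]]|[[:punct:]](?!\\d)', '', edge, perl = TRUE)

print(same)
print(lookbehind_only)
print(strict)
```

Whether "I have 5." should keep its full stop depends on your data, so it is worth testing both patterns on a sample of the real corpus before choosing one.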

Fons MA · Accepted Answer · 2021-06-13T10:50:50.727

I've completely revamped my answer based on your specification and on AnilGoyal's answer, which is much more widely applicable than what I originally had.

library(tm)

# Here we pretend that your texts are like this
text <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10",
          "w.a.ll, 6513.645+1646-5")

# and we create a corpus with them, like the one you show
corp <- Corpus(VectorSource(text))

# you create a function with any of the solutions that we've provided here
# I'm taking AnilGoyal's because it's better than my rushed purrr one.

my_remove_punct <- function(x) {
  # drop any punctuation character that does not directly follow a digit
  gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', x, perl = TRUE)
}

# pass the function to tm_map
new_corp <- tm_map(corp, my_remove_punct)

# Applying the function gives a warning about dropping documents, but that is a bug in the tm package.

# Printing each document confirms that the contents are correct. The final lines of
# the output are sapply's return value: all the individual documents together.
sapply(new_corp, print)
#> [1] "wall"
#> [1] "expression"
#> [1] "ef"
#> [1] "ok"
#> [1] "A"
#> [1] "3.14"
#> [1] "91.8.10"
#> [1] "wall 6513.645+1646-5"
#> [1] "wall"                 "expression"           "ef"                  
#> [4] "ok"                   "A"                    "3.14"                
#> [7] "91.8.10"              "wall 6513.645+1646-5"

The "dropping documents" warning is spurious, as you can see from the printed contents. An explanation is given in another SO question.

In the future, note that you can get better answers more quickly by providing raw data: call dput on your object, e.g. dput(TextDoc). If the output is too long, subset the object first.
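As a minimal sketch of the dput advice: the vector below is a hypothetical stand-in for TextDoc, since the question does not show the real object, and a tm corpus would produce longer output than a plain character vector.

```r
# Hypothetical stand-in for the asker's data; the real TextDoc is not shown
TextDoc <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")

# dput() prints R code that recreates the object exactly
dput(TextDoc)

# If the object is large, share only a small subset
dput(head(TextDoc, 3))
#> c("wall,", "expression.", "ef.")

# Anyone can paste the printed code back into R to rebuild the data
rebuilt <- eval(parse(text = capture.output(dput(TextDoc))))
identical(rebuilt, TextDoc)
#> [1] TRUE
```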

thanks for the help, how can I apply your code to the form my dataset TextDoc is in? — user11015000, Jun 13 '21 at 05:19
@user11015000, updated now by actually using the `tm` package. — Fons MA, Jun 13 '21 at 10:51

score 1 · Answer 3 · answered Jun 11 '21 at 09:08

Tried to make it less ugly but here is my best shot:

library(data.table)
library(tm)  # provides Corpus(), VectorSource(), tm_map() and removePunctuation()

TextDoc <- data.table(text = c("wall",
                      "expression.", 
                      "ef.",
                      "91.8.10",
                      "A.", 
                      "name:", 
                      ":ok"))

TextDoc[grepl("[a-zA-Z]", text), 
      text := unlist(tm_map(Corpus(VectorSource(as.vector(text))), removePunctuation))[1:length(grepl("[a-zA-Z]", text))]]

Which gives us:

> TextDoc
         text
1:       wall
2: expression
3:         ef
4:    91.8.10
5:          A
6:       name
7:         ok
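The same per-element idea can be sketched in base R, without data.table or tm. This is an illustration under the assumption that the text is a plain character vector with one token per element; `has_letter` and `cleaned` are names introduced here.

```r
# One token per element, modelled on the question's examples
text <- c("wall,", "expression.", "ef.", "91.8.10", "A.", "name:", ":ok")

# TRUE for elements that contain at least one ASCII letter
has_letter <- grepl("[a-zA-Z]", text)

# Strip punctuation only from those elements; purely numeric ones are untouched
cleaned <- text
cleaned[has_letter] <- gsub("[[:punct:]]", "", text[has_letter])
print(cleaned)

# Limitation: the decision is made per element, so a string that mixes
# words and numbers loses the punctuation inside its numbers too
mixed <- gsub("[[:punct:]]", "", "wall, 88.1")
print(mixed)
#> [1] "wall 881"
```

If documents mix words and numbers in the same string, a character-level rule such as the lookbehind regex from the other answers is the safer choice.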
