
I am using the tm package in R to remove punctuation.

TextDoc <- tm_map(TextDoc, removePunctuation)

Is there a way to remove punctuation only when it is attached to a letter/word, and leave it alone when it is part of a number?

E.g.

I want performance. --> performance, but I want 3.14 --> 3.14

Example of how I want the function to work:

wall, --> wall
expression. --> expression
ef. --> ef
A. --> A
name: --> name
:ok --> ok

91.8.10 --> 91.8.10

EDIT:

TextDoc is of the form: [image: form of textdoc]

Waldi
user11015000
  • How exactly do you define "if it has to do with a letter/word"? It would be helpful if you could include more test cases in a [reproducible data format](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that can be used for testing. – MrFlick Jun 11 '21 at 06:51
  • updated with examples – user11015000 Jun 11 '21 at 06:53

3 Answers

5

You may also try gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = TRUE), where text is your text vector. Explanation of the regex:

  • (?<!\\d) is a negative lookbehind: the match must not be preceded by a digit
  • [[:punct:]] matches a punctuation character
  • (?=\\D) is a positive lookahead for a non-digit character
  • ? makes that lookahead optional (zero or one time), so punctuation at the very end of a string is removed as well
  • check this for a regex demo
text <- c("wall, 88.1", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10")

gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', text, perl = TRUE)
#> [1] "wall 88.1"  "expression" "ef"         "ok"         "A"         
#> [6] "3.14"       "91.8.10"


long_text <- "wall, 88.1 expression. ef. :ok A. 3.14 91.8.10"

gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', long_text, perl = TRUE)
#> [1] "wall 88.1 expression ef ok A 3.14 91.8.10"

Created on 2021-06-13 by the reprex package (v2.0.0)
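
One thing to watch for: the pattern above only looks behind the punctuation mark, so a mark that directly follows a digit is kept even at the end of a sentence ("I have 3." stays as it is). If that matters for your data, a possible variation is to drop a mark unless it has a digit on both sides. The sketch below is not from the answer itself; remove_word_punct is a hypothetical helper name and it uses base R only.

```r
# Hypothetical helper (an assumption for illustration, not part of tm):
# remove a punctuation character unless it sits between two digits.
remove_word_punct <- function(x) {
  # first alternative: punctuation NOT preceded by a digit
  # second alternative: punctuation NOT followed by a digit
  gsub("(?<!\\d)[[:punct:]]|[[:punct:]](?!\\d)", "", x, perl = TRUE)
}

examples <- c("wall, 88.1", "I have 3.", "(3.14)", "91.8.10", ":ok")
result <- remove_word_punct(examples)
result
#> [1] "wall 88.1" "I have 3"  "3.14"      "91.8.10"   "ok"
```

The trade-off is that anything between two digits survives (for example the comma in "1,000"), while a leading minus sign as in "-5" is removed, so check it against your own text before relying on it.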

AnilGoyal
3

I've completely revamped my answer based on your specification and AnilGoyal's answer, which is much more widely applicable than what I originally had.

library(tm)

# Here we pretend that your texts are like this
text <- c("wall,", "expression.", "ef.", ":ok", "A.", "3.14", "91.8.10",
          "w.a.ll, 6513.645+1646-5")

# and we create a corpus with them, like the one you show
corp <- Corpus(VectorSource(text))

# Wrap any of the solutions posted here in a function.
# I'm using AnilGoyal's because it's better than my rushed purrr one.

my_remove_punct <- function(x) {
  gsub('(?<!\\d)[[:punct:]](?=\\D)?', '', x, perl = TRUE)
}

# Pass the function to tm_map
new_corp <- tm_map(corp, my_remove_punct)

# Applying the function gives a warning about dropping documents, but that is a bug in the tm package.


# Print each document to confirm that the contents are correct. The final lines of output are sapply's return value, i.e. all documents together.
sapply(new_corp, print)
#> [1] "wall"
#> [1] "expression"
#> [1] "ef"
#> [1] "ok"
#> [1] "A"
#> [1] "3.14"
#> [1] "91.8.10"
#> [1] "wall 6513.645+1646-5"
#> [1] "wall"                 "expression"           "ef"                  
#> [4] "ok"                   "A"                    "3.14"                
#> [7] "91.8.10"              "wall 6513.645+1646-5"

The warning about "dropping documents" is spurious, as you can see from the printed output. An explanation is in this other SO question.
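
If you would rather not see the warning at all, one option is to build the corpus with VCorpus() and wrap the function in content_transformer(), which is how tm expects custom transformations to be passed to tm_map(). This is a minimal sketch under that assumption (the corpus in the question may be a different class), and as far as I know this code path does not emit the "dropping documents" warning.

```r
library(tm)

text <- c("wall,", "expression.", ":ok", "3.14", "91.8.10")

# VCorpus() instead of Corpus(): an assumption made for this sketch
corp <- VCorpus(VectorSource(text))

# content_transformer() wraps a character-level function so that tm_map()
# changes each document's content and keeps the document objects intact
remove_word_punct <- content_transformer(function(x) {
  gsub("(?<!\\d)[[:punct:]](?=\\D)?", "", x, perl = TRUE)
})

new_corp <- tm_map(corp, remove_word_punct)

# Pull the cleaned strings back out as a plain character vector
cleaned <- unname(sapply(new_corp, as.character))
cleaned
```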

In the future, note that you can get better answers faster by sharing your data with dput, e.g. dput(TextDoc). If the output is too long, you can subset the object first.

Fons MA
1

I tried to make it less ugly, but here is my best shot:

library(data.table)
library(tm)  # provides removePunctuation()

TextDoc <- data.table(text = c("wall",
                               "expression.",
                               "ef.",
                               "91.8.10",
                               "A.",
                               "name:",
                               ":ok"))

# Strip punctuation only from rows that contain at least one letter.
# removePunctuation() also accepts a plain character vector, so no Corpus is needed here.
TextDoc[grepl("[a-zA-Z]", text), text := removePunctuation(text)]

Which gives us:

> TextDoc
         text
1:       wall
2: expression
3:         ef
4:    91.8.10
5:          A
6:       name
7:         ok
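
Note that this filter works on whole rows, so a row that mixes words and numbers (such as "wall, 88.1") would lose the punctuation inside the number too. A possible way around that is to apply the same letter test to each whitespace-separated token instead. The following is only a sketch in base R; strip_word_punct_tokens is a hypothetical name, and it assumes tokens are separated by single spaces.

```r
# Hypothetical helper (not from the answer above): apply the "contains a
# letter" test per token rather than per row.
strip_word_punct_tokens <- function(x) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    has_letter <- grepl("[[:alpha:]]", tokens)
    # only tokens containing at least one letter lose their punctuation
    tokens[has_letter] <- gsub("[[:punct:]]", "", tokens[has_letter])
    paste(tokens, collapse = " ")
  }, character(1))
}

mixed <- c("wall, 88.1", "name:", ":ok", "91.8.10")
tokenwise <- strip_word_punct_tokens(mixed)
tokenwise
#> [1] "wall 88.1" "name"      "ok"        "91.8.10"
```

A token with both letters and digits (say "v1.2") counts as a word here and is stripped to "v12"; whether that is what you want depends on your text.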
  
koolmees