5

Question:

How can you use R to remove all special characters from a dataframe, quickly and efficiently?

Progress:

This SO post details how to remove special characters. I can apply the gsub function to single columns (images 1 and 2), but not the entire dataframe.

Problem:

My dataframe consists of 100+ columns of integers, strings, etc. When I run gsub on the whole dataframe, it doesn't return the output I want. Instead, I get what's shown in image 3.

df <- read.csv("C:/test.csv")
dfa <- gsub("[[:punct:]]", "", df$a) #this works on a single column
dfb <- gsub("[[:punct:]]", "", df$b) #this works on a single column
df_all <- gsub("[[:punct:]]", "", df) #this does not work on the entire df
View(df_all)

df - This is the original dataframe:

Original dataframe

dfb - This is gsub applied to column b. Good!

gsub applied to column b

df_all - This is gsub applied to the entire dataframe. Bad!

gsub applied to entire dataframe

Summary:

Is there a way to gsub an entire dataframe? If not, should an apply function be used instead?

PizzaAndCode

3 Answers

6

Here is a possible solution using dplyr:

# Example data
bla <- data.frame(a = c(1,2,3), 
              b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"), 
              c = c("%#%$#^#", "%#$#%@", ",.,gdgd$%,."))

# Use mutate_all from dplyr
library(dplyr)

bla %>%
  mutate_all(funs(gsub("[[:punct:]]", "", .)))

  a           b    c
1 1        fefa     
2 2         fes     
3 3 gDEEwfseges gdgd

Update:

mutate_all has been superseded by across (as of dplyr 1.0.0), and funs has been deprecated since dplyr 0.8.0. Here is an updated solution using mutate and across:

# Example data
df <- data.frame(a = c(1,2,3), 
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"), 
                 c = c("%#%$#^#", "%#$#%@", ",.,gdgd$%,."))

# Use mutate and across from dplyr
library(dplyr)

df %>%
  mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))
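Note that everything() also sends numeric columns through gsub, which returns character vectors and treats decimal points and minus signs as punctuation. If only the text columns need cleaning, something like the following may be closer to what is wanted. This is a minimal sketch, assuming dplyr 1.0.0 or later; the name df_clean and the decimal value in column a are illustrative, not from the question.

```r
library(dplyr)

# Example data with a numeric column that should keep its type and value
df <- data.frame(a = c(1.5, 2, 3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 c = c("%#%$#^#", "%#$#%@", ",.,gdgd$%,."),
                 stringsAsFactors = FALSE)

# Apply gsub only to character columns; other columns are left untouched
df_clean <- df %>%
  mutate(across(where(is.character), ~ gsub("[[:punct:]]", "", .x)))

# Inspect the column types after cleaning
str(df_clean)
```

Here column a stays numeric, while b and c are stripped of punctuation.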
Rich Pauloo
Ryan
3

Another solution is to convert the data frame to a matrix, run gsub on the matrix, and then convert the result back to a data frame:

as.data.frame(gsub("[[:punct:]]", "", as.matrix(df))) 
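One caveat with the matrix route: as.matrix on a data frame with mixed column types produces a character matrix, so numeric columns do not come back as numeric, and any decimal points in them are removed as punctuation. A base R sketch that cleans only the character columns is shown below. It assumes the text columns are stored as character rather than factor; the name is_chr and the example data are illustrative.

```r
# Example data: one numeric column, one character column
df <- data.frame(a = c(1.5, 2, 3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 stringsAsFactors = FALSE)

# Logical vector marking which columns are character
is_chr <- vapply(df, is.character, logical(1))

# Overwrite only those columns; assigning into df[is_chr] keeps the data frame intact
df[is_chr] <- lapply(df[is_chr], function(x) gsub("[[:punct:]]", "", x))

# Inspect the column types after cleaning
str(df)
```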
GordonShumway
2

I like Ryan's answer using dplyr. Since mutate_all has been superseded and funs is deprecated, here is my suggested updated solution using mutate and across:

# Example data
df <- data.frame(a = c(1,2,3), 
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"), 
                 c = c("%#%$#^#", "%#$#%@", ",.,gdgd$%,."))

# Use across() from dplyr
library(dplyr)

df %>%
  mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))

  a           b    c
1 1        fefa     
2 2         fes     
3 3 gDEEwfseges gdgd
Rich Pauloo
butterflyeffect