How to use regex over entire dataframe in R

Question

new user to R so please go easy on me.

I have dataframe like:

df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))

The actual file is much larger and was output by old hardware. For some reason, some entries have a "Z" put in front of them. How do I remove it from the entire dataset?

I tried df = gsub("Z", " ", df) but it just gives me nonsense. This darn thing!

[1] "1:3" "c(3, 1, 2)" "c(1, 1, 2)" "c(2, 3, 1)"

I looked here on Stack Overflow and tried the stringr package, but could not get that to work either. Anyone know what to do?

Never call your df `data`, that shadows the builtin [`utils::data()`](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/data.html) – smci, Apr 19 '18 at 22:53
In your regex you want '^Z', to match only a leading 'Z', not one inside the string – smci, Apr 19 '18 at 22:56
Also, I posted a solution showing how to do it with the stringr(/stringi) packages, to avoid getting the unwanted vector of indices you got. They will be more performant than base calls. – smci, Apr 19 '18 at 23:09

Marcus Campbell · Accepted Answer · 2018-04-20T00:35:31.510

3

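Your approach with gsub() is not working because that function operates on character vectors, not data frames. When handed a data frame, gsub() coerces it with as.character(), which turns each whole column into a single deparsed string; your columns are factors, so what you see are their integer codes (hence "1:3" and "c(3, 1, 2)"). However, you can apply gsub() over each column of your data frame to get what you want: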

df[] <- lapply(df, function (x) {gsub("Z", "", x)})
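To make the difference concrete, here is a minimal sketch comparing the three calls discussed in this thread. It reuses the question's example data; `stringsAsFactors = FALSE` and the names `flattened`, `as_list` and `cleaned` are assumptions added for illustration.

```r
# Same example data as in the question. stringsAsFactors = FALSE is an
# assumption made here so the columns are plain character vectors.
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                 Confidence = c("ZLow", "High", "Med"),
                 Coverage = c("sub", "sub", "super"),
                 Aspect = c("ZPos", "ZUnd", "Neg"),
                 stringsAsFactors = FALSE)

# gsub() on the whole data frame: each column is collapsed into one string,
# so the result is a character vector with one element per column.
flattened <- gsub("Z", "", df)
length(flattened)         # 4
is.data.frame(flattened)  # FALSE

# Without the brackets, lapply() returns a plain list.
as_list <- lapply(df, function(x) gsub("Z", "", x))
is.data.frame(as_list)    # FALSE

# With [] on the left-hand side, the columns are replaced in place and the
# object stays a data frame.
cleaned <- df
cleaned[] <- lapply(df, function(x) gsub("Z", "", x))
is.data.frame(cleaned)    # TRUE
```

The empty brackets are what keep the data frame structure: the assignment fills the existing columns instead of rebinding the name to a list.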

For a stringr solution (that also uses dplyr), try:

library(tidyverse)

df <- mutate_all(df,
                   funs(str_replace_all(., "Z", "")))
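In more recent dplyr releases, funs() is deprecated (since 0.8.0) and mutate_all() is superseded by across() (since 1.0.0). A rough equivalent, assuming dplyr 1.0.0 or later and stringr are installed, might look like the sketch below. The anchored pattern "^Z", the `stringsAsFactors = FALSE` argument and the name `df_clean` are choices made for this illustration, not part of the original answer.

```r
library(dplyr)
library(stringr)

# Example data from the question, with character columns (an assumption).
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                 Confidence = c("ZLow", "High", "Med"),
                 Coverage = c("sub", "sub", "super"),
                 Aspect = c("ZPos", "ZUnd", "Neg"),
                 stringsAsFactors = FALSE)

# across(everything(), ...) applies the lambda to every column;
# str_replace() removes only the first match, here a leading "Z".
df_clean <- df %>%
  mutate(across(everything(), ~ str_replace(.x, "^Z", "")))
```

If the real data also contains numeric columns, selecting only the character columns would avoid converting numbers to text.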

P.S. I recommend using df <- instead of df = in the future. Good luck!

EDIT: corrected typo - thanks @thelatemail

edited Apr 20 '18 at 00:35

answered Apr 19 '18 at 22:54

Marcus Campbell


Thanks! Thank you for stringr solution too – rockhound Apr 19 '18 at 23:07
What is your reason of recommending <- instead of =? – Onyambu Apr 19 '18 at 23:22
Good question! To be perfectly fair, it's a matter of opinion, but in my experience I've found that `<-` often results in less confusion for complete beginners to programming. I think this post does a good job of covering the arguments for both: https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-in-r – Marcus Campbell Apr 19 '18 at 23:27
`data <- lapply(data, function (x) {gsub("Z", "", x)})` won't work as intended - you want `data[] <- ...` otherwise you will get a list as output. – thelatemail Apr 20 '18 at 00:21
That's what I get for not editing this one. Thanks @thelatemail – Marcus Campbell Apr 20 '18 at 00:36

Wiktor Stribiżew · Answer 2 · 2018-04-19T23:03:25.127

1

You may use a simple ^Z regex in the following way:

df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))
df[] <- lapply(df, sub, pattern = '^Z', replacement = "")
> df
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg

The ^Z pattern matches the start of the string with the ^ anchor, and then Z is matched and removed using sub (as there is only one possible match in each string, there is no point in using gsub).
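The difference between the anchored and unanchored patterns only shows up when a value contains a "Z" somewhere other than the start. A small sketch with a made-up vector (the value "TopaZ" is hypothetical, chosen just to show the effect):

```r
# Hypothetical values: "TopaZ" has a Z at the end rather than at the start.
x <- c("ZLow", "High", "TopaZ")

# Anchored: only a Z at the very start of each string is removed.
anchored <- sub("^Z", "", x)    # "Low" "High" "TopaZ"

# Unanchored gsub(): every Z is removed, wherever it occurs.
unanchored <- gsub("Z", "", x)  # "Low" "High" "Topa"
```

If the stray character only ever appears as a prefix, the anchored form is the safer choice because it leaves legitimate capital Zs elsewhere in the value alone.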

edited Apr 19 '18 at 23:03

answered Apr 19 '18 at 22:52

Wiktor Stribiżew


what is a ^7 regex? – rockhound Apr 19 '18 at 23:05
did not know I could just use sub...this is very interesting thank you – rockhound Apr 19 '18 at 23:06
@rockhound See [**this regex demo**](https://regex101.com/r/dpVnb5/1); it also gives more technical information about the pattern. – Wiktor Stribiżew Apr 19 '18 at 23:09

score 0 · Answer 3 · answered Apr 19 '18 at 22:52

0

You are close. If you want to go with base gsub

data$Mineral = gsub("Z", "", data$Mineral)

You can do this for all columns. Or use a combination of apply strategies (see other answers!)
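Doing this for all columns one at a time can be written as a plain loop. The sketch below is one possible version; the numeric `Depth` column is a hypothetical addition to show why it may be worth touching only character columns, since sub() and gsub() always return character vectors.

```r
# Example data; Depth is a hypothetical numeric column added for illustration.
my_data <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                      Confidence = c("ZLow", "High", "Med"),
                      Depth = c(10.5, 20, 31.2),
                      stringsAsFactors = FALSE)

# Loop over column names and clean only the character columns, so numeric
# columns are not silently converted to text.
for (col in names(my_data)) {
  if (is.character(my_data[[col]])) {
    my_data[[col]] <- sub("^Z", "", my_data[[col]])
  }
}
```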

PS. Naming your data data is not a good idea. At least do my_data

answered Apr 19 '18 at 22:52

Matias Andina


score 0 · Answer 4 · answered Apr 19 '18 at 22:53

0

You could do:

as.data.frame(sapply(data, function(x) {gsub("Z", "", x)}))
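As a sketch of what happens in that call: sapply() simplifies the per-column results into a character matrix, which is why as.data.frame() is wrapped around it. The variable names and the `stringsAsFactors = FALSE` arguments below are assumptions for illustration; on R versions before 4.0, leaving that argument out of as.data.frame() would turn the columns into factors.

```r
my_data <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                      Confidence = c("ZLow", "High", "Med"),
                      stringsAsFactors = FALSE)

# sapply() returns a character matrix with one column per data frame column.
m <- sapply(my_data, function(x) gsub("Z", "", x))
is.matrix(m)  # TRUE
dim(m)        # 3 2

# Convert the matrix back into a data frame with character columns.
result <- as.data.frame(m, stringsAsFactors = FALSE)
```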

answered Apr 19 '18 at 22:53

Lennyy


smci · Answer 5 · 2018-04-19T23:15:01.610

0

You asked how to do it in stringr(/stringi) package, to avoid getting the unwanted vector of indices you got:

as.data.frame(apply(df, 2,
      function(col) stringr::str_replace_all(col, '^Z', '')))
as.data.frame(apply(df, 2,
      function(col) stringi::stri_replace_first_regex(col, '^Z', '')))

   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg

(The as.data.frame() call is needed to turn the matrix returned by apply() back into a data frame; see "R: apply-like function that returns a data frame?".)

As to figuring out how exactly to call a str*_replace function over an entire dataframe, I tried...

the entire df: stri_replace_first_regex(df, '^Z', '')
by rows: stri_replace_first_regex(df[1,], '^Z', '')
by columns: stri_replace_first_regex(df[,1], '^Z', '')

Only the last one works properly, because the function expects a character vector rather than a data frame. Arguably this is a design flaw in the str*_replace functions: at a minimum they should recognize an invalid object and produce a useful error message instead of spewing out indices.
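Since the column-wise call is the one that behaves, one way to cover the whole data frame is to run it over each column with lapply() and assign back into `df[]`. This is only a sketch, assuming the stringi package is installed and that all columns are character vectors (hence the added `stringsAsFactors = FALSE`).

```r
# Example data from the question, with character columns (an assumption).
df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                 Confidence = c("ZLow", "High", "Med"),
                 Coverage = c("sub", "sub", "super"),
                 Aspect = c("ZPos", "ZUnd", "Neg"),
                 stringsAsFactors = FALSE)

# Each column is passed as the first argument (str); pattern and replacement
# are supplied by name. Assigning into df[] keeps the data frame structure.
df[] <- lapply(df, stringi::stri_replace_first_regex,
               pattern = "^Z", replacement = "")
```

Unlike the apply() version, this does not go through a matrix, so no as.data.frame() call is needed afterwards.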

edited Apr 19 '18 at 23:15

answered Apr 19 '18 at 23:07

smci


wow thank you. did not know I could use function(col) inside apply. That's a nice trick! – rockhound Apr 19 '18 at 23:09
You can use any (package or user-defined) function inside apply/sapply/lapply. Here I passed in an anonymous function I defined which runs the relevant str* function on a column. I called the function argument `col` but you can call it whatever you want. – smci Apr 19 '18 at 23:11
As to figuring out how exactly to call a str*_replace function on a dataframe, I tried the entire df: `stri_replace_first_regex(df, '^Z', '')`, by rows: `stri_replace_first_regex(df[1,], '^Z', '')` and by columns: `stri_replace_first_regex(df[,1], '^Z', '')` – smci Apr 19 '18 at 23:13
