2

I'm new to R, so please go easy on me.

I have a data frame like this:

    df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                    Confidence = c("ZLow", "High", "Med"),
                    Coverage = c("sub", "sub", "super"),
                    Aspect = c("ZPos", "ZUnd", "Neg"))

The actual file is much larger and was output by old hardware. For some reason, some entries have a "Z" put in front of them. How do I remove it from the entire dataset?

I tried df = gsub("Z", " ", df) but it just gives me nonsense. This darn thing!

[1] "1:3" "c(3, 1, 2)" "c(1, 1, 2)" "c(2, 3, 1)"

I looked here on Stack Overflow and tried the stringr package, but could not get that to work either. Does anyone know what to do?

smci
rockhound
  • Never call your df `data`, that shadows the builtin [`utils::data`()](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/data.html) – smci Apr 19 '18 at 22:53
  • In your regex you want '^Z', to only match leading 'Z', not inside the string – smci Apr 19 '18 at 22:56
  • huh, okay thanks I will be sure to learn for next time! – rockhound Apr 19 '18 at 23:05
  • Also I posted you a solution how to do it in stringr(/stringi) package, to avoid getting the unwanted vector of indices you got. They will be more performant than base calls. – smci Apr 19 '18 at 23:09
  • "This darn thing" <3 – jjj Mar 22 '23 at 13:42

5 Answers

3

Your approach with gsub() is not working because that function operates on vectors, and not dataframes. However, you can apply gsub() over each column of your dataframe to get what you want:

df[] <- lapply(df, function (x) {gsub("Z", "", x)})
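To see why the `[]` on the left-hand side matters, here is a minimal base R sketch. The data frame `toy` and the intermediate names `whole` and `as_list` are made up for illustration, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Toy data; names are illustrative only
toy <- data.frame(Mineral = c("Zfeldspar", "Zgranite"),
                  Confidence = c("ZLow", "High"),
                  stringsAsFactors = FALSE)

# gsub() on the whole data frame coerces it with as.character(),
# which collapses each column into a single deparsed string
whole <- gsub("Z", "", toy)
length(whole)   # one element per column, not one per cell

# lapply() on its own returns a plain list
as_list <- lapply(toy, function(x) gsub("Z", "", x))
class(as_list)  # "list"

# Assigning into toy[] replaces the columns but keeps the data frame
toy[] <- lapply(toy, function(x) gsub("Z", "", x))
class(toy)      # "data.frame"
```

The first call reproduces the kind of "nonsense" output shown in the question: one string per column instead of cleaned cells.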

For a stringr solution (that also uses dplyr), try:

library(tidyverse)

# funs() is deprecated in current dplyr; across() needs dplyr >= 1.0.0
df <- mutate(df,
             across(everything(), ~ str_replace_all(.x, "Z", "")))

P.S. I recommend using df <- instead of df = in the future. Good luck!

EDIT: corrected typo - thanks @thelatemail

Marcus Campbell
  • Thanks! Thank you for stringr solution too – rockhound Apr 19 '18 at 23:07
  • What is your reason of recommending <- instead of =? – Onyambu Apr 19 '18 at 23:22
  • Good question! To be perfectly fair, it's a matter of opinion, but in my experience I've found that `<-` often results in less confusion for complete beginners to programming. I think this post does a good job of covering the arguments for both: https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-in-r – Marcus Campbell Apr 19 '18 at 23:27
  • `data <- lapply(data, function (x) {gsub("Z", "", x)})` won't work as intended - you want `data[] <- ...` otherwise you will get a list as output. – thelatemail Apr 20 '18 at 00:21
  • That's what I get for not editing this one. Thanks @thelatemail – Marcus Campbell Apr 20 '18 at 00:36
1

You may use a simple ^Z regex in the following way:

df = data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                Confidence = c("ZLow", "High", "Med"),
                Coverage = c("sub", "sub", "super"),
                Aspect = c("ZPos", "ZUnd", "Neg"))
df[] <- lapply(df, sub, pattern = '^Z', replacement = "")
> df
   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg

The ^Z pattern matches the start of the string with the ^ anchor, and then Z is matched and removed using sub (as there is only one possible match in each string, there is no point in using gsub).
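The difference between the unanchored and anchored patterns shows up as soon as a value contains a legitimate Z. A small sketch, using made-up values that are not from the original data:

```r
# Made-up values: "ZZircon" stands for a real "Zircon" with the stray prefix
x <- c("Zfeldspar", "ZZircon", "High")

# Unanchored pattern with gsub(): every Z is removed
all_z <- gsub("Z", "", x)
all_z      # "feldspar" "ircon" "High"

# Anchored pattern with sub(): only one leading Z is removed
leading_z <- sub("^Z", "", x)
leading_z  # "feldspar" "Zircon" "High"
```

Which behaviour is right depends on the data; the anchored form is the safer default if genuine values may contain a Z.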

Wiktor Stribiżew
0

You are close. If you want to stick with base gsub(), apply it to one column at a time:

data$Mineral = gsub("Z", "", data$Mineral)

You can repeat this for every column, or use one of the apply strategies (see the other answers!).

P.S. Naming your data frame `data` is not a good idea. At least call it my_data.
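Repeating the column-wise call for every column can be sketched as a plain loop. This assumes a data frame called `my_data`, as suggested above; the numeric `Depth` column is hypothetical and is there only to show why non-character columns are skipped. The sketch uses the anchored pattern `"^Z"` rather than `"Z"` so that only a leading Z is dropped.

```r
# Hypothetical data; Depth is invented to show a non-character column
my_data <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                      Confidence = c("ZLow", "High", "Med"),
                      Depth = c(10.5, 22, 3),
                      stringsAsFactors = FALSE)

for (col in names(my_data)) {
  # Only touch character columns; sub() would turn numbers into strings
  if (is.character(my_data[[col]])) {
    my_data[[col]] <- sub("^Z", "", my_data[[col]])
  }
}

my_data
```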

Matias Andina
0

You could do:

as.data.frame(sapply(data, function(x) {gsub("Z", "", x)}))
Lennyy
0

You asked how to do it with the stringr (or stringi) package without getting the unwanted vector of indices you got:

> as.data.frame(apply(df, 2,
      function(col) stringr::str_replace_all(col, '^Z', '')))
> as.data.frame(apply(df, 2,
      function(col) stringi::stri_replace_first_regex(col, '^Z', '')))

   Mineral Confidence Coverage Aspect
1 feldspar        Low      sub    Pos
2  granite       High      sub    Und
3   Silica        Med    super    Neg

(The as.data.frame() call is needed to turn the matrix returned by apply() back into a data frame; see the question "R: apply-like function that returns a data frame?".)
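The matrix-then-convert step can be checked in isolation. This sketch uses base `sub()` in place of the stringr/stringi calls so that it runs without extra packages; the names `toy_df`, `m` and `out` are illustrative.

```r
toy_df <- data.frame(Mineral = c("Zfeldspar", "Zgranite", "ZSilica"),
                     Aspect = c("ZPos", "ZUnd", "Neg"),
                     stringsAsFactors = FALSE)

# apply() over columns (MARGIN = 2) returns a character matrix
m <- apply(toy_df, 2, function(col) sub("^Z", "", col))
is.matrix(m)        # TRUE

# Convert back; column names are carried over from the matrix
out <- as.data.frame(m, stringsAsFactors = FALSE)
is.data.frame(out)  # TRUE
```

Note that apply() first converts the data frame to a matrix, so mixed column types would all be coerced to character along the way.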

As to figuring out exactly how to call a str*_replace function over an entire data frame, I tried...

  • the entire df: stri_replace_first_regex(df, '^Z', '')
  • by rows: stri_replace_first_regex(df[1,], '^Z', '')
  • by columns: stri_replace_first_regex(df[,1], '^Z', '')

Only the last one works properly. (Note that the *_regex variant is needed here: the *_fixed variant would treat '^Z' as a literal two-character string.) Admittedly this is a design flaw in the str*_replace functions: they should at minimum recognize an invalid object and produce a useful error message, instead of spewing out indices.
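A check along those lines is easy to sketch in base R. The wrapper `strip_leading_z` below is hypothetical and not part of stringr or stringi; it simply refuses anything that is not a character vector or factor instead of silently coercing it.

```r
# Hypothetical helper, not a stringr/stringi function
strip_leading_z <- function(x) {
  if (is.factor(x)) x <- as.character(x)
  if (!is.character(x)) {
    stop("expected a character vector or factor, got an object of class '",
         class(x)[1], "'; use lapply() to process a data frame by column")
  }
  sub("^Z", "", x)
}

strip_leading_z(c("ZLow", "High"))   # "Low" "High"

# Passing a whole data frame now fails loudly instead of returning indices
msg <- tryCatch(strip_leading_z(data.frame(a = "Zx")),
                error = function(e) conditionMessage(e))
msg
```

With this in place, `df[] <- lapply(df, strip_leading_z)` handles a data frame whose columns are all character or factor.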

smci
  • wow thank you. did not know I could use function(col) inside apply. That's a nice trick! – rockhound Apr 19 '18 at 23:09
  • You can use any (package or user-defined) function inside apply/sapply/lapply. Here I passed in an anonymous function I defined which runs the relevant str* function on a column. I called the function argument `col` but you can call it whatever you want. – smci Apr 19 '18 at 23:11
  • As to figuring out exactly how to call a str*_replace function on a data frame, I tried the entire df: `stri_replace_first_regex(df, '^Z', '')`, by rows: `stri_replace_first_regex(df[1,], '^Z', '')`, and by columns: `stri_replace_first_regex(df[,1], '^Z', '')` – smci Apr 19 '18 at 23:13