Removing an element based on its content?

Question

df.cleaned <- df[-which(str_detect(df, "Not found")),]

"df" refers to a data frame that consists of multiple columns and rows. Many of its elements contain certain words.

What I'm looking to do is remove all values that contain the words "Not found", either as the whole element value or as part of it.

So far, the above command, which uses the stringr package, is what I've come up with. However, it seems to remove entire rows. I don't want to remove the entire row; I simply want to clear the specific element that contains "Not found".
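
A minimal sketch of why a command of this shape drops rows. It uses base `grepl` as a stand-in for `str_detect` (an assumption: `str_detect` may warn or behave differently on a data frame depending on the stringr version), and the data frame `demo` is hypothetical:

```r
# Hypothetical three-row data frame with one "Not found" cell in row 2.
demo <- data.frame(a = c("x", "Not found", "y"),
                   b = c("p", "q", "r"),
                   stringsAsFactors = FALSE)

# A data frame is a list of columns, so grepl() coerces each column to a
# single string and returns one logical per column, not one per cell.
per_column <- grepl("Not found", demo)

# which() turns that into column positions, which [-i, ] then treats as
# row numbers, so whole rows are dropped.
dropped <- demo[-which(per_column), ]
```

In this sketch `per_column` has one entry per column, and `dropped` loses row 1 while the "Not found" cell in the original row 2 is still present.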

Have a look at https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example to understand why a question needs to include a minimal reproducible example. – MKR, Mar 20 '18 at 21:07
Hello MKR. I'm not quite sure how to reproduce a sample on this website. It's basically a CSV file that I have imported to R through the readr package, and then the columns and rows are filled with certain words. I'm still new to R so I'm sorry if it's a bit lackluster. Let me know what I can do to improve! — WoeIs, Mar 20 '18 at 21:15
When you say *"remove values that contain the words"*, you do mean remove the whole row, don't you? Otherwise, you are very likely to remove more (and different) values from one column than from another. – r2evans, Mar 20 '18 at 21:17
If I understand your question correctly, you simply want to replace the string "Not found" with nothing. In that case you might consider something like: `df[["mycolumnname"]] <- gsub("Not found", "", df[["mycolumnname"]])`. – Manuel Bickel, Mar 20 '18 at 21:19
@ManuelBickel you can do that for the entire data.frame all together, as described in my answer. — De Novo, Mar 20 '18 at 21:21
@WoeIs learning to reproduce a sample on this website is a key part of becoming a good question asker :) Read the link in the first comment. A good place to start is using `dput()` on the minimal object that will reproduce your problem. People donate their time to solve your question. Make it easier for them. – De Novo, Mar 20 '18 at 21:24
@r2evans: I'm more accustomed to Excel so I will use that as an example. Say that you have a spreadsheet and different cells have "Not found" written in them, for instance cell G5. I want cell G5 to just have an empty value or cleared, not the entire row 5 deleted. That's kinda the thought behind it, but of course done in R. I'm sorry for the confusion, I'm still new to R so I'm not very articulate in describing R yet, but I'm learning. :) — WoeIs, Mar 20 '18 at 21:34
@Manuel Bickel: Yes that is correct. I have not run into gsub before, so I will have a look at that in the R documentation. However, I want the code to be applied to the entire data frame and not just one column, hence my problem! — WoeIs, Mar 20 '18 at 21:37
@Dan Hall: Thank you for the suggestion and also for the answer below! I'm trying to understand your code as I'm writing this. And you're absolutely right, I definitely should get better at explaining my problem. I have not seen the dput command yet, I will definitely look into it and see if I can use it for any future questions I have, thank you so much! :) — WoeIs, Mar 20 '18 at 21:38

De Novo · Answer 1 · 2018-03-20T22:34:31.707

How to get the behavior:

toy[toy == "Not found"] <- ""
toy
#    x y z  n
# 1  m   f  6
# 2  z t a  3
# 3    m    4
# 4    j    9
# 5  e      5
# 6  f n k  2
# 7  q f p  1
# 8      n  8
# 9  n k h  7
# 10 d u l 10

For matching vs. equality, you could try this. I'm not sure if it offers performance improvements over the @r2evans approach. EDIT: apparently, as @r2evans explains in the comments, the same conversion is done behind the scenes. In which case, it doesn't look as clean as the equality solution, but shouldn't drop in performance due to the conversion:

toy[matrix(grepl("Not found", as.matrix(toy)), nrow(toy))] <- ""
toy
#    x y z  n
# 1  m   f  6
# 2  z t a  3
# 3    m    4
# 4    j    9
# 5  e      5
# 6  f n k  2
# 7  q f p  1
# 8      n  8
# 9  n k h  7
# 10 d u l 10

Create the data:

toy <- data.frame(x = sample(letters, 10), y = sample(letters, 10), z = sample(letters, 10), stringsAsFactors = FALSE)
for (col in seq_along(toy)) toy[[col]][sample(10, 3)] <- "Not found"  
toy$n <- sample(10)
toy
#            x         y         z  n
# 1          m Not found         f  6
# 2          z         t         a  3
# 3  Not found         m Not found  4
# 4  Not found         j Not found  9
# 5          e Not found Not found  5
# 6          f         n         k  2
# 7          q         f         p  1
# 8  Not found Not found         n  8
# 9          n         k         h  7
# 10         d         u         l 10

That was surprisingly simple but good code! However, I noticed that it only removes the element value if the element has only "Not found" written in it. I should probably have been clearer, but if some elements have the value "Not found (12)" then it should remove those as well, since "Not found" is still part of the element. I'm assuming that I will have to use str_detect? – WoeIs, Mar 20 '18 at 21:58
BTW: this is not preserving the *"structure of the object"* as you've stated a few times here. In fact, when doing this subsetting, it is silently converting to a matrix internally. (Your recent edit does this explicitly, too.) — r2evans, Mar 20 '18 at 22:20
By *using* the structure, I don't mean preserving the structure. In general, when I have my wits about me, I try to avoid loops when the structure (i.e., length, dimensions) allows R functions to operate in one statement. This is an extension of doing `new <- x * y` for an `x` with length, instead of `for (i in seq_along(x)) new[i] <- x[i] * y` – De Novo, Mar 20 '18 at 22:24
@Dan Hall: You're right! Thank you for the explanation, I have a ton of reading to do thanks to all your answers! — WoeIs, Mar 20 '18 at 22:28
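
The difference between equality and matching raised in these comments can be sketched as follows. The data frame `toy2` is hypothetical, and the mask is built column by column with `sapply`, which is a swapped-in variant rather than the `as.matrix` form used in the answer:

```r
# Hypothetical frame: one exact "Not found", one partial "Not found (12)".
toy2 <- data.frame(x = c("a", "Not found", "Not found (12)"),
                   n = 1:3,
                   stringsAsFactors = FALSE)

# Equality only catches cells that are exactly "Not found".
by_equality <- toy2
by_equality[by_equality == "Not found"] <- ""

# Pattern matching: one logical column per data frame column, giving a
# logical matrix with the same dimensions as the data frame.
mask <- sapply(toy2, function(col) grepl("Not found", col, fixed = TRUE))
by_pattern <- toy2
by_pattern[mask] <- ""
```

With this input, `by_equality` still contains "Not found (12)", while `by_pattern` has both cells cleared and leaves the integer column `n` untouched.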

r2evans · Answer 2 · 2018-03-20T21:54:17.717

It's often advantageous to write a simple function up front that does what you want, and then know how to apply that function to all of your columns.

For instance:

# Return the modified vector; an assignment alone would return only newstr
replace_notfound <- function(s, newstr="") { s[grepl("Not found", s)] <- newstr; s }

Now, let's apply that function to each column of your data:

# I'm assuming you want stringsAsFactors=FALSE
df.cleaned <- as.data.frame(lapply(df, replace_notfound), stringsAsFactors=FALSE)

It's not always the case that all columns of a frame are character, so you might want to conditionally do this:

ischr <- sapply(df, is.character)
df.cleaned <- df # just a copy
df.cleaned[ischr] <- lapply(df.cleaned[ischr], replace_notfound)
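
A self-contained sketch of the conditional approach on a mixed-type frame. The data frame and its column names are hypothetical, and the helper is restated here with an explicit return value so the example runs on its own:

```r
# Helper: blank out elements containing "Not found" and return the vector.
replace_notfound <- function(s, newstr = "") {
  s[grepl("Not found", s)] <- newstr
  s
}

# Hypothetical frame with one integer and one character column.
df <- data.frame(id = 1:3,
                 status = c("ok", "Not found", "Not found (12)"),
                 stringsAsFactors = FALSE)

# Apply the helper to character columns only; other columns are not touched.
ischr <- sapply(df, is.character)
df.cleaned <- df
df.cleaned[ischr] <- lapply(df.cleaned[ischr], replace_notfound)
```

Restricting the replacement to `ischr` means the `id` column keeps its integer type, while both "Not found" variants in `status` are cleared.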

edited Mar 20 '18 at 21:54

answered Mar 20 '18 at 21:15


Thank you a lot for the advice. You're definitely correct about finding out a simple function first! I have heard of the "grep" function but I'm not quite sure of the "grepl" function yet. From what I can read, grepl returns a TRUE/FALSE statement to us, which in this case is useful so values that are "Not found" are TRUE in our logic vector, and then we work only on the elements with the TRUE value, is that correctly understood? Regarding the columns not being characters, would this command not suffice? df <- as.character(df) – WoeIs Mar 20 '18 at 21:40
`grep` returns zero or more integers, `grepl` returns as many logicals as there are values. One advantage (in many situations) of `grepl` over `grep` is that as long as the data has length, the output of `grepl` has length too (which happens to be the same length); I often see `grep` used in situations that do not robustly handle zero-length returns. (Oops, I just fixed `!` logic in my answer from previous edit.) – r2evans Mar 20 '18 at 21:54
I like the idea of writing a function as `replace_notfound` but in real world there will be too many such functions, I hope. – MKR Mar 20 '18 at 21:58
@DanHall, your answer works well (simpler/faster) when equality is sufficient, but unfortunately the OP is looking for more. In my answer, I hope to introduce a little "method" to the madness, too, where small functions and their purpose are instantly clear and easily applied to simple data structures (like vectors). For instance, it's not hard to extend this trivial model (`ischr`, `df[ischr] <- lapply(df[ischr],...)`) to do different "somethings" to multiple columns based on different criteria. – r2evans Mar 20 '18 at 22:14
@r2evans we can still use the structure of the object to assign (instead of looping), but it's at the performance cost of converting the index object to a matrix. I'm curious.. might do a benchmark to see which is better. (see my edit above) – De Novo Mar 20 '18 at 22:22
@r2evans: Thank you for the explanation. Do you happen to have any links that shows some good examples or tutorials of grep and grepl? For my own understanding :) – WoeIs Mar 20 '18 at 22:26
Perhaps the easiest: [`?grep`](http://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html) (it has working examples at the end). A quick web-search of "R grep tutorial" will return several, I'm sure, though I don't know them personally (so I won't suggest them blindly). BTW: `grepl` and `stringr::str_detect` are fairly identical in their behavior and output. – r2evans Mar 20 '18 at 22:29
@r2evans: I never noticed the examples at the end of the ?grep command. I will have a look at them, thank you! – WoeIs Mar 20 '18 at 22:34
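
The zero-length pitfall with `grep` mentioned in the comments above can be illustrated with a short sketch (the vector `x` is hypothetical and deliberately contains no match):

```r
x <- c("Good", "Fine", "OK")         # hypothetical vector with no matches

hits_int <- grep("Not found", x)     # zero matches, so a zero-length integer
hits_lgl <- grepl("Not found", x)    # one logical per element of x

# Negative indexing with an empty integer vector selects nothing at all,
# so every element is lost instead of every element being kept.
kept_wrong <- x[-hits_int]

# Logical negation keeps all three elements, as intended.
kept_right <- x[!hits_lgl]
```

The same caveat applies to `-which(...)`: when nothing matches, the index is empty and the result is empty too.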

MKR · Accepted Answer · 2018-03-20T21:45:50.647

Your thinking was in the right direction; you just need to apply the check to each item. One option is to use sapply: check every item with str_detect and replace it with "" (or NA) if it matches, otherwise return the item unchanged.

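library(stringr)
# Check every cell; blank out those containing "Not found", keep the rest
df.clean <- as.data.frame(sapply(df,
                   function(x) ifelse(str_detect(x, "Not found"), "", x)),
                   stringsAsFactors = FALSE)
df.clean
#    A    B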
# 1  A Good
# 2  B     
# 3  C Good
# 4  D     
# 5  E Good
# 6  A     
# 7  B Good
# 8  C     
# 9  D Good
# 10 E

Data

df <- data.frame(A = rep(c("A", "B", "C", "D", "E"), 2),
                 B = rep(c("Good", "Bad with Not found"), 5),
                 stringsAsFactors = FALSE)
df
#    A                  B
# 1  A               Good
# 2  B Bad with Not found
# 3  C               Good
# 4  D Bad with Not found
# 5  E               Good
# 6  A Bad with Not found
# 7  B               Good
# 8  C Bad with Not found
# 9  D               Good
# 10 E Bad with Not found

edited Mar 20 '18 at 21:45

answered Mar 20 '18 at 21:39


Why loop through it when you can just use the structure of the object? – De Novo Mar 20 '18 at 21:53
@DanHall OP's requirement is not "equality" (`==`) rather he has written as: _contain the words "Not found" either as the whole element value, or part of it._ – MKR Mar 20 '18 at 21:57
@DanHall Maybe you should try your solution on the data.frame used in my answer. – MKR Mar 20 '18 at 22:00
You might prefer `sapply(..., simplify=FALSE)` or just `lapply`; the use of `as.data.frame` works here, but is silently up-converting any non-`character` into a string. – r2evans Mar 20 '18 at 22:05
@r2evans Valid points. Since my sample data.frame was based only on `character` columns, I preferred `sapply` :-) – MKR Mar 20 '18 at 22:09
I see :) Yes, it's a shame there is no `=~` operator that can be used on a data frame. Without it you have to convert in and out of a matrix to use the structure of the object, and the performance benefit of using the structure of the object is lost. – De Novo Mar 20 '18 at 22:10
@MKR: Thanks for your answer! Sorry for the late response, I'm trying to keep up with everybody while testing all the ideas and trying to understand the different parts of the codes! May I ask why you use sapply and not lapply in this case? – WoeIs Mar 20 '18 at 22:18
@WoeIs No worries!! `sapply` is a kind of wrapper around `lapply` that returns a matrix or a vector. Since my sample data.frame had only `character` columns I could use `sapply`; otherwise `lapply` would have been preferred. I tried to build on the approach you had started, using `str_detect`. – MKR Mar 20 '18 at 22:21
With `data.frame`s, I think the use of `sapply` or `lapply` should be influenced by the type of output intended. For instance, if the output is going to be another frame, then `lapply` is typically more appropriate. If the output is going to be a vector of one value per column (as in `ischr` in my answer), then `sapply` works great. If the output is intended to be a homogenous `matrix`, then `sapply` might work. However, if at any point you suspect that the lengths of each return without `sapply` might be different, then `sapply` no longer returns a `matrix`, which can be quite confusing. – r2evans Mar 20 '18 at 22:27
@MKR: Thanks for the explanation! I noticed too that when I tried using the sapply, my output would be a horrible mess since some of the column structures in my data frame weren't characters, but once I changed all columns to a character structure, everything went smoothly. I will keep the differences in mind, thank you! :) – WoeIs Mar 20 '18 at 22:32
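
The `sapply` versus `lapply` behaviour discussed in these comments can be sketched on a mixed-type frame. The data frame and the helper `clean` are hypothetical, and base `grepl` is used in place of `str_detect` so the sketch has no package dependency:

```r
# Hypothetical frame with one integer and one character column.
df <- data.frame(id = 1:3,
                 status = c("ok", "Not found", "ok"),
                 stringsAsFactors = FALSE)

# Blank out matching cells, keep everything else as it is.
clean <- function(x) ifelse(grepl("Not found", x), "", x)

# sapply() simplifies the per-column results into one matrix, and a matrix
# has a single type, so the integer column is converted to character.
via_sapply  <- sapply(df, clean)
from_sapply <- as.data.frame(via_sapply, stringsAsFactors = FALSE)

# lapply() returns a list with one element per column, so each column
# keeps its own type when the list is turned back into a data frame.
from_lapply <- as.data.frame(lapply(df, clean), stringsAsFactors = FALSE)
```

Here `from_sapply$id` ends up as character while `from_lapply$id` stays integer, which is the up-conversion described above.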
