Filtering rows based on "complex" strings in a column R/dplyr

Question

I am able to filter my dataset using the strings in a particular column, here's a sample dataset and how I did it.

ID = c(1, 2, 3, 4)
String = c("Y N No", "Y", "Y No", "Y N")
df = data.frame(ID, String)

The problem is - I want to only pick the IDs that have N in them - or - IDs that don't have N in them.

df_2 <- dpylr::filter(df, !grepl('N', String)) 

Output: [2]  [Y]

This will filter out the ID's with N, but it also removes ALL cases of N (including those that have 'No'. I'm new to R so I apologize if this is just me not understanding the syntax - but I cannot figure this out.

I could also try parsing out the string into individual columns, then selecting based on that - I need to do this anyway for later analysis. Below is the code that I use to achieve this.

df_2 <- df%>%mutate(String=gsub("\\b([A-Za-z]+)\\b","\\11",String),
          name=str_extract_all(String,"[A-Za-z]+"),
          value=str_extract_all(String,"\\d+"))%>%
unnest()%>%spread(name,value,fill=0)

This gives me

Output: 
ID<chr>  String<chr>   N<chr>  No <chr>   Y<chr>
1         Y1 N1 No1      1        1         1
2         Y1             0        0         1
3         Y1 No1         0        1         1
4         Y1 N1          1        0         1

This way I could just select my rows based on whether or not N is zero or one - however, R doesn't like when I do this and I do not understand why.

Thank you for any help you could offer.

EDIT: Here is a sample of my actual data. I might have over simplified in my question.

m/z             Column

241             C15 H22 O Na                
265             C15 H15 N5 
301             C16 H22 O4 Na 
335             C19 H20 O4 Na           
441             C26 H42 O4 Na

My goal is to filter out all of the N's in Column (They range from N, N1, N4, etc)

*"however, R doesn't like when I do this"* Can you be more specific. What "doesn't [R] like"? Is there an error? What error? I can reproduce your output `df_2` just fine -- without any errors. — Maurits Evers, Mar 01 '18 at 02:00
Does `df %>% filter(grepl("N\\s", String))` work? Here you match `"N"` plus a white-space. — Maurits Evers, Mar 01 '18 at 02:02
Possible duplicate of [Filtering row which contains a certain string using dplyr](https://stackoverflow.com/questions/22850026/filtering-row-which-contains-a-certain-string-using-dplyr) — Maurits Evers, Mar 01 '18 at 02:03
perhaps the patterns should be `"N(\\d|$)"` to allow "N" or " blah N" to be matched? — IRTFM, Mar 01 '18 at 03:04
Maurits - I think it's just because I do not understand the syntax. @Onyambu how would I do this to ignore N? Or multiple characters within the string? — Ragstock, Mar 01 '18 at 03:14
@Onyambu Hi - That indeed works for my sample set, but I might have over simplified in my question. See my edit for a small set of my real data - the strings are a bit more complicated and these functions won't work for it :/ — Ragstock, Mar 01 '18 at 03:26
@Onyambu I want to be able to either keep N(digits) or ignore N(digits) (No or Na doesn't have any relevance to me). — Ragstock, Mar 01 '18 at 03:43
@Ragstock To keep `"N"` plus `>0` digits: `df %>% filter(grepl("N\\d+", String))` — Maurits Evers, Mar 01 '18 at 03:50
@MauritsEvers How to make this work if there is no digit? Sorry when I said digits, I assume 0 is one as well - or the absence of it. — Ragstock, Mar 01 '18 at 03:53
@Ragstock Do you mean `df %>% filter(grepl("(N\\d+|N\\s)", String))`? This keeps entries that contain `"N"` plus *either* a digit *or* a white-space. Negate if you want to exclude. — Maurits Evers, Mar 01 '18 at 03:55
@MauritsEvers Wow okay this worked - amazing - I need to figure out this grepl function and that syntax. I can also do this with multiple characters ! Thank you - updating solution to original post. — Ragstock, Mar 01 '18 at 04:01
The syntax is called regular expression (often regexp or regex in short); [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html) is good starting point for reading:-) — Maurits Evers, Mar 01 '18 at 04:04
basically if you have something like this `x=c("N","Na","N1","2N","2Na","N2a")`and you don't need the N's what should be your expected results?? — Onyambu, Mar 01 '18 at 04:05
Don't add a solution as an edit to the question. Instead, post it as an answer. — Robert Columbia, Mar 01 '18 at 04:07

score 2 · Answer 1 · answered Mar 01 '18 at 04:16

ID = c(1, 2, 3, 4)
String = c("Y N No", "Y", "Y No", "Y N")
df = data.frame(ID, String)
df %>% filter(!grepl("(N\\d+|N\\s)", String))

Output: [Y] [Y No]

This answer by @MauritsEvers also works for the more complicated dataset in the second paragraph - where digits that may also come after N (like N2 or N10) will also be included in the argument. Remove "!" for including "N".

Amar · Answer 2 · 2018-03-01T04:21:29.293

0

I think your second approach is the way to go, especially if you going to split the columns for downstream analysis. It also (imo) meets "tidy" requirements. I also suggest standardising the String variable. Yes/Y, No/N are not acceptable.

The tidyr package has a two nice function for this separate and gather

library(dplyr)
library(tidyr)

ID = c(1, 2, 3, 4)
String = c("Y N No", "Y", "Y No", "Y N")
String <- gsub(pattern = "No", "N", String)
df = data.frame(ID, String)

#Separate the String var
df_sep <- separate(df, col = String, into = c("R1", "R2", "R3"), sep = " ", extra = "merge")
#gather the columns
df_gat <- gather(df_sep, Cols, StrValue, R1:R3, -ID) 
#filter
filter(df_gat, StrValue == "N" | StrValue != "N")

Here is my modified answer:

library(dplyr)
library(tidyr)
#Separate the String var
df_sep <- separate(df, col = Column, into = c("E1", "E2", "E3", "E4"), sep = " ", extra = "merge")
#gather the columns, long data format
gather(df_sep, Cols, Element, E1:E4, -m.z) %>% select(m.z, Element) -> df_gat
#filter
filter(df_gat, !grepl("^N$|N\\d", df_gat$Element))

It produces a long dataset that works well with the filter function. Your data previously was wide (kinda). I suggest changing the symbol of sodium to something else, you may run into trouble if Na (sodium) is converted to NA.

edited Mar 01 '18 at 04:21

answered Mar 01 '18 at 02:34

Amar

1,340
1
8
20

I'm sorry but my actual dataset is much more complicated than this one (N and No were supposed to denote different things - my bad!). Would I use the separate and gather function rather than the one I have in my second paragraph? I'm not sure I quite understand the output this gave me.. – Ragstock Mar 01 '18 at 03:19
Right... real data is AWLAYS helpful. First of all `Na` values are actual Not Available or does `Na` stand for something? Going to modify my answer to answer with new data – Amar Mar 01 '18 at 03:58
Na stands for Sodium - N stands for Nitrogen. I got knocked down for using real data (too complicated) in another post so I tried to simplify, but it looks like I went too far with it here. – Ragstock Mar 01 '18 at 04:04
Odd, I always prefer real data as long as it's simply to work with! It looks like you're working was mass spec data, so I can see what you're trying to do. – Amar Mar 01 '18 at 04:17

IRTFM · Answer 3 · 2018-03-01T04:27:49.847

0

You probably want to use sub to substitute "" for any pattern matching "N(\\d{1,3}|\\s|$)", meaning "N" followed by one of 1-3 digits or space or end of string.

I don't think you wnat to use filtering since as I understood the English description, you wanted to remove specific patterns from with character values. I was imagining that these were chemical symbols and that N was nitrogen and Na was sodium.

edited Mar 01 '18 at 04:27

answered Mar 01 '18 at 03:32

IRTFM

258,963
21
364
487

Not quite sure how to make this one work. Using the initial, simple example, I would do `df %>% sub("N(\\d{1,3}|\\s|$)", " ") ` ? – Ragstock Mar 01 '18 at 03:41
I'm not sure what the dply route would be, with base R just: `sub(patt="N(\\d{1,3}|\\s|$)", repl=" ", df$String) # [1] "Y No" "Y" "Y No" "Y "` – IRTFM Mar 01 '18 at 04:25

Filtering rows based on "complex" strings in a column R/dplyr

3 Answers3